Tracing when app waits on disk I/O, locks or similar.
- I/O stack
,---,.____________,-----,.___________,--------,._____Raw_I/O_________,----------,
|App|:-[Runtime]:-|POSIX|:-[SysLibs]-|SysCalls|:-[VFS]-[FS]-[VolMgr]-|DiskDevice|
'---' '-----' '--------' Logical I/O~~~~~~~ |SubSystem |
'----------'
* many I/O gets served from FS cache, never makes phys.I/O
* VFS is generic kernel interface allowing different FS supported via SysCalls
- FS Caches
[Page Cache]<---->[ VFS ]<-,--->[Dir Cache]
[Page Scanner] [ext4][xfs][..] '--->[inode Cache]
|, |, |,
[Block Device Interface]<--->[Disks]
* Page Cache grows to be largest as caches file object, also includes 'dirty' pages
-
Read Aheaddetects a sequential read, predicts next and load into page cache.readaheadsyscall. -
Write Backflush dirtied buffers to disk by kernel worker threads, not blocking app on slow disk. -
BPF observability for FS requests (count by type); FS read sizes; Sync I/O write volume; File access workload; File causing issues; FS latency distribution; Page Cache (,Dcache & Icache) hit/miss; more.
-
Instrumentable I/O types:
- App & Lib I/O:
uprobes- SysCall I/O:
syscalls tracepoints- FS I/O:
ext4 (..) tracepoints, kprobes- Cache hit (reads, write-back) & miss (reads, write-through):
kprobes- Page Cache write-back:
writeback tracepoints- Physical Disk I/O:
block tracepoints,kprobes. Raw I/O:kprobes.
VFSis used heavily by many n/w I/O paths, irresponsible tracing can cause Overload. For Physical I/O tracing, check activity viaiostatto gauge overload.
-
Check mounted FS,
dformount. Check capacity. Try create some FS workload,fio. Useopensnoopto see files accessed &filelifefor short-lived files related issues. -
Use
fileslower(orext4slower,btrfsslower,zfsslower) to examine proc & file details for slow I/O. -
Check FS latency distribution;
ext4dist(orbtrfsdist,zfsdist, etc). -
cachestatto check page cache hit ratio over time, if any tuning improves it. -
Comparing
vfsstatlogical I/O rates to physical I.O rates fromiostat. Grater logical rates indicate effective caching.
-
dfshowing filesystems, mount points with their capcity.mountshowing mount flags as well. -
stracegrabs FS ops from syscall traces. E.g.strace cksum -tttT /usr/bin/cksumuses-tttto print wall timestamps &-Tfor syscall duration.stracemay slow system down. -
perfexamples:perf trace cksum /usr/bin/cksumuses kprobes to inspect VFS, more efficiently.perf stat -e 'ext4:*' -acounting all ext4 calls (depends on what tracepoints are supported by FS). -
fatracespecial tracer for FANotify (file access); showing path, event, PID, proc name & status. Overload for busy systems.opensnoopis efficient alternative.
,--------------------------,
| App | (tools from BCC & bpftrace)
|--------------------------|
| SysCall Interface |<--opensnoop, statssnoop, syncsnoop, mmapfiles
|----------,---------------| scread
| Rest of | [VFS ]-|<--filetop, filelife, fileslower, filetype, fmapfault
| | | vfssize, vfscount, vfsstat, writesync, fsrwstat
| | [FileSystems]-|<--mountsnoop, writeback, ext4dist, xfsslower,..
| Kernel | [VolManager ]-|
| | [BlockDevice]-| also bufgrow, dcsnoop, dcstat, cachestat
|----------:---------------|
| Device Drivers |
'--------------------------'
- BCC's
opensnoop -Ttraces file opens with-Tgiving timestamps.-xshows failure only &-p $PIDfor proc specific. Code for similar
#!bpftrace
BEGIN {
printf("Tracing open syscalls... Hit Ctrl-C to end.\n");
printf("%-6s %-16s %4s %3s %s\n", "PID", "COMM", "FD", "ERR", "PATH");
}
tracepoint:syscalls:sys_enter_open, tracepoint:syscalls:sys_enter_openat {
@filename[tid] = args->filename;
}
tracepoint:syscalls:sys_exit_open, tracepoint:syscalls:sys_exit_openat
/@filename[tid]/ {
$ret = args->ret;
$fd = $ret > 0 ? $ret : -1;
$errno = $ret > 0 ? 0 : - $ret;
printf("%-6d %-16s %4d %3d %s\n", pid, comm, $fd, $errno,
str(@filename[tid]));
delete(@filename[tid]);
}
END {
clear(@filename);
}
-
statsnoopshows file statistics forsyscall.stat. Supports-x&-p $PID. Using tracepointstracepoint:syscalls:sys_enter_statfs,tracepoint:syscalls:sys_enter_statx,tracepoint:syscalls:sys_enter_newstat&tracepoint:syscalls:sys_enter_newlstat. -
syncsnooptracessyncfor flush of dirty data to disk. Code for similar
#!bpftrace
BEGIN {
printf("Tracing sync syscalls... Hit Ctrl-C to end.\n");
printf("%-9s %-6s %-16s %s\n", "TIME", "PID", "COMM", "EVENT");
}
tracepoint:syscalls:sys_enter_sync, tracepoint:syscalls:sys_enter_syncfs,
tracepoint:syscalls:sys_enter_fsync, tracepoint:syscalls:sys_enter_fdatasync,
tracepoint:syscalls:sys_enter_sync_file_range, tracepoint:syscalls:sys_enter_msync {
time("%H:%M:%S ");
printf("%-6d %-16s %s\n", pid, comm, probe);
}
mmapfilestracesmmap. Code similar to
#!bpftrace
#include <linux/mm.h>
kprobe:do_mmap {
$file = (struct file *)arg0;
$name = $file->f_path.dentry;
$dir1 = $name->d_parent;
$dir2 = $dir1->d_parent;
@[str($dir2->d_name.name), str($dir1->d_name.name),
str($name->d_name.name)] = count();
}
-
screadtraces syscallread(a syscall based view,filetopis VFS based view). -
fmapfaulttrace mmap files' page faults. Syscall's r/w variant based tools escaped by file mappings. Similar code
#!bpftrace
#include <linux/mm.h>
kprobe:filemap_fault {
$vf = (struct vm_fault *)arg0;
$file = $vf->vma->vm_file->f_path.dentry->d_name.name;
@[comm, str($file)] = count();
}
-
filelifeshows AGE of files, allowing check short-lived files. Supports-p $PID. Similar code -
fsrwstatcustomizevfstatto include FS type. -
fileslowertraces file r/w slower than threshold.filetopfor frequently r/w filenames. -
cachestatfor hit/miss ratio; check if tunings are working. -
dcstatshows dir entry cachedcachehit/miss stats.dcsnoopfor tracing dcache lookups. -
icstatfor Inode cache hit.miss stats.bufgrowfor insight into buffer cache.readaheadfor efficiency.
-
stackcount -p $PID 't:syscalls:sys_enter_openat'counting stacks for file open tracepoint. -
File created via
trace 't:syscalls:sys_enter_creat "%s", args->pathname'. -
Count read
funccount 't:syscalls:sys_enter_*read*'. Write withfunccount 't:syscalls:sys_enter_*write*'. -
Read errors count
argdist -C 't:syscalls:sys_exit_read():int:args->ret:args->ret<0'. -
Reads to device via
stackcount -P read_pages. Ext4'sstackcount -P ext4_readpages.