tracer built on BPF & BCC; ideal for ad-hoc instrumentation with custom one-liners
reference guide, kernel analysis with bpftrace, cheatsheet, one-liner tuts
-
contains doc, man-pages, examples; included bpftrace tools have
.btextension -
frontend uses lex & yacc to parse bpftrace, Clang for parsing structure; backend compiles to LLVM IR then BPF by LLVM libs
-
Dynamic instrumentation: kernel-level
kprobeand user-leveluprobe -
Static instrumentation: kernel-level
tracepoint. softwareand user-levelusdt via libbcc -
Timed sampling events
profile; Interval eventsinterval -
PMC events
hardware; Synthetic eventsBEGIN, END
-
Filtering
predicates; Per-event outputprintf() -
Basic Variables
global, $local, per[tid]; Built-in Variablespid, tid, comm, nsecs, .. -
Associative Arrays
key[val]; Frequency Countingcount(), ++; Statisticsmin(), max(), sum(), avg(), stats() -
Histograms
hist(), lhist(); Timestamps & deltasnsecs and hash storage -
Stack traces
kstackfor kernel &ustackfor userspace -
Symbol resolution
ksym(), kaddr()for kernel &usym(), uaddr()for userspace -
C struct navigation
->; Array access[]; Shell commandssystem() -
Printing file
cat(); Positional params$1, $2, ..
-
perfhas verbose language; supports efficient event dumping in binary format; in-kernel supparies limited to counts -
Ftracehas fewer dependencies; has instrumentation mode as function counts -
Systemtapadds its own kernel modules, proven unreliable on systems other than RHEL -
LTTnghas optimized event dumping & analysis tools of dump -
Application toolslimited to user-level visibility; like JVM profiler & MySQL DB profiler
-
CPU: execsnoop, runqlat, runqlen, cpuwalk, offcputime
-
Memory: oomkill, failts, vmscan, swapin
-
File Systems: vfsstat, filelife, xfsdist
-
Storage I/O: biosnoop, biolatency, bitesize, biostacks, scsilatency, nvmelatency
-
Networking: tcpaccept, tcpconnect, tcpdrop, tcpretrans, gethostlatency
-
Security: ttysnoop, elfsnoop, setuids
-
Languages: jnistacks, javacalls
-
Applications: threadsnoop, pmheld, naptime, mysqld_qslower
-
Kernel: mlock, mheld, kmem, kpages, workq
-
Containers: pidnss, blkthrot
-
Hypervisors: xenhyper, cpustolen, kvmexits
-
Debugging/multi-purpose: execsnoop, threadsnoop, opensnoop, killsnoop, signals
-
list probes:
sudo bpftrace -l 'tracepoint:syscalls:sys_enter_*' -
whose executing what
% sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s %s\n", comm, str(args->argv)); }'
Attaching 1 probe...
vim /home/username/bin/go-1.16.7/bin/sh
...
- new processes with args
% sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve { join(args->argv); }'
Attaching 1 probe...
git ls-files --error-unmatch --full-name -z -- chapter-05.md
/bin/bash -c /usr/bin/zsh
...
-
files opened:
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }' -
count syscalls by program:
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' -
count syscalls by process:
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[pid, comm] = count(); }' -
count syscalls by probe name:
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }' -
total
read()bytes by process:sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret/ { @[comm] = count(); }' -
distribution of
read()bytes by process:sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read { @bytes = hist(args->ret); }'
filtered by PID 6763 would be
% sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /pid == 6763/ { @bytes = hist(args->ret); }'
Attaching 1 probe...
^C
@bytes:
[0] 5 |@ |
[1] 4 | |
...
- Kernel dynamic-tracing of read bytes
sudo bpftrace -e 'kretprobe:vfs_read { @bytes = lhist(retval, 0, 2000, 200); }'
above
lhist()takesvalue, min, max, stepas params for histogram
- timing
read()in nanoseconds summarized by process name
% sudo bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; } kretprobe:vfs_read /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
^C
@ns[systemd-journal]:
[4K, 8K) 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
@ns[libinput-connec]:
[1K, 2K) 2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K) 0 | |
[4K, 8K) 0 | |
[8K, 16K) 2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
...
-
histogram of block I/O requests by size in bytes:
sudo bpftrace -e 'tracepoint:block:block_rq_issue { @ = hist(args->bytes); }' -
trace disk I/O size by process
% sudo bpftrace -e 'tracepoint:block:block_rq_issue { printf("%d %s %s\n", pid, comm, str(args->bytes)); }'
Attaching 1 probe...
6489 kworker/u16:1
6489 kworker/u16:1
419 NetworkManager
-
count pages paged in by process:
sudo bpftrace -e 'software:major-faults:1 { @[comm] = count(); }' -
count page faults by process:
sudo bpftrace -e 'software:faults:1 { @[comm] = count(); }' -
profile user-level stacks at 99Hz:
sudo bpftrace -e 'profile:hz:45 { @[ustack] = count(); }'
profile FOR 1second:
sudo bpftrace -e 'profile:hz:45 { @[ustack] = count(); } interval:s:1 { exit(); }'profile FOR pid 6763:
sudo bpftrace -e 'profile:hz:45 /pid == 6763/ { @[ustack] = count(); }'
-
count process level events for 5sec
sudo bpftrace -e 'tracepoint:sched:sched* { @[probe] = count(); } interval:s:5 { exit(); }' -
count stacktraces leading to context-switch (off-CPU) events:
sudo bpftrace -e 'tracepoint:sched:sched_switch { @[kstack] = count(); }'
% sudo bpftrace chapter-05-kernel-struct-tracing.bt
Attaching 1 probe...
open path: meminfo
open path: statm
...
can be run with
bpftrace; or just make executable if shebang is present
-
series of probes with associated actions and optional filters as
probes { actions },probe /filter/ { actions } -
or just
/pattern/ { actions } -
comments are C-like
//&/* ... */ -
C struct navigation
->; C structs can be declared in some cases -
probe format
type:identifier[:identifier[...]]; multi-purpose probes can be specified with comma-separators to execute same actions -
probe wildcards like
kprobe:vfs_*; instrumenting too many probes may cost bad overhead so managed byBPFTRACE_MAX_PROBESenv var default:512 -
filter like equal
/pid == 1/; or just/pid/to ensure it's non-zero; boolean operators are allowed as/pid > 100 && pid < 1000/ -
action can be multiple separated by semicolons
-
BEGIN{ action }&/orEND{ action }blocks are allowed in one-liner and indented code blocks -
built-in functions be
printf(),exit(),str(char *),system(format [, args ..])(run cmd at shell) -
built-in variables are usually read-only; include
pidfor proc-id,commfor name,nsecsfor timesamp,curtaskfor address of current thread'stask_struct -
scratch variables used for temp calculations having prefix
$; name and type is set on first assignment -
map variables use BPF map storage object having prefix
@; key can be provided with one/more elements using maps as a hash table
@start[tid] = nsecs; // simple association
@path[pid, $fd] = str(arg0); // multi-key association
- map functions are special functions that can be assigned to maps or operate on them
@a = count(); // using per CPU map becoming special object of type count; allowing @a++
@b = sum($a); // sums $a variable, when printed gives total
@c = hist($a); // stores $a in a power-of-two histogram, printing bucket counts & ASCII histogram
print(@x); // printing map @x; all maps are auto-printed on BPF termination
delete(@start[tid]); // deletes given key & its value pair
clear(@x); // clears map @x
- running
bpftraceshows its help;-d&&-ddfor debug dry-run & verbose debug
-o fileto redirect output to a file
-e programexecute a program block
-p PIDenable USDT probes on PID
-
tracepoints usually provide args (for info); e.g.
tracepoint:net:netif_rxhave fieldlenfor packet length accessible usingargs->len -
tracepoint:syscalls:sys_enter_read&tracepoint:syscalls:sys_exit_readinsrument start and end ofread()system calls -
its args can be checked using
-lfor list &-vfor verbose; as
% sudo bpftrace -lv 'tracepoint:syscalls:sys_enter_read'
tracepoint:syscalls:sys_enter_read
int __syscall_nr
unsigned int fd
char __attribute__((user)) * buf
size_t count
% sudo bpftrace -lv 'tracepoint:syscalls:sys_exit_read'
tracepoint:syscalls:sys_exit_read
int __syscall_nr
long ret
- trace enter/exit of
clone()syscall that creates new processes; it might have multiple exits/returns (for parent & child processes)
% sudo bpftrace -e 'tracepoint:syscalls:sys_enter_clone {printf("-> clone() by %s PID %d\n", comm, pid); }
tracepoint:syscalls:sys_exit_clone { printf("<- clone() return %d, %s PID %d\n", args->ret, comm, pid); }'
Attaching 2 probes...
-> clone() by vim PID 1216
<- clone() return 9334, vim PID 1216
<- clone() return 0, vim PID 9334
- show
execve()syscallsudo bpftrace -e 't:syscalls:sys_enter_clone {printf("%s \t %s; PID %d\n", probe, comm, pid); }'
- can instrument binaries/shared-libs;
sudo bpftrace -p ${PID} -lshows all probes available over a Process (grep forusdt)
% sudo bpftrace -p 1087 -l | grep usdt ## for shell process
usdt:/proc/1087/root/usr/lib/ld-2.33.so:rtld:init_complete
usdt:/proc/1087/root/usr/lib/ld-2.33.so:rtld:init_start
...
-
arguments for
kprobearearg0, arg1, ...argNas unsigned 64-bit int; if pointer to a C struct then cast it (future BTF makes it automatic) -
arguments for
kretprobeisretvalbuilt-in for return value which is also uint64
-
similar to usdt, binary/lib path with function name as
uprobe:/bin/bash:readline&uretprobe:/bin/bash:readline -
similar to kprobe/kretprobe;
arg0, arg1..arguments foruprobe&retvalforuretprobe
-
format of
software:event_name:count,software:event_name:,hardware:event_name:count,hardware:event_name: -
s/w events similar to tracepoints but suited for count; h/w events are selection of PMCs for processor-level analysis
-
to avoid overhead of too many events; sampled on count
-
available s/w & h/w events depend on kernel version; some are
Software Events
cpu-clock(alias: cpu, default count: 1000000); cpu wall-time clock
task-clock(default: 1000000); cpu task clock (increments only when task is on-CPU)
page-faults(alias: faults, default: 100); page faults
context-switches(alias: cs, default: 1000); context switches
cpu-migrations(default: 1); cpu thread migrations
minor-faults(default: 100); Minor page faults: satisfied by memory
major-faults(default: 1); Major page faults: satisfied by storage I/O
alignment-faults(default: 1); Alignment faults
emulation-faults(default: 1); Emulation faults
dummy(default: 1); Dummy event for testing
bpf-output(default: 1); BPF output channelHardware Events
cpu-cycles(alias: cycles, default: 1000000); CPU clock cycles
instructions(default: 1000000); CPU instructions
cache-references(default: 1000000); CPU last level cache references
cache-misses(alias: branches, default: 100000); Branch instructions
bus-cycles(default: 100000); Bus cycles
frontend-stalls(default: 1000000); Processor frontend stalls (e.g. instruction fetches)
backend-stalls(default: 1000000); Processor backend stalls (e.g. data loads/stores)
ref-cycles(default: 1000000); CPU reference cycles (e.g. unscaled by turbo)
- these are timer based events:
profile:hz:rate,profile:s:rate,profile:ms:rate,profile:us:rate,interval:s:rate,interval:ms:rate
3 type tests in bpftrace: filters, ternary, if statements
based on boolean expressions
==, !=, >, <, >=, <=, &&, ||
-
Filter:
probe /filter/ { action;s } -
Ternary Operators:
test ? true_stmt : false_stmt -
If Statements:
if (test) { true_stmt },if (test) { true_stmt } else { false_stmt }
currently
else if { .. }statements are not supported
- Unrolled Loops: avoids infinite loops; supports
unroll (count) { statements }
count is an integer literal (constant) with max 20
Linux 5.3 kernel includes support for BPF bounded loops
-
assignment
=; arithmetic+, -, *, /; auto-increment/decrement++, -- -
binary and/or/xor/not
&, |, ^, ! -
shift left/right
<<, >> -
compound operator
+=, -=, *=,/=, %=, &=, ^=, <<=, >>=
Process Id
pid, Thread Idtid, User Iduid, User nameusername, Processor Idcpu, Process namecomm,Timestamp (nanosec)
nsecs, Timestamp (nanosecs) since bpftrace initelapsed, Kernel Task Struct as uint64curtask,Kernel Stacktrace
kstack, User Stacktraceustack, Traced Functionfunc, Probe Nameprobe, Cgroup Idcgroup,Some probe type arg
arg0, ..argNand other arg asargs, Return Value for some probe typesretval, Positional args$1,..$N
- printing who's calling
setuid()syscall
% sudo bpftrace -e 't:syscalls:sys_enter_setuid {printf("setuid by PID: %d (%s),\t UID %d\n", pid, comm, uid); }'
- can trace return value by sys_enter_exit; where '-1' is failure
% sudo bpftrace -e 't:syscalls:sys_exit_setuid {printf("setuid by PID %d (%s): UID %d\n", pid, comm, args->ret); }'
- kstack/ustack return kernel/user level stack traces as multi-line string, return upto 127 frames
% sudo bpftrace -e 't:block:block_rq_insert {printf("Block I/O by %s\n", kstack); }'
Attaching 1 probe...
Block I/O by
dd_insert_requests+468
dd_insert_requests+468
blk_mq_sched_insert_requests+98
...
- stack built-ins can also be used as keys in map, freq to be counted
% sudo bpftrace -e 't:block:block_rq_insert {@[ustack] = count(); }'
Attaching 1 probe... ^C
@[ fsync+59 ]: 6
@[]: 180
- passed to program on cmd-line, accessible at position as
$1; usage likewatchconn.bt 101for passing pid 101
% sudo bpftrace -e 'BEGIN {printf("Hey %s!\n", str($1)); }' BPF
Attaching 1 probe...
Hey BPF!
^C
-
Scratch like
$duration_usused for temporary eval in chapter-05-measure-vfs-read-time.bt -
Map like
@n,@n[k],@n[k1 [,k2 [,..]]]; for storage using hash tables for consistent key-val types
Print formatted
printf(char *fmt [, ...]), Print formatted timetime(char *fmt), Print space joined string arrayjoin(char *arr[]),Return string from pointer with optional length
str(char *s [, int len]),Return kernel/user stack to limit frames deep
kstack(int l)&ustack(int l),Resolve kernel/userspace address to return string symbol using
ksym(void *p)&usym(void *p)resp.,Resolve kernel/userspace symbol to an address using
kaddr(char *name)&uaddr(char *name)resp.,Returns value stored in named register
reg(char *name), Returns string for IPntop([int af,] int addr),Prints contents of file
cat(char *file), Exits bpftraceexit();cgroupid(char *path)to resolve Cgroup Id
- similar to C syntax; allows escape sequences; field descriptions
%[-][width]typewhere-makes left-justified (default: right)
uint
%u, int%d, ulong%lu, long%ld, unsigned long long%llu, long long%lld, ushort%hu, short%hdhexadecimal: uint
%x, ulong%lx, ulong long%llxchar
%c, string%s
- example to print 16-char wide
comm& 6-char wide left-justifiedpidviaprintf("%16s %-6d\n", comm, pid);
-
e.g. attempted execution of commands with their args using
sudo bpftrace -e 't:syscalls:sys_enter_execve { join(args->argv); }' -
e.g. checking if execution attempt succeeded
% sudo bpftrace -e 't:syscalls:sys_exit_execve { printf("%s %-d\n", comm, args->ret); }'
a failed attempt means invocation of a command failed, not necessarily for failure faced due to args within the command
-
default string limit is 64bytes; tunable via
BPFTRACE_STRLEN -
max allowed 200 bytes until stored in BPF stack; will increase once moved to BPF maps
-
similar to kstack/ustack, with accepting limit:
kstack(limit),kstack(mode[, limit]),ustack(limit),ustack(mode[, limit]) -
show top3 kernel frames that led to creating block I/O; max limit is 1024 frames
% sudo bpftrace -e 't:block:block_rq_insert { @[kstack(3), comm] = count(); }'
Attaching 1 probe... ^C
@[
dd_insert_requests+468
dd_insert_requests+468
blk_mq_sched_insert_requests+98
, kworker/u16:1]: 1
...
mode arg allows stack output to be formatted differently; supported default
bpftrace&perf
% sudo bpftrace -e 'k:do_nanosleep { printf("%s", ustack(perf)); }'
Attaching 1 probe...
7fe6f239ba95 __GI___clock_nanosleep+117 (/usr/lib/libc-2.33.so)
...
-
e.g. a function pointer arg as
@[args->function] = count() ;can display names if used as@[ksym(args->function)] = count() ; -
usymrelies on symbol table in binaries for symbol lookup
- can look for a user-space symbol say
ps1_promptwhen a function is called saybash:readlineby dereferencing assudo bpftrace -e 'uprobe:bash:readline { printf("%s\n", str(*uaddr("ps1_prompt"))); }'
- generally address space of a pointer gets inferred; but there are some corener cases as kernel func dealing with userspace pointers
% sudo bpftrace -e 'k:do_sys_open { printf("%s\n", str(uptr(arg1))); }' ## usage
- executes command at shell, syntax
system(char *fmt [, args...])and require--unsafeoption
example, calling
psto print details on PID callingnanosleep()
sudo bpftrace --unsafe -e 't:syscalls:sys_enter_nanosleep { system("ps -p %d\n", pid); }'
-
if used for a frequent event, it will load up more processes per event onto CPU causing overload
-
signalrequires unsafe as well and can send a specified signal to current tasksudo bpftrace -e 'k:f { signal("KILL"); }' -
catprints file content as
% sudo bpftrace -e 't:syscalls:sys_enter_sendmsg { printf("%s => ", comm);
cat("/proc/%d/cmdline", pid);
print("\n"); }'
Attaching 1 probe...
systemd-timesyn => /usr/lib/systemd/systemd-timesyncd
systemd-udevd => /usr/lib/systemd/systemd-udevd
...
sudo bpftrace -e 't:syscalls:sys_enter_read { @reads = count() } interval:s:5 { exit(); }'
Count occurences
count(); Sum the valuesum(int n); Averages the valueavg(int n);Minimum value
min(int n); Maximum valuemax(int n); Print map with optional limitsprint(@m [,top [,div]])Return count, avg & total
stats(int n); Print pow-2 histogramhist(int n); Print linear histogramlhist(int n, min, max, step);Delete map k/v pair
delete(@m[k]); Delete all map keysclear(@m); Sets all map values to zerozero(@m)
- examples
% sudo bpftrace -e 't:block:* { @[probe] = count() } interval:s:5 { exit(); }'
Attaching 19 probes...
@[tracepoint:block:block_plug]: 1
@[tracepoint:block:block_unplug]: 1
...
- e.g. print per-interval rate
% sudo bpftrace -e 't:block:block_rq_* { @[probe] = count() } interval:s:5 { print(@); print("\n"); clear(@); }'
Attaching 7 probes...
@[tracepoint:block:block_rq_issue]: 5
@[tracepoint:block:block_rq_complete]: 5
@[tracepoint:block:block_rq_insert]: 1
@[tracepoint:block:block_rq_issue]: 8
@[tracepoint:block:block_rq_complete]: 9
...
basic functionality can also be done via
perf stat
-
total bytes read
% sudo bpftrace -e 't:syscalls:sys_exit_read /args->ret > 0/ { @bytes = sum(args->ret); }' -
error counts are better with freq count of error codes
% sudo bpftrace -e 't:syscalls:sys_exit_read /args->ret < 0/ { @[- args->ret] = count(); }'
Attaching 1 probe...^C
@[11]: 9 ### Error Code: 11 (EAGAIN) is for Try Again in read()
- printing time in microsecond using
div
% sudo bpftrace -e 'k:vfs_read { @start[tid] = nsecs; }
kr:vfs_read /@start[tid]/ { @us[comm] = sum(nsecs - @start[tid]); delete(@start[tid]); }
END { print(@us, 0, 1000); clear(@us); clear(@start); }'
Attaching 3 probes...^C
@us[Monitor thread]: 2
@us[systemd-journal]: 5
@us[ls]: 12
@us[git]: 257
@us[tail]: 14356
...
- experimental in Linux 5.3; C style
whileloops as
% sudo bpftrace -e 'i:ms:100 { $i = 0; while($i <= 10) {printf("%d ", $i); $i++;} exit();}'
Attaching 1 probe...
0 1 2 3 4 5 6 7 8 9 10
plyBPF front-end provides high-level interface without LLVM/Clang depenedency (but struct navigation & including header files not possible)
-
bpftrace language defined by lex & yacc files; output an AST
-
Tracepoint & Clang parsers process structs; Semantic Analyzer checks use of language
-
AST nodes converted to LLVM IR, finally compiled to BPF bytecode
printf()could be used to leave breadcrumbs in run
printf("$duration_ms = (%d - %d) / 1000000\n", nsecs, @start[tid]);
-dprints AST & LLVM IR;-vprints BPF bytecode; there is also-ddfor verbose debug
bpftrace -d -e 'k:vfs_read { @[pid] = count(); }'