[Diagram: BPF internals. User side: a BPF program is compiled to BPF bytecode (with BTF type info) and loaded into the kernel. Kernel side: the Verifier checks the bytecode, then the BPF program attaches to event sources — static tracing (sockets, tracepoints, user markers), dynamic tracing (kprobes, uprobes), and sampling/PMCs (perf_events). Event config flows from user space in; per-event data comes back out via the perf buffer, and stats & stacks via BPF maps.]
-
classic BPF was a limited VM with 2 registers (an accumulator A & an index register X), a 16-slot scratch memory & a program counter
-
all operating on 32-bit words
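The classic machine model above can be sketched as a tiny interpreter. This is purely illustrative, not the real kernel interpreter; the mnemonics (`ld_abs`, `and_k`, etc.) are invented stand-ins for a few classic-BPF-like operations.

```python
# Toy model of the classic BPF machine: two registers (A, X), 16 scratch
# memory slots M[0..15], a program counter, and 32-bit arithmetic.
MASK32 = 0xFFFFFFFF

def run(program, packet):
    A = X = 0                # the two registers
    M = [0] * 16             # 16-slot scratch memory
    pc = 0                   # program counter
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "ld_abs":   # A = 32-bit word loaded from packet offset arg
            A = int.from_bytes(packet[arg:arg + 4], "big") & MASK32
        elif op == "and_k":  # A &= immediate
            A = (A & arg) & MASK32
        elif op == "add_k":  # A += immediate, wrapping at 32 bits
            A = (A + arg) & MASK32
        elif op == "st":     # M[arg] = A (scratch store)
            M[arg] = A
        elif op == "ret_a":  # return A
            return A
        elif op == "ret_k":  # return immediate
            return arg
    return 0

prog = [
    ("ld_abs", 0),      # load first 4 packet bytes into A
    ("and_k", 0xFFFF),  # keep low 16 bits (like the '(54) w2 &= 65535' seen in dumps)
    ("st", 0),          # stash in scratch slot 0
    ("ret_a", None),    # return A
]
print(run(prog, bytes([0x12, 0x34, 0x56, 0x78])))
```

Real classic BPF filters (as emitted by `tcpdump -d`) follow this same accumulator-machine shape.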
-
the implementation later gained a JIT compiler (alongside the interpreter) that translates BPF to native code, improving performance
-
with the JIT there is no VM layer at runtime; the compiled code runs natively in the kernel
-
eBPF brought changes:
added more registers (R0-R9 general purpose, plus R10 as a read-only frame pointer)
switched to 64-bit words
added practically unlimited, flexible storage via BPF maps; and 512 bytes of stack space
enabled calls to a restricted set of kernel helper functions
-
BPF programs must be safe to run and must finish in bounded time
-
bitehist shows the size of disk I/O as a histogram; generating the histogram in kernel context avoids copying every event to user space for reprocessing, hence performant
-
BPF programs are verified, hence safer than kernel-loadable modules; they also need less skill to adapt, enabling wider-scale use
-
LLVM supports BPF as a compilation target; thus, in theory, any LLVM-supported high-level language can be compiled to BPF, with LLVM further optimizing the BPF instructions
-
bpftool allows viewing & manipulating BPF objects, including programs & maps
currently bpftool operates on OBJECT := { prog | map | link | cgroup | perf | net | feature | btf | gen | struct_ops | iter }
bpftool perf shows programs attached via perf_event_open()
bpftool prog show lists all programs
# sudo bpftool prog show
21: cgroup_device tag f958f6eb72ef5af6 gpl
loaded_at 2021-06-15T18:08:38+0530 uid 0
xlated 456B jited 276B memlock 4096B
22: cgroup_skb tag 6deef7357e7b4530 gpl
loaded_at 2021-06-15T18:08:38+0530 uid 0
...
...
xlated 64B jited 54B memlock 4096B
1795: tracing name do_sys_open tag 871b7e76e78a9993 gpl
loaded_at 2021-06-18T01:21:11+0530 uid 0
xlated 592B jited 433B memlock 4096B map_ids 19
btf_id 9
the bpftrace program IDs here are 21, 22, ... without the btf_id attribute, & the BCC program ID 1795 has the btf_id attribute
- dump BPF program instructions in assembly (using xlated mode) as
# sudo bpftool prog dump xlated id 21
0: (61) r2 = *(u32 *)(r1 +0)
1: (54) w2 &= 65535
...
# sudo bpftool prog dump xlated id 1795
int kretfunc__do_sys_open(long long unsigned int * ctx):
; KRETFUNC_PROBE(do_sys_open, int dfd, const char __user *filename, int flags, int mode, int ret)
0: (bf) r6 = r1
; KRETFUNC_PROBE(do_sys_open, int dfd, const char __user *filename, int flags, int mode, int ret)
1: (79) r1 = *(u64 *)(r6 +32)
2: (7b) *(u64 *)(r10 -304) = r1
...
# sudo bpftool prog dump xlated id 1795 linum
int kretfunc__do_sys_open(long long unsigned int * ctx):
; KRETFUNC_PROBE(do_sys_open, int dfd, const char __user *filename, int flags, int mode, int ret) [file:/virtual/main.c line_num:52 line_col:0]
0: (bf) r6 = r1
; KRETFUNC_PROBE(do_sys_open, int dfd, const char __user *filename, int flags, int mode, int ret) [file:/virtual/main.c line_num:52 line_col:1]
1: (79) r1 = *(u64 *)(r6 +32)
2: (7b) *(u64 *)(r10 -304) = r1
...
when used with a BTF-compiled program such as 1795, the dump includes source info
when appended with the linum modifier, it includes source file & line number info as well
similarly, the opcodes modifier includes BPF instruction opcodes, AND the visual modifier emits control-flow graph info in DOT format
bpftool prog dump jited id <id> shows the machine code that will execute
-
sudo bpftool btf dump [prog] id <id> dumps the BTF data for a program (e.g. a BCC one), with typedefs & struct info
-
tcpdump emits BPF instructions with -d; bpftrace does it with -v
a BPF program can only call the provided helper functions; some of them are
bpf_map_lookup_elem(map, key), bpf_map_update_elem(map, key, val, flags), bpf_map_delete_elem(map, key)
bpf_probe_read(dst, size, src), bpf_probe_read_str(dst, size, src), bpf_trace_printk(fmt, fmt_size, ...)
bpf_ktime_get_ns(), bpf_spin_lock(lock), bpf_spin_unlock(lock), bpf_get_current_task()
bpf_get_current_pid_tgid(), bpf_get_current_comm(buf, buf_size), bpf_get_current_cgroup_id()
bpf_get_stackid(ctx, map, flags), bpf_perf_event_output(ctx, map, data, size), bpf_perf_event_read_value(map, flags, buf, size)
here, bpf_probe_read() safely reads kernel memory from outside BPF's own memory
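The map helpers have well-defined flag semantics worth internalizing. Below is a user-space model (a plain dict standing in for the kernel map); the flag values BPF_ANY=0, BPF_NOEXIST=1, BPF_EXIST=2 and the errno-style returns match the kernel's uapi definitions, but the class itself is just an illustration, not a real API.

```python
# Model of bpf_map_lookup_elem / bpf_map_update_elem / bpf_map_delete_elem
# semantics, including the update flags.
BPF_ANY, BPF_NOEXIST, BPF_EXIST = 0, 1, 2

class BpfMapModel:
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.data = {}

    def lookup_elem(self, key):              # bpf_map_lookup_elem(map, key)
        return self.data.get(key)            # None models a NULL return

    def update_elem(self, key, val, flags):  # bpf_map_update_elem(map, key, val, flags)
        if flags == BPF_NOEXIST and key in self.data:
            return -17                       # -EEXIST: entry already present
        if flags == BPF_EXIST and key not in self.data:
            return -2                        # -ENOENT: entry missing
        if key not in self.data and len(self.data) >= self.max_entries:
            return -7                        # -E2BIG: map is full
        self.data[key] = val
        return 0

    def delete_elem(self, key):              # bpf_map_delete_elem(map, key)
        if key not in self.data:
            return -2                        # -ENOENT
        del self.data[key]
        return 0

m = BpfMapModel(max_entries=2)
assert m.update_elem(b"pid:1", 10, BPF_NOEXIST) == 0
assert m.update_elem(b"pid:1", 20, BPF_NOEXIST) == -17  # already exists
assert m.update_elem(b"pid:1", 20, BPF_EXIST) == 0
assert m.lookup_elem(b"pid:1") == 20
```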
- some of the BPF syscall commands user space can invoke are
BPF_MAP_CREATE, BPF_MAP_LOOKUP_ELEM, BPF_MAP_UPDATE_ELEM, BPF_MAP_DELETE_ELEM, BPF_MAP_GET_NEXT_KEY
BPF_PROG_LOAD, BPF_PROG_ATTACH, BPF_PROG_DETACH, BPF_OBJ_PIN
- the bpf() syscalls a program makes can be inspected with
strace -ebpf <program>
# sudo strace -ebpf bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERF_EVENT_ARRAY, key_size=4, value_size=4, max_entries=8, map_flags=0, inner_map_fd=0, map_name="printf", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0, btf_vmlinux_value_type_id=0}, 120) = 3
Attaching 1 probe...
bpf(BPF_MAP_UPDATE_ELEM, {map_fd=3, key=0x7ffe2583400c, value=0x7ffe25834010, flags=BPF_ANY}, 120) = 0
bpf(BPF_MAP_UPDATE_ELEM, {map_fd=3, key=0x7ffe2583400c, value=0x7ffe25834010, flags=BPF_ANY}, 120) = 0
- BPF program types specify the type of events that a BPF program attaches to & the arguments passed for those events
main types:
BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_PERF_EVENT, BPF_PROG_TYPE_RAW_TRACEPOINT
other types:
BPF_PROG_TYPE_SOCKET_FILTER, BPF_PROG_TYPE_SCHED_CLS, BPF_PROG_TYPE_XDP, BPF_PROG_TYPE_CGROUP_SKB
- BPF Map Types:
BPF_MAP_TYPE_HASH, BPF_MAP_TYPE_ARRAY, BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_MAP_TYPE_PERCPU_HASH, BPF_MAP_TYPE_PERCPU_ARRAY, BPF_MAP_TYPE_STACK_TRACE, BPF_MAP_TYPE_STACK
more special-purpose map types are listed in bpf.h
-
Linux 5.1 added spin lock helpers, but they are not available for use in tracing programs
-
tracing frontends use the per-CPU hash & array map types, avoiding corruption from overlapping concurrent reads/writes that cause lost updates
# sudo strace -febpf bpftrace -e 'k:vfs_read { @ = count(); }'
strace: Process 921230 attached
[pid 921230] +++ exited with 0 +++
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERCPU_ARRAY, ... <<<--- using a PER-CPU map
'k:vfs_read { @++; }' uses a normal hash instead
- using both at the same time shows undercounted events in the normal hash
# sudo bpftrace -e 'k:vfs_read { @cpuhash = count(); @hash++; }'
Attaching 1 probe... ^C
@cpuhash: 9060
@hash: 9003
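The undercount above comes from a classic lost-update race. This deterministic simulation (invented example, not real kernel code) shows one bad interleaving of a shared read-modify-write counter, and how per-CPU slots summed at read time avoid it:

```python
# Why tracing frontends prefer per-CPU maps: a shared counter updated with a
# non-atomic read-modify-write loses updates when two CPUs interleave.

def shared_counter_interleaved():
    counter = 0
    # Both CPUs read the same value before either writes back:
    cpu0_read = counter
    cpu1_read = counter
    counter = cpu0_read + 1   # CPU 0 writes back 1
    counter = cpu1_read + 1   # CPU 1 overwrites with 1 -> one update lost
    return counter

def percpu_counter():
    slots = [0, 0]            # one slot per CPU, no sharing, no race
    slots[0] += 1             # CPU 0 increments its own slot
    slots[1] += 1             # CPU 1 increments its own slot
    return sum(slots)         # user space sums the slots on read

print(shared_counter_interleaved())  # 1: an update was lost
print(percpu_counter())              # 2: exact
```

The per-CPU approach trades a slightly more expensive read (summing N slots) for race-free, lock-free writes on the hot path.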
- other mechanisms in use are
BPF_XADD for atomic map updates, BPF spin locks via bpf_spin_lock(), and the atomic race-free bpf_map_update_elem()
-
conventionally, /sys/fs/bpf is a mounted virtual FS exposing BPF programs & maps; it allows them to persist even after their loaders exit
-
user-level programs can also read/write these maps to interact
-
BPF_OBJ_PIN can be used to pin a BPF object at a path, making it persistent & available; this keeps it loaded even after it is detached
-
BTF (BPF Type Format) is a metadata format encoding debugging info for BPF programs, maps & more
-
inspection & tracing tools use this to pretty-print details instead of a raw hex dump
Compile Once - Run Everywhere aims to compile to BPF bytecode once so the result can be shipped as a releasable distribution, avoiding the need for a BPF compiler everywhere
-
only a limited set of kernel (helper) functions can be called; loops are restricted
-
MAX_BPF_STACK size is 512 bytes; unprivileged BPF programs are limited to 4096 instructions & privileged ones to 1 million
- BPF provides special map types to record stack traces; these are fetched using frame-pointer-based or ORC-based stack walks
-
frame-pointer based: the head of the linked list of stack frames is found in a register (RBP on x86_64), with the return address saved at a +8 offset from RBP
-
this is the AMD64 convention; BUT gcc may default to omitting the frame pointer, using RBP as a general-purpose register, which breaks this stack walk
fixed by the -fno-omit-frame-pointer option; although most kernel & system software is compiled with gcc defaults, breaking this walk
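The frame-pointer walk above is just following a linked list through memory. This sketch uses a fake memory snapshot (a dict of made-up addresses) purely to illustrate hopping the saved-RBP chain:

```python
# Frame-pointer-based stack walk on x86_64: each frame holds the caller's
# saved RBP at [rbp] and the return address at [rbp + 8].

def walk_stack(memory, rbp, max_depth=16):
    return_addrs = []
    while rbp and len(return_addrs) < max_depth:
        saved_rbp = memory.get(rbp)      # [rbp]     -> caller's frame pointer
        ret_addr = memory.get(rbp + 8)   # [rbp + 8] -> return address
        if ret_addr is None:
            break
        return_addrs.append(ret_addr)
        rbp = saved_rbp                  # hop to the parent frame
    return return_addrs

# Fake stack snapshot (all addresses invented): three nested frames,
# with the saved-RBP chain ending at 0.
memory = {
    0x7ffd1000: 0x7ffd2000, 0x7ffd1008: 0x400300,  # current frame
    0x7ffd2000: 0x7ffd3000, 0x7ffd2008: 0x400200,  # parent frame
    0x7ffd3000: 0x0,        0x7ffd3008: 0x400100,  # oldest frame
}
print([hex(a) for a in walk_stack(memory, 0x7ffd1000)])
```

When the compiler reuses RBP as a general-purpose register, `memory[rbp]` no longer holds a saved frame pointer, and this walk produces garbage or terminates early — exactly the breakage `-fno-omit-frame-pointer` avoids.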
- DWARF: debug info in ELF files; also includes sections containing source & line number info
BPF doesn't support DWARF-based stack walking, it being processor intensive; neither do BCC & bpftrace
- LBR (Last Branch Record): an Intel feature; can reconstruct a stack trace from branches recorded in hardware with no overhead, but depth is limited by the size of the branch record
not supported by BPF at the time of the book's writing
- ORC: debug info devised for stack traces; better optimized than DWARF
implemented in the Linux kernel; supported by BPF; not yet available for user space
stack traces are recorded in the kernel as arrays of addresses & later translated to symbols by a user-level program; the address-to-symbol mapping might change in between
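The user-level translation step is essentially a sorted lookup: find the nearest symbol starting at or before each recorded address. The symbol table and addresses below are invented for illustration:

```python
# Translate recorded addresses to "symbol+offset" strings, as a tracer's
# user-level component would after reading a stack trace from a BPF map.
import bisect

SYMBOLS = [                  # sorted by start address (fake symbol table)
    (0x400100, "main"),
    (0x400200, "do_read"),
    (0x400300, "vfs_read"),
]
STARTS = [s[0] for s in SYMBOLS]

def symbolize(addr):
    # Index of the last symbol whose start address is <= addr.
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        return f"[unknown:{addr:#x}]"
    start, name = SYMBOLS[i]
    return f"{name}+{addr - start:#x}"

stack = [0x400310, 0x400208, 0x400150]   # addresses captured in-kernel
print([symbolize(a) for a in stack])
```

Because translation happens after the fact, a process that exits (or remaps libraries) before translation leaves addresses that no longer resolve — the "mapping might change" caveat above.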
can be used to compare CPU profiles & to visualize recorded stack traces from any profiler/tracer
-
a stack trace: the current function at the top, moving downward through its parent, then the parents of the parent (the ancestry)
-
profiling stack traces: timed sampling collects many stack traces (easily tens to hundreds, each many lines long)
the Linux perf profiler summarizes them as a call tree showing the percentage of each path; BCC summarizes a count for each unique trace
-
Flame Graph: an adjacency diagram; the most frequent stack is the widest tower (check those first)
Flame Graph properties
each box represents a function in the stack
the y-axis shows stack depth; reading bottom-up gives the code flow
the x-axis spans the sample population; left-to-right ordering IS NOT by time but an alphabetical sort of frames
- Flame Graph features:
hue indicating code type, saturation hashed from the function name, background color mapping to graph type; mouse-over revealing occurrence percentages; zoom to inspect narrow frames; and a search to highlight or find frames
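The usual input to a flame graph generator is "folded" stacks: one line per unique stack, frames joined root-to-leaf, followed by a sample count. A minimal sketch with invented sample data (the alphabetical sort mirrors the non-temporal x-axis described above):

```python
# Fold sampled stack traces into the collapsed format used to build
# flame graphs (as stackcollapse-style scripts do).
from collections import Counter

samples = [                   # each sample: stack as a leaf-first list
    ["vfs_read", "do_read", "main"],
    ["vfs_read", "do_read", "main"],
    ["malloc", "do_read", "main"],
    ["vfs_write", "main"],
]

def fold(samples):
    # Join frames root-first with ';', count identical stacks,
    # and sort alphabetically (like the flame graph's x-axis).
    counts = Counter(";".join(reversed(s)) for s in samples)
    return sorted(counts.items())

for stack, n in fold(samples):
    print(f"{stack} {n}")
```

Feeding lines in this `stack count` format to flamegraph.pl (or any compatible renderer) yields the towers described above, with the two identical `vfs_read` samples merging into one double-width frame.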
[Diagram: event sources across the software stack]
- applications & system libraries: uprobes dynamic instrumentation, plus user markers
- system call interface & below: kprobes dynamic instrumentation
- kernel tracepoints & software events by layer:
system call interface: syscalls:
VFS & file systems: btrfs:, ext4:, xfs:; volume manager: jbd2:; block device interface: block:
sockets: sock:, skb:; TCP/UDP: tcp:, udp:; net device: net:, xdp:
scheduler & tasks: cpu-clock, cs, migrations, sched:, task:, signal:, timer:, workqueue:
virtual memory: page-faults, minor-faults, major-faults, kmem:, vmscan:, writeback:, huge_memory:, compaction:
device drivers: scsi:, irq:
-
kprobes provide dynamic instrumentation of (almost) any kernel function, live in production, without a reboot or running in any special mode
-
kretprobes is a kprobe interface instrumenting function returns & return values
timestamps from kprobe & kretprobe instrumentation on the same function can provide the function's duration
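Deriving duration means pairing each return with its matching entry, typically keyed by thread ID — the same pattern bpftrace's `@start[tid] = nsecs` idiom expresses. A sketch over an invented event stream:

```python
# Pair kprobe (entry) and kretprobe (return) timestamps per thread ID
# to compute function durations, as a tracer would.

def durations(events):
    start = {}                                   # tid -> entry timestamp (ns)
    out = []
    for kind, tid, ts_ns in events:
        if kind == "entry":                      # kprobe fired
            start[tid] = ts_ns
        elif kind == "return" and tid in start:  # kretprobe fired
            out.append((tid, ts_ns - start.pop(tid)))
        # a return with no recorded entry (tracer attached mid-call) is dropped
    return out

events = [
    ("entry", 101, 1_000), ("entry", 102, 1_200),
    ("return", 101, 1_500), ("return", 102, 2_200),
]
print(durations(events))   # [(101, 500), (102, 1000)]
```

Keying by tid is what makes this safe for concurrent calls across threads; recursive calls within one thread would need a per-tid stack instead of a single slot.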
- if a kprobe is for an address already instrumented by Ftrace, an Ftrace-based kprobe optimization may be available
kernel functions listed with the NOKPROBE_SYMBOL() macro are skipped from tracing
-
kprobe APIs
-
Ftrace-based, via /sys/kernel/debug/tracing/kprobe_events
-
perf_event_open(), as used by the perf tool and, more recently, BPF tracing
-
BCC via attach_kprobe() and attach_kretprobe(); supports instrumenting the beginning of a function as well as function-plus-instruction-offset
-
bpftrace via the kprobe and kretprobe probe types; supports instrumenting the beginning of a function only
-
an example of BCC is the vfsstat tool, instrumenting the Virtual File System interface using hooks like b.attach_kprobe(event="vfs_read", fn_name="do_read")
-
the following bpftrace example counts invocations of VFS functions matching vfs_*
% sudo bpftrace -e 'kprobe:vfs_* { @[probe] = count() }'
Attaching 67 probes...
^C
@[kprobe:vfs_fallocate]: 1
@[kprobe:vfs_fstat]: 1
@[kprobe:vfs_fsync_range]: 3
...
similar to using the funccount tool from the Ftrace-based perf-tools collection
references: kprobes kernel doc; kprobes intro, kernel debugging with kprobes
-
uprobes provide user-level dynamic instrumentation; a function is instrumented in the executable file itself, so all processes using that file are instrumented
-
has uprobes for function entry (with instruction offsets) & uretprobes (for function return); implemented similarly to kprobes
to instrument the readline function with bpftrace:
sudo bpftrace -e 'uprobe:/bin/bash:readline { @ = count() }'
-
Ftrace-based via /sys/kernel/debug/tracing/uprobe_events
-
perf_event_open(); with the perf_uprobe pmu
a register_uprobe_event() kernel function is available, though not exposed as an API
- BCC probes via attach_uprobe() (can instrument the beginning of a function, or with an offset) & attach_uretprobe()
an example of BCC is gethostlatency, which instruments host resolution calls using calls like b.attach_uprobe(name="c", sym="getaddrinfo", fn_name="do_entry", pid=args.pid)
- bpftrace probe types uprobe (beginning of function only) & uretprobe
a bpftrace example:
bpftrace -l 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:gethost*'
- attaching uprobes to events fired millions of times per second (like malloc, free) can cause slowdowns even though BPF is optimized; such use can be prohibitively expensive
a shared-library solution is under discussion to provide BPF tracing of user space without the kernel mode switch
references: uprobes kernel doc
- tracepoints are used for kernel static instrumentation; they are tracing calls placed in the kernel by its developers
they thus need maintenance, but provide a stable API across kernel versions; unlike kprobes, which might break on a function rename
- the format of a tracepoint is subsystem:eventname (e.g. kmem:kmalloc)
the tracepoint header file sched.h under include/trace/events defines the trace system sched and tracepoints such as sched_process_exec
- info is available at runtime via the Ftrace framework in /sys, via a format file for each tracepoint
$ cat /sys/kernel/debug/tracing/events/sched/sched_process_exec/format
name: sched_process_exec
ID: 309
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
...
-
Ftrace-based via /sys/kernel/debug/tracing/events, which has sub-dirs for each tracepoint system & files for each tracepoint itself
-
perf_event_open(), used by BPF via the perf_tracepoint pmu
-
BCC via TRACEPOINT_PROBE(), and bpftrace with the tracepoint probe type
-
tcplife is an example of BCC with tracepoints, using BPF.tracepoint_exists("sock", "inet_sock_set_state") to use bpf_text_tracepoint if it exists
-
a bpftrace one-liner:
bpftrace -e 'tracepoint:sched:sched_process_exec { printf("exec by %s\n", comm); }'
% sudo bpftrace -e 'tracepoint:sched:sched_process_exec { printf("exec by %s\n", comm); }'
Attaching 1 probe...
exec by sh
exec by git
^C
BPF_RAW_TRACEPOINT (added in Linux 4.17) avoids the cost of creating the stable tracepoint arguments; this makes for an unstable API, but with access to more fields
references: tracepoints kernel doc
-
USDT (User-level Statically Defined Tracing) provides userspace version of Tracepoints
-
USDT depends on an external system tracer to activate; it was made popular by DTrace
many apps don't compile in USDT probes by default; they require a build config option
- probes can be added to an app using headers & tools from systemtap-sdt-dev, or Facebook's folly
- similar to the others, when the app is compiled a nop instruction is placed at the address of each USDT probe; when instrumented, the kernel dynamically changes it to a breakpoint using uprobes
- BCC with USDT.enable_probe() and bpftrace with the usdt probe type
references: Hacking Linux USDT with Ftrace, USDT probe support in BPF/BCC, USDT tracing report
-
some languages are interpreted or compiled on the fly; dynamic USDT can be used there
-
it works by pre-compiling a shared library with the desired USDT probes embedded in functions (including the ELF notes section for USDT probes); this library is then wrapped & called
libstapsdt automatically creates a shared library containing USDT probes & the ELF notes section at runtime
libusdt is being worked on for the same
-
PMCs (Performance Monitoring Counters), also known as PICs (Performance Instrumentation Counters), CPCs (CPU Performance Counters), or PMU (Performance Monitoring Unit) events... are all just programmable hardware counters on the processor
-
there are many PMCs; Intel selected 7 as architectural set
-
only PMCs allow measuring the efficiency of CPU instructions; CPU cache hit ratios; memory, interconnect & device bus utilization; stall cycles; and more
-
there are 100s of PMCs available, but only a fixed number (around 6) can be chosen to measure at a time
-
Counting: PMCs track event rates, with almost zero overhead
-
Overflow Sampling: PMCs interrupt the kernel on monitored events to collect extra state; this might cause serious overhead depending on the occurrence rate
-
PEBS (Precise Event-Based Sampling) uses hardware buffers to record the correct instruction pointer at the time of the PMC event (managing overflow sampling's skid)
- many Cloud Providers have not provided PMC access to guests