Skip to content

wait_event_timing — independent benchmark & testing (Hacking Postgres 2026-06-17) #48

Description

@NikolayS

wait_event_timing — independent benchmark & testing (tracking issue)

Tracking issue for the independent benchmarking / testing of the wait_event_timing patch series (Dmitry Fomin's PR: DmitryNFomin#2), started during a Hacking Postgres session on 2026-06-17 (Kirk Wolak, Dmitry Fomin, Nik). An AI agent ("Leo", via samograph) joined the call and drove the setup live.

Design brief (living doc, GitHub Pages): https://nikolays.github.io/wet-timing-bench-brief/


Feature under test

wait_event_timing — Oracle-style wait-event instrumentation. Compile-time gate --enable-wait-event-timing (meson -Dwait_event_timing=true); runtime GUC wait_event_capture = off | stats | trace.

  • stats — per-(backend, event) count / total / max + 32-bucket log2 histogram (eager shmem; lazily DSA in later patch).
  • trace — per-session DSA ring of individual waits (wait_event_trace_ring_size = 4 MB/backend, power-of-two, fixed at start) + query-attribution markers.
  • Sizing GUCs: wait_event_trace_ring_size=4096kB, wait_event_timing_max_tranches=192.

Hardware / build

  • Box: Hetzner CCX43 — 16 dedicated vCPU (AMD EPYC-Milan), 64 GB RAM, Ubuntu 24.04, Falkenstein. (ephemeral benchmark host, SSH-key-only, to be destroyed after.)
  • Build: branch wet-series (PR Add comprehensive wait events coverage gap analysis #2 head, commit 983f0ec), base = upstream master 89eafad297a. ./configure --prefix=/opt/pg-wet --enable-wait-event-timing CC="ccache gcc" CFLAGS="-O2" (in-tree, gcc -O2). Baseline = wet-series-base (no flag).

Validity gate (the trap we hit & fixed)

A build configured without the flag still exposes the wait_event_capture GUC but silently downgrades stats/trace to off → off/stats/trace all equal baseline → a false ~0% overhead. Mitigation, run before every benchmark:

  • start a server on the built binary, SET wait_event_capture=stats then SELECT current_setting('wait_event_capture') must return stats (and trace), and pg_stat_wait_event_timing view must exist. No gate pass ⇒ no benchmark.

Methodology (the ladder)

vanilla → patched-stub (flag off) → compiled+capture=off → stats → trace. Two cost concepts kept distinct: static overhead (constant cost of having it compiled/enabled) vs observer effect (dynamic, grows with sampling/reading frequency).

Canonical measured-run command (reused for all workloads)

pgbench -S -c 16 -j 16 -T 180 -P 15 -r -n postgres

-c=-j=16 (= cores; avoids pgbench co-location oversubscription) · -T 180 (3-min, stable) · -P 15 (progress every 15s) · -r (per-statement latency) · -n (no vacuum) · select-only. --log intentionally NOT used (it is itself an observer-effect source). 3 interleaved rounds (baseline→off→stats→trace), median.

Workloads

  • W1 — read-only cached (shared_buffers=8GB, scale 100): inline-gate / CPU-bound cost.
  • W2 — read-only wait-saturated (shared_buffers=256MB, scale 400): TPS + wait-event proof (top events + total transition count; confirms we actually exercise waits).
  • W3 — same as W2 but -R <50% of W2 baseline TPS>: latency under headroom (regular load-test, not stress). + CPU-idle sampling to confirm headroom.
  • Clusters: fsync=on, compute_query_id=on + track_activities=on (required for trace query-attribution; kept on for all variants = apples-to-apples).

Preliminary results (iteration 1, no pg_stat_statements — RUNNING)

W1 (read-only cached), early rounds, on the verified build:

variant median TPS Δ vs baseline
baseline ~222,900
off ~222,930 +0.01%
stats ~223,160 +0.12%
trace ~222,100 −0.36%

→ all within ±0.4% = statistically indistinguishable (virtualized-env noise). The discriminating numbers come from W2 (wait-saturated).

Iterations

  • Round 1 (running): revised config, no pg_stat_statements — the without-pgss reference.
  • Round 2 (staged, ready): + pg_stat_statements (shared_preload_libraries + CREATE EXTENSION on both clusters) to add realistic shared-hash/LWLock contention (more wait events); resets pgss before each run, cross-checks its call count (≈ TPS×180), dumps top queries alongside the wait events. Also CPU-idle headroom sampling.

Open questions / next steps

  • Stub build (patched, compiled without the flag) — measures cost of the source changes when #ifdef'd out (expected ≈0). Not yet built.
  • W4 (read-side observer effect): while one pgbench drives the workload, a second client polls the wait-event views at increasing frequency → measures how reading the instrumentation perturbs what it measures.
  • Per-iteration results HTML pages (with Python charts) + a "what changed vs previous round" diff.
  • ≥3 independent review agents post-completion (performance + correctness); re-run if anything is off.
  • Store raw bench.log per round.

Artifacts

Scripts are in the collapsible sections below (live host IP redacted as <BENCH_HOST>). Raw bench.log per round will be attached as runs complete.

Scripts

run_bench.sh (Round 2 — current, with pg_stat_statements + CPU-headroom + top-queries)
#!/bin/bash
# REVISED benchmark (team spec 2026-06-17): select-only, -c16 -j16 -T180 -P15 -r -n, 3 rounds
#  W1 cached (SB=8GB, s100); W2 wait-saturated (SB=256MB, s400) + wait-event proof;
#  W3 = W2 but -R at 50% of W2 baseline median TPS (latency focus). No TPC-B.
BASELINE_BIN=/opt/pg-baseline/bin
WET_BIN=/opt/pg-wet/bin
BASELINE_DATA=/data/pg-baseline
WET_DATA=/data/pg-wet
BPORT=5440
WPORT=5441
LOG=/root/bench.log
DUR=180
mkdir -p /tmp/bench_results
exec > >(tee -a $LOG) 2>&1
log() { echo "[$(date '+%H:%M:%S')] $*"; }

stop_all() {
    su - pgbench -c "$BASELINE_BIN/pg_ctl -D $BASELINE_DATA stop -m fast 2>/dev/null; exit 0" 2>/dev/null || true
    su - pgbench -c "$WET_BIN/pg_ctl -D $WET_DATA stop -m fast 2>/dev/null; exit 0" 2>/dev/null || true
    sleep 3
}

set_conf() { # datadir key value
    python3 -c "
import re
p='$1/postgresql.conf'
c=open(p).read()
c=re.sub(r'^#?\s*$2\s*=.*', \"$2 = '$3'\", c, flags=re.M)
if '$2' not in c: c+=\"\n$2 = '$3'\n\"
open(p,'w').write(c)
"
}

start_baseline() { # sb
    stop_all
    set_conf "$BASELINE_DATA" shared_buffers "$1"
    set_conf "$BASELINE_DATA" shared_preload_libraries "pg_stat_statements"
    su - pgbench -c "$BASELINE_BIN/pg_ctl -D $BASELINE_DATA -l /tmp/baseline.log start"
    sleep 4
}
start_wet() { # sb capture
    stop_all
    set_conf "$WET_DATA" shared_buffers "$1"
    set_conf "$WET_DATA" shared_preload_libraries "pg_stat_statements"
    set_conf "$WET_DATA" wait_event_capture "$2"
    su - pgbench -c "$WET_BIN/pg_ctl -D $WET_DATA -l /tmp/wet.log start"
    sleep 4
    local actual
    actual=$(su - pgbench -c "$WET_BIN/psql -p $WPORT -tAc 'SHOW wait_event_capture;' postgres" 2>/dev/null | tr -d ' \n')
    log "    confirmed wait_event_capture=$actual"
}

# run_pgbench BIN PORT EXTRA  -> sets globals R_TPS R_LAT R_CALLS R_IDLE
run_pgbench() {
    local bin="$1" port="$2" extra="$3"
    # reset pg_stat_statements before the run (cross-check call count after)
    su - pgbench -c "$bin/psql -p $port -tAc 'SELECT pg_stat_statements_reset();' postgres" >/dev/null 2>&1
    # sample CPU during the run (vmstat 'id' = idle%); confirms saturation (W1/W2) vs headroom (W3)
    ( vmstat 30 7 >/tmp/last_vmstat.txt 2>&1 ) & local vmpid=$!
    su - pgbench -c "$bin/pgbench -p $port -S -c 16 -j 16 -T $DUR -P 15 -r -n $extra postgres 2>&1" | tee /tmp/last_pgbench.txt
    kill $vmpid 2>/dev/null
    R_TPS=$(grep -oP 'tps = \K[0-9.]+' /tmp/last_pgbench.txt | head -1)
    R_LAT=$(grep -oP 'latency average = \K[0-9.]+' /tmp/last_pgbench.txt | head -1)
    R_CALLS=$(su - pgbench -c "$bin/psql -p $port -tAc 'SELECT sum(calls) FROM pg_stat_statements;' postgres" 2>/dev/null | tr -d '[:space:]')
    R_IDLE=$(awk 'NR>2 && NF>=5 {sum+=$(NF-2); n++} END {if(n>0) printf "%.0f", sum/n}' /tmp/last_vmstat.txt 2>/dev/null)
    log "    cross-check: pgss_calls=$R_CALLS  avg_cpu_idle=${R_IDLE}%"
}

# median of 3 values (middle) ; or 2 -> avg
med3() { python3 -c "import sys; v=sorted(float(x) for x in sys.argv[1:]); print(v[len(v)//2] if len(v)%2 else (v[len(v)//2-1]+v[len(v)//2])/2)" "$@"; }

run_workload() { # NAME SB EXTRA  (EXTRA passed to pgbench, e.g. '-R 1234')  -> sets MED_BL MED_OFF MED_STA MED_TRA (TPS) and LAT_*
    local name="$1" sb="$2" extra="$3"
    local -a BL OFF STA TRA LBL LOF LST LTR
    for r in 1 2 3; do
        log "$name Round $r/3  (SB=$sb ${extra:+$extra})"
        start_baseline "$sb"; run_pgbench $BASELINE_BIN $BPORT "$extra"; BL[$r]=$R_TPS; LBL[$r]=$R_LAT; log "  BL tps=$R_TPS lat=${R_LAT}ms"
        start_wet "$sb" off;   run_pgbench $WET_BIN $WPORT "$extra";    OFF[$r]=$R_TPS; LOF[$r]=$R_LAT; log "  OFF tps=$R_TPS lat=${R_LAT}ms"
        start_wet "$sb" stats; run_pgbench $WET_BIN $WPORT "$extra";    STA[$r]=$R_TPS; LST[$r]=$R_LAT; log "  STA tps=$R_TPS lat=${R_LAT}ms"
        start_wet "$sb" trace; run_pgbench $WET_BIN $WPORT "$extra";    TRA[$r]=$R_TPS; LTR[$r]=$R_LAT; log "  TRA tps=$R_TPS lat=${R_LAT}ms"
    done
    MED_BL=$(med3 ${BL[1]} ${BL[2]} ${BL[3]}); MED_OFF=$(med3 ${OFF[1]} ${OFF[2]} ${OFF[3]})
    MED_STA=$(med3 ${STA[1]} ${STA[2]} ${STA[3]}); MED_TRA=$(med3 ${TRA[1]} ${TRA[2]} ${TRA[3]})
    LATMED_BL=$(med3 ${LBL[1]} ${LBL[2]} ${LBL[3]}); LATMED_OFF=$(med3 ${LOF[1]} ${LOF[2]} ${LOF[3]})
    LATMED_STA=$(med3 ${LST[1]} ${LST[2]} ${LST[3]}); LATMED_TRA=$(med3 ${LTR[1]} ${LTR[2]} ${LTR[3]})
    log "=== $name MEDIANS  TPS:  BL=$MED_BL OFF=$MED_OFF STA=$MED_STA TRA=$MED_TRA"
    log "=== $name MEDIANS  LAT(ms): BL=$LATMED_BL OFF=$LATMED_OFF STA=$LATMED_STA TRA=$LATMED_TRA"
    python3 -c "
bl=$MED_BL
for n,v in [('off',$MED_OFF),('stats',$MED_STA),('trace',$MED_TRA)]:
    print('  %-6s TPS %%vs baseline: %+.2f%%'%(n,(v-bl)/bl*100))
"
}

log "================ REVISED BENCHMARK START ================"
log "Spec: pgbench -S -c 16 -j 16 -T 180 -P 15 -r -n ; 3 rounds ; W1 cached / W2 wait-saturated / W3 = W2 -R 50%"

# ---- W1: cached, scale 100 ----
log "=== Init scale 100 (W1) on both clusters + CREATE EXTENSION pg_stat_statements ==="
start_baseline 8GB
su - pgbench -c "$BASELINE_BIN/psql -p $BPORT -c 'CREATE EXTENSION IF NOT EXISTS pg_stat_statements;' postgres" 2>&1 | tail -1
su - pgbench -c "$BASELINE_BIN/pgbench -p $BPORT -i -s 100 --no-vacuum postgres 2>&1 | tail -2"; stop_all
start_wet 8GB off
su - pgbench -c "$WET_BIN/psql -p $WPORT -c 'CREATE EXTENSION IF NOT EXISTS pg_stat_statements;' postgres" 2>&1 | tail -1
su - pgbench -c "$WET_BIN/pgbench -p $WPORT -i -s 100 --no-vacuum postgres 2>&1 | tail -2"; stop_all
log "=== W1: Read-only CACHED (SB=8GB, scale 100) ==="
run_workload W1 8GB ""

# ---- W2 + W3: wait-saturated, scale 400 ----
log "=== Init scale 400 (W2/W3) on both clusters ==="
start_baseline 256MB; su - pgbench -c "$BASELINE_BIN/pgbench -p $BPORT -i -s 400 --no-vacuum postgres 2>&1 | tail -2"; stop_all
start_wet 256MB off;  su - pgbench -c "$WET_BIN/pgbench -p $WPORT -i -s 400 --no-vacuum postgres 2>&1 | tail -2"; stop_all
log "=== W2: Read-only WAIT-SATURATED (SB=256MB, scale 400) ==="
run_workload W2 256MB ""
W2_BL_TPS=$MED_BL

# wait-event proof on W2 stats
log "=== W2 wait-event PROOF (stats, sampled) ==="
start_wet 256MB stats
su - pgbench -c "$WET_BIN/psql -p $WPORT -c 'SELECT pg_stat_reset_wait_event_timing_all();' postgres" 2>/dev/null
su - pgbench -c "$WET_BIN/pgbench -p $WPORT -S -c 16 -j 16 -T 90 -n --no-vacuum postgres" &
BENCH_PID=$!
sleep 30
log "--- pg_stat_activity sample (active backends by wait) ---"
su - pgbench -c "$WET_BIN/psql -p $WPORT -c \"SELECT coalesce(wait_event_type,'CPU') t, coalesce(wait_event,'running') e, count(*) FROM pg_stat_activity WHERE state='active' AND pid<>pg_backend_pid() GROUP BY 1,2 ORDER BY 3 DESC LIMIT 10;\" postgres" 2>/dev/null
wait $BENCH_PID
log "--- pg_stat_wait_event_timing top events (by count) ---"
su - pgbench -c "$WET_BIN/psql -p $WPORT -c \"SELECT wait_event_type, wait_event, sum(count) total_count, round((sum(total_wait_time_us)/1000.0)::numeric,1) total_wait_ms FROM pg_stat_wait_event_timing GROUP BY 1,2 ORDER BY 3 DESC LIMIT 12;\" postgres" 2>/dev/null
W2_TOTAL=$(su - pgbench -c "$WET_BIN/psql -p $WPORT -tAc 'SELECT sum(count) FROM pg_stat_wait_event_timing;' postgres" 2>/dev/null | tr -d ' \n')
log "W2 total wait-event transitions captured: $W2_TOTAL (over ~90s)"
log "--- pg_stat_statements top queries (the queries that generated the waits) ---"
su - pgbench -c "$WET_BIN/psql -p $WPORT -c \"SELECT calls, round(total_exec_time::numeric,1) total_ms, round(mean_exec_time::numeric,4) mean_ms, left(query,60) query FROM pg_stat_statements ORDER BY calls DESC LIMIT 8;\" postgres" 2>/dev/null
stop_all

# ---- W3: same as W2 but rate-limited to 50% of W2 baseline TPS (latency focus) ----
W3_RATE=$(python3 -c "print(int(0.5*$W2_BL_TPS))")
log "=== W3: LATENCY under 50% capacity (SB=256MB, scale 400, -R $W3_RATE = 50% of W2 baseline TPS $W2_BL_TPS) ==="
run_workload W3 256MB "-R $W3_RATE"

log "================ BENCHMARK COMPLETE ================"
log "SUMMARY (medians):"
log "W1 cached     TPS  BL=$MED_BL"  # note: globals hold last workload; full per-workload medians logged above
log "Done. Full per-workload medians + W2 wait events are in $LOG above. W3 headline = latency (LAT lines)."
rebuild_wet.sh (clean instrumented build + verification gate)
#!/bin/bash
# Rebuild the INSTRUMENTED postgres correctly (clean in-tree build), with verification gate.
set -uo pipefail
echo "[$(date +%H:%M:%S)] === Stopping any in-flight benchmark & clusters ==="
pkill -f run_bench.sh 2>/dev/null
pkill -f 'bin/pgbench' 2>/dev/null
pkill -f 'bin/postgres -D' 2>/dev/null
sleep 3

echo "[$(date +%H:%M:%S)] === Clean rebuild of /opt/pg-wet from wet-series WITH --enable-wait-event-timing ==="
cd /root/pg-src
git checkout -f wet-series
echo "[$(date +%H:%M:%S)] git clean -fdx (remove ALL prior build artifacts incl. ignored .o)..."
git clean -fdx -q
rm -rf /root/build-wet
echo "[$(date +%H:%M:%S)] configure (in-tree)..."
./configure --prefix=/opt/pg-wet --enable-wait-event-timing CC="ccache gcc" CFLAGS="-O2" > /root/wet-configure.log 2>&1
if grep -q -- '--enable-wait-event-timing' config.log; then
  echo "  configure OK: flag present"
else
  echo "  !!! configure did NOT record the flag — aborting"; tail -8 /root/wet-configure.log; exit 1
fi
echo "[$(date +%H:%M:%S)] make -j16 (few minutes)..."
make -j16 > /root/wet-make.log 2>&1 || { echo "MAKE FAILED"; tail -20 /root/wet-make.log; exit 1; }
rm -rf /opt/pg-wet
make install > /root/wet-install.log 2>&1 || { echo "INSTALL FAILED"; tail -20 /root/wet-install.log; exit 1; }
echo "[$(date +%H:%M:%S)] installed. configure line now:"
/opt/pg-wet/bin/pg_config --configure | tr ' ' '\n' | grep -i 'wait\|prefix'

echo "[$(date +%H:%M:%S)] === VERIFICATION GATE: does capture=stats/trace actually stick? ==="
P=/opt/pg-wet/bin
pkill -f 'D /tmp/guctest' 2>/dev/null; sleep 1; rm -rf /tmp/guctest
$P/initdb -D /tmp/guctest >/tmp/guctest_init.log 2>&1
$P/pg_ctl -D /tmp/guctest -o "-p 5999" -l /tmp/guctest.log start >/dev/null 2>&1
sleep 3
STATS=$($P/psql -p 5999 -d postgres -tAc "SET wait_event_capture=stats; SHOW wait_event_capture;" 2>&1 | tr -d '[:space:]')
TRACE=$($P/psql -p 5999 -d postgres -tAc "SET wait_event_capture=trace; SHOW wait_event_capture;" 2>&1 | tr -d '[:space:]')
$P/pg_ctl -D /tmp/guctest stop >/dev/null 2>&1
echo "  SET stats -> '$STATS'   SET trace -> '$TRACE'"
if [ "$STATS" = "stats" ] && [ "$TRACE" = "trace" ]; then
  echo "[$(date +%H:%M:%S)] VERIFIED: instrumented build accepts stats AND trace. Build is valid."
  touch /root/WET_BUILD_VERIFIED
else
  echo "[$(date +%H:%M:%S)] NOT VERIFIED: capture did not stick (stats='$STATS' trace='$TRACE'). Do NOT benchmark."
  rm -f /root/WET_BUILD_VERIFIED
  exit 1
fi
go.sh (re-init wet cluster + gate + run)
#!/bin/bash
# Re-init wet cluster with corrected binary -> verification gate (as pgbench) -> full benchmark
set -uo pipefail
WET=/opt/pg-wet/bin
BASE=/opt/pg-baseline/bin
log(){ echo "[$(date +%H:%M:%S)] $*"; }

log "=== Stopping everything ==="
su - pgbench -c "$WET/pg_ctl -D /data/pg-wet stop -m immediate 2>/dev/null; exit 0" 2>/dev/null || true
su - pgbench -c "$BASE/pg_ctl -D /data/pg-baseline stop -m immediate 2>/dev/null; exit 0" 2>/dev/null || true
pkill -f run_bench.sh 2>/dev/null || true
pkill -u pgbench pgbench 2>/dev/null || true
sleep 3

log "=== Re-initdb /data/pg-wet with the CORRECTED instrumented binary ==="
rm -rf /data/pg-wet
su - pgbench -c "$WET/initdb -D /data/pg-wet" >/root/wet-reinit.log 2>&1 || { echo "REINIT FAILED"; tail -8 /root/wet-reinit.log; exit 1; }
cat >> /data/pg-wet/postgresql.conf <<'EOF'
port = 5441
max_connections = 300
fsync = off
max_wal_size = 8GB
track_activities = on
compute_query_id = on
wait_event_capture = off
EOF
chown pgbench:pgbench /data/pg-wet/postgresql.conf

log "=== VERIFICATION GATE (as pgbench): capture=stats/trace must stick + view must exist ==="
su - pgbench -c "$WET/pg_ctl -D /data/pg-wet -l /tmp/wet_gate.log start"
for i in $(seq 1 25); do su - pgbench -c "$WET/pg_isready -p 5441 -q" && break; sleep 1; done
STATS=$(su - pgbench -c "$WET/psql -p 5441 -d postgres -tAc \"SET wait_event_capture=stats; SELECT current_setting('wait_event_capture');\"" 2>&1 | tail -1 | tr -d '[:space:]')
TRACE=$(su - pgbench -c "$WET/psql -p 5441 -d postgres -tAc \"SET wait_event_capture=trace; SELECT current_setting('wait_event_capture');\"" 2>&1 | tail -1 | tr -d '[:space:]')
VIEW=$(su - pgbench -c "$WET/psql -p 5441 -d postgres -tAc \"SELECT count(*) FROM pg_stat_wait_event_timing;\"" 2>&1 | tr -d '[:space:]')
su - pgbench -c "$WET/pg_ctl -D /data/pg-wet stop -m fast" 2>/dev/null || true
log "RESULT: SET stats -> '$STATS' ; SET trace -> '$TRACE' ; pg_stat_wait_event_timing rows -> '$VIEW'"
if [ "$STATS" = "stats" ] && [ "$TRACE" = "trace" ]; then
  log "✅ VERIFIED — build is genuinely instrumented (GUC accepts stats/trace; instrumentation view present). Proceeding."
  touch /root/WET_BUILD_VERIFIED
else
  log "❌ NOT VERIFIED (stats='$STATS' trace='$TRACE'). Aborting — will NOT produce misleading numbers."
  exit 1
fi

log "=== Launching full interleaved benchmark (baseline / off / stats / trace) ==="
bash /root/run_bench.sh

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions