252 commits
bddcfe5
docs: address review questions on Hetzner CI plan
paddymul Mar 1, 2026
62aeddb
feat: implement Hetzner self-hosted CI
paddymul Mar 1, 2026
aea3201
feat: add ci/hetzner/lib helpers and fix .gitignore
paddymul Mar 1, 2026
282e357
fix: cloud-init owner/yaml bugs; status.sh no-op without GITHUB_TOKEN
paddymul Mar 1, 2026
5ee2550
fix: Dockerfile needs build-essential for cffi/cryptography; cloud-in…
paddymul Mar 1, 2026
76bf7e0
fix: git safe.directory in container, --allow-root for jupyter, remov…
paddymul Mar 1, 2026
58531f9
fix: bake CI runner scripts into image to survive git checkout of any…
paddymul Mar 1, 2026
34edec3
fix: remove uv sync from job_lint_python to prevent venv race condition
paddymul Mar 1, 2026
5e95917
fix: bake jupyter_lab_config.py to allow root in Docker container
paddymul Mar 1, 2026
acf9176
fix: deselect Docker-incompatible tests; skip Python 3.14 alpha
paddymul Mar 1, 2026
a373b9b
docs: update hetzner-ci-bringup with final clean run results
paddymul Mar 2, 2026
f05e4d7
perf: parallelize Phase 3 Python tests (3.11/3.12/3.14)
paddymul Mar 2, 2026
1773af1
docs: add warm cache and parallel Phase 3 timing results
paddymul Mar 2, 2026
66f8038
perf: parallelize Phase 5 playwright tests
paddymul Mar 2, 2026
b1ec8cd
fix: resolve parallel Phase 5 venv races
paddymul Mar 2, 2026
e130e1f
feat: add DAG-based CI orchestrator and research docs
paddymul Mar 2, 2026
185f3a7
fix: apply venv and HTML report isolation to run-ci-dag.sh
paddymul Mar 2, 2026
e9b102a
feat: add parallel Playwright jupyter test runner
paddymul Mar 2, 2026
29e8752
feat: use parallel jupyter playwright runner in CI
paddymul Mar 2, 2026
50bc763
fix: bake test_playwright_jupyter_parallel.sh into /opt/ci-runner/
paddymul Mar 2, 2026
1ed1e16
fix: preserve exit code through rm -rf in job_playwright_jupyter
paddymul Mar 2, 2026
06b65f9
fix: respect ROOT_DIR env var in parallel jupyter script
paddymul Mar 2, 2026
4064949
perf: split build-js from test-js in DAG for earlier wheel build
paddymul Mar 2, 2026
6acc85d
fix: move pw-marimo and pw-wasm-marimo to wheel-dependent jobs
paddymul Mar 2, 2026
9aa6a3e
fix: add XXXXXX to mktemp template for Linux compatibility
paddymul Mar 2, 2026
b27e42b
feat: log CI runner version at start of each run
paddymul Mar 2, 2026
73ad85a
docs: add implementation-notes.md with CI lessons learned
paddymul Mar 2, 2026
e14f2a0
fix: guard arithmetic ops with || true to prevent set -e early exit
paddymul Mar 2, 2026
8462cab
perf: lower PARALLEL jupyter to 3 (was 9, caused widget comm failures)
paddymul Mar 2, 2026
e49fe9b
perf: split Phase 5 — jupyter runs after other playwright jobs (5a→5b)
paddymul Mar 2, 2026
37fb477
docs: update implementation-notes with parallel jupyter bugs
paddymul Mar 2, 2026
6a8234b
fix: shutdown kernels between batches in parallel jupyter runner
paddymul Mar 2, 2026
ad02d52
fix: use explicit batching instead of sliding window for jupyter para…
paddymul Mar 2, 2026
b9cf8d3
fix: || true on shutdown_kernels pipelines, fix BATCH_PIDS re-declare
paddymul Mar 2, 2026
b23449a
fix: set PARALLEL=1 for jupyter notebooks (800ms wait too short for >1)
paddymul Mar 2, 2026
16ac102
feat: add stress-test.sh with safe, failing, and older commit sets
paddymul Mar 2, 2026
1759612
fix: trust notebooks and fix shutdown_kernels JSON parsing in paralle…
paddymul Mar 2, 2026
517e54a
docs: update implementation notes with run 26 results and new bugs found
paddymul Mar 2, 2026
60a22aa
feat: add stress-test.sh and download-ci-logs.sh for CI validation
paddymul Mar 2, 2026
affe14a
fix: clean stale kernel runtime files + enable PARALLEL=3 for Phase 5b
paddymul Mar 2, 2026
72c463c
docs: record stale runtime kernel files bug + PARALLEL=3 update
paddymul Mar 2, 2026
27297dd
revert: PARALLEL=3 → 1 for playwright-jupyter
paddymul Mar 2, 2026
c8b3ff9
docs: record PARALLEL=3 failure + batch-1 timing root cause
paddymul Mar 2, 2026
2ae4c00
ci: ignore mp_timeout_decorator_test.py entirely in Docker
paddymul Mar 2, 2026
f9dc8c1
ci: add stagger runner and stress-test --stagger flag
paddymul Mar 2, 2026
995e61d
ci: add serial runner to measure uncontended job timings
paddymul Mar 2, 2026
9635008
ci: harden playwright-jupyter batch-1 and fix mcp-wheel exit code
paddymul Mar 2, 2026
4871500
ci: fix warmup pipefail exit triggering cleanup trap
paddymul Mar 2, 2026
9a6122b
docs: update implementation notes with purpose, timings, and next steps
paddymul Mar 2, 2026
65d49b2
ci: replace static waitForTimeout with proper waitFor in Jupyter spec…
paddymul Mar 2, 2026
5e86490
ci: try PARALLEL=2 for playwright-jupyter (PARALLEL=3 causes WebSocke…
paddymul Mar 2, 2026
e8c429c
ci: increase cell execution timeout to 20s for concurrent kernel startup
paddymul Mar 2, 2026
55707c1
ci: revert to PARALLEL=1 for playwright-jupyter — PARALLEL>1 causes Z…
paddymul Mar 2, 2026
f46971d
ci: isolated JupyterLab server per parallel slot — eliminates ZMQ con…
paddymul Mar 2, 2026
d6bc031
fix: add sequential server startup to playwright-jupyter parallel runner
paddymul Mar 2, 2026
68fd933
fix: use DEFAULT_TIMEOUT in infinite-scroll-transcript fallback cell …
paddymul Mar 2, 2026
2c3d5a7
fix: increase warmup kernel polling to 60 iterations (30s timeout)
paddymul Mar 2, 2026
bf904a8
feat: add --phase=5b option and wheel cache to run-ci.sh
paddymul Mar 2, 2026
92a99aa
fix: replace warmup kernels with sleep 20s in parallel jupyter runner
paddymul Mar 2, 2026
5487f99
fix: pre-warm Python bytecaches + fix infinite-scroll test 2 scroll s…
paddymul Mar 2, 2026
f08937c
fix: extract static files from wheel in --phase=5b mode
paddymul Mar 2, 2026
6220264
fix: stagger batch-1 by 5s + increase CELL_EXEC_TIMEOUT to 45s + 2s s…
paddymul Mar 2, 2026
de8e37d
fix: test timeout 60s, scroll-to-top before cell checks, waitFor row-…
paddymul Mar 2, 2026
1b549cc
fix: textContent cell check, wait-for-detach in test2, 10s batch-1 st…
paddymul Mar 2, 2026
1014008
fix: outputArea textContent check, test2 use existing output + waitFo…
paddymul Mar 2, 2026
6f27b83
fix: rename duplicate outputText var, scope row-index=0 to data grid …
paddymul Mar 2, 2026
18faae2
exp19: revert test assertions to original, slack timing (30s sleep, 2…
paddymul Mar 2, 2026
a719762
exp20: CELL_EXEC_TIMEOUT 60s, waitForAgGrid state:visible, run_one 90s
paddymul Mar 2, 2026
14522d6
feat: DAG-based CI execution, replace 5-phase structure
paddymul Mar 2, 2026
f47af9e
fix: ignore all multiprocessing_executor_test.py in Docker CI
paddymul Mar 2, 2026
8c9eb16
fix: deselect test_server_starts_and_responds in Docker CI
paddymul Mar 2, 2026
f47ba16
fix: ignore test_mcp_server_integration.py in unit test runs
paddymul Mar 2, 2026
43af33f
fix: wait for marimo/wasm playwright before starting pw-jupyter
paddymul Mar 2, 2026
17db864
docs: add exp 8-9 and wheel cache infrastructure to experiment log
paddymul Mar 2, 2026
3865fd9
fix: add marimo server warmup and increase playwright-marimo timeouts
paddymul Mar 3, 2026
e35565a
fix: move playwright-marimo after build-wheel in DAG
paddymul Mar 3, 2026
e278032
fix: add --wheel-from=SHA option for iterating on test code
paddymul Mar 3, 2026
3abdcd7
fix: remove 'local' outside function in run-ci.sh
paddymul Mar 3, 2026
a25307e
fix: replace 30s sleep with active kernel warmup in jupyter runner
paddymul Mar 3, 2026
e3cb3fd
feat: PARALLEL=9 for playwright-jupyter with concurrent kernel warmup
paddymul Mar 3, 2026
a1594bd
fix: WebSocket-based kernel warmup + remove batch-1 stagger
paddymul Mar 3, 2026
7e5754a
fix: unique Playwright output dir per parallel notebook slot
paddymul Mar 3, 2026
a869d12
feat: pytest-xdist for parallel unit tests + fix infinite scroll time…
paddymul Mar 3, 2026
2207d1e
fix: reduce infinite scroll notebook to 500 rows, bump test timeout
paddymul Mar 3, 2026
6c1c743
fix: PARALLEL=8 so infinite_scroll runs alone in batch 2
paddymul Mar 3, 2026
c2a16ec
fix: bump CELL_EXEC_TIMEOUT to 120s and test timeout to 180s
paddymul Mar 3, 2026
4cd4ccb
fix: robust cell focus before Shift+Enter in Playwright specs
paddymul Mar 3, 2026
fac3cb5
fix: wait for kernel idle before Shift+Enter in Playwright specs
paddymul Mar 3, 2026
4cd68b7
exp: try PARALLEL=4 — 8 is too flaky under full DAG contention
paddymul Mar 3, 2026
61e9947
fix: retry Shift+Enter every 10s until cell output appears
paddymul Mar 3, 2026
dc360ac
fix: bump DEFAULT_TIMEOUT and NAVIGATION_TIMEOUT to 30s
paddymul Mar 3, 2026
35e0fc8
fix: use dispatchEvent for cell execution retry, avoid click() visibi…
paddymul Mar 3, 2026
7770774
fix: wait for ALL jobs before playwright-jupyter + add Playwright ret…
paddymul Mar 3, 2026
92ca618
exp: try PARALLEL=3 for more reliable playwright-jupyter
paddymul Mar 3, 2026
6a11b71
fix: wait for kernel idle before cell execution in Playwright specs
paddymul Mar 3, 2026
8695488
fix: reduce kernel idle wait to 15s, increase Playwright retries to 2
paddymul Mar 3, 2026
c2f7cbc
docs: update experiment log with results + plan non-jupyter experiments
paddymul Mar 3, 2026
cb585c2
docs: add exp 14e final results — 4/5 pass, 80% ceiling at PARALLEL=4
paddymul Mar 3, 2026
0fc5fb7
docs: add exp 21-22 — jupyterapp internal state query for kernel read…
paddymul Mar 3, 2026
5994612
fix: replace waitForTimeout with polling + jupyterapp kernel check
paddymul Mar 3, 2026
7f7621b
docs: add exp 15-21 results — 100% jupyter pass rate, 2m59s median
paddymul Mar 3, 2026
200bac6
feat: CI job queue, JS build cache, synthetic merge support
paddymul Mar 3, 2026
e7fff5b
fix: mount js-cache volume for build cache persistence
paddymul Mar 3, 2026
5445eb7
fix: ci-queue mkdir log dir before exec, avoid double log output
paddymul Mar 3, 2026
f30da68
fix: ci-queue worker double log lines — nohup to /dev/null
paddymul Mar 3, 2026
5c1e58f
fix: full_build.sh check index.es.js not index.js for skip logic
paddymul Mar 3, 2026
ec8956d
docs: update experiment log — exp 23/24, CPU data, projected impacts
paddymul Mar 3, 2026
60618ce
feat: implement exp 18/19/20 — parallel smoke, relaxed gate, marimo w…
paddymul Mar 3, 2026
d020744
feat: exp 29 — marimo assertion robustness from flakiness research
paddymul Mar 3, 2026
8fcbe9a
docs: update experiment doc with exp 18/19/20/23/24 results
paddymul Mar 3, 2026
137cc92
docs: fix exp 26 description — wheel bundles JS, cache key needs both
paddymul Mar 3, 2026
172158b
feat: exp 28 — early kernel warmup overlaps with Wave 0
paddymul Mar 3, 2026
c53967c
docs: update experiment log with exp 28 results
paddymul Mar 3, 2026
d24bbc4
docs: add CPU monitoring requirement for CI experiments
paddymul Mar 3, 2026
d369894
feat: exp 30 — remove heavyweight PW gate, add CPU monitoring
paddymul Mar 3, 2026
526a120
fix: use vmstat instead of mpstat for CPU monitoring
paddymul Mar 3, 2026
5970802
docs: add exp 30 results — remove heavyweight gate, 1m43s total
paddymul Mar 3, 2026
6c82f89
feat: exp 31 — PARALLEL=9 for pw-jupyter (all notebooks at once)
paddymul Mar 3, 2026
b2398d5
feat: exp 32 — revert P=4, move wasm-marimo after wheel, defer pytest
paddymul Mar 3, 2026
3340ce9
docs: add Exp 31/32 results — P=9 abandoned, lean Wave 0 +8s vs Exp 30
paddymul Mar 3, 2026
5279196
feat: exp 33 — staggered sub-waves, PARALLEL=6, fine-grain CPU (0.1s)
paddymul Mar 3, 2026
8478735
fix: exp 33 — batch 2 re-warmup, 120s pw-jupyter timeout, 210s CI wat…
paddymul Mar 3, 2026
076f40f
fix: remove `local` outside function in batch re-warmup loop
paddymul Mar 3, 2026
0e98e13
feat: exp 33 — try PARALLEL=9 for pw-jupyter (all 9 notebooks at once)
paddymul Mar 3, 2026
75a81b2
feat: exp 33 — stagger PARALLEL=9 Chromium launches by 1s
paddymul Mar 3, 2026
b566296
feat: exp 33 — try 2s stagger for PARALLEL=9
paddymul Mar 3, 2026
553bea0
feat: exp 33 — BASE_PORT=8900 for PARALLEL=9 (test port theory)
paddymul Mar 3, 2026
9dcc5e0
fix: add pre-run cleanup to run-ci.sh — kill stale processes, rm temp…
paddymul Mar 3, 2026
08724ad
docs: exp 33 results — P=6 batched wins, P=9 conclusively dead
paddymul Mar 3, 2026
630cf60
feat: exp 34+36 — SKIP_INSTALL, nice priority, auto-retry server asse…
paddymul Mar 3, 2026
da3a7ad
fix: use renice instead of nice for shell functions in run-ci.sh
paddymul Mar 3, 2026
2ba10e7
fix: don't renice jupyter-warmup (servers persist), SKIP_INSTALL in p…
paddymul Mar 3, 2026
5996d8c
docs: exp 34+36 results — pw-server flake fixed, pw-jupyter zombie re…
paddymul Mar 3, 2026
382c9e6
docs: split experiments doc into current state + historical archive
paddymul Mar 3, 2026
20fb931
fix: add init:true to docker-compose for zombie reaping (Exp 37)
paddymul Mar 3, 2026
46c165c
fix: use tini ENTRYPOINT instead of init:true for zombie reaping
paddymul Mar 3, 2026
54edcab
fix: clean workspace files in pre-run cleanup, PARALLEL=5 (Exp 38)
paddymul Mar 3, 2026
5416e0c
fix: aggressive pre-run cleanup — SIGKILL, port fuser, cache purge
paddymul Mar 3, 2026
8d9c638
fix: bump pw-jupyter timeout 120→180s, watchdog 210→270s
paddymul Mar 3, 2026
ef53834
revert: restore Exp 33 pw-jupyter config (P=6, 120s timeout, 210s wat…
paddymul Mar 3, 2026
fff99fa
fix: revert PARALLEL=6→4 — P=6 no longer reliable on current image
paddymul Mar 3, 2026
9a15704
docs: update experiments — tini validated, P=4 stable, P=6 regressed
paddymul Mar 3, 2026
98d0a64
docs: clarify Exp 26 scope — CI-dev-only edge case, not for real CI
paddymul Mar 3, 2026
c5a0498
docs: add research notes from CI optimization effort
paddymul Mar 3, 2026
4a7fefc
feat: Exp 35 split build-js/test-js + fix lockfile hash persistence
paddymul Mar 3, 2026
f5d2b55
docs: Exp 35+39 validated — build-js split saves 4s, lockfile hashes …
paddymul Mar 3, 2026
2fd049c
docs: update CPU profile with 4a7fefc data, add Exp 35+39 run history
paddymul Mar 3, 2026
2dff214
feat: add run-pw-jupyter.sh — fast pw-jupyter iteration script
paddymul Mar 3, 2026
0e7c66e
fix: bump pw-jupyter timeout 160→240s, parallelize notebook trust
paddymul Mar 3, 2026
87cb918
fix: wait for trust PIDs only, not all background jobs
paddymul Mar 3, 2026
8e4334b
feat: add per-process monitor script and exploration results log
paddymul Mar 3, 2026
54ae375
docs: Exp 1 complete — settle=0 works; fix kernel process capture in …
paddymul Mar 3, 2026
4012167
docs: Exp 2 P2 results — P=5 fails with system idle, not resource con…
paddymul Mar 3, 2026
e6ea620
fix: add --disable-dev-shm-usage to Chromium for Docker P=5+ support
paddymul Mar 3, 2026
338e40e
docs: Exp 2 complete — /dev/shm fix unlocks P=9 (49s, down from 94s)
paddymul Mar 3, 2026
228c7f7
feat: add Exp 4 back-to-back degradation test script
paddymul Mar 3, 2026
f82a1b4
docs: all experiments complete — /dev/shm fix resolves everything
paddymul Mar 4, 2026
176f6f6
feat: integrate /dev/shm fix — bump PARALLEL 4→9, settle 0, add --dis…
paddymul Mar 4, 2026
29b19fa
feat: delay smoke-test-extras, tighten stagger 5→2s, add MCP/server t…
paddymul Mar 4, 2026
1c49a02
feat: bind-mount CI runner scripts — no rebuild needed for script cha…
paddymul Mar 4, 2026
fd85f0a
fix: use awk instead of bc for timing (bc not in container)
paddymul Mar 4, 2026
676161f
docs: update experiments doc with Exp 40-41 results + bind-mount infra
paddymul Mar 4, 2026
c26897f
fix: clean all 9 jupyter ports (8889-8897) in pre-run cleanup
paddymul Mar 4, 2026
37aed6b
feat: remove all stagger delays — launch everything simultaneously
paddymul Mar 4, 2026
6c8590d
fix: restore 2s stagger between wheel-dependent jobs
paddymul Mar 4, 2026
7626c67
fix: cleanup esbuild, pw-results, expand port range to 8889-8897
paddymul Mar 4, 2026
09c6faa
fix: bump CI watchdog 210s → 360s for cold-start tolerance
paddymul Mar 4, 2026
7971896
docs: update CI research with Exp 42 results (64GB server, 2s stagger)
paddymul Mar 4, 2026
60f35f7
perf: start 9 JupyterLab servers in parallel during warmup
paddymul Mar 4, 2026
eca76b9
perf: remove 2s stagger between pw-jupyter Chromium launches
paddymul Mar 4, 2026
ef91d94
fix: use 0.5s stagger for pw-jupyter Chromium launches (0s hangs)
paddymul Mar 4, 2026
233a604
fix: use 1s stagger for pw-jupyter (0.5s still hangs 4/9)
paddymul Mar 4, 2026
a6de142
fix: revert pw-jupyter stagger to 2s (0s/0.5s/1s all fail)
paddymul Mar 4, 2026
6a3f4ba
perf: reduce CI timeout from 6min to 4min for faster iteration
paddymul Mar 4, 2026
28ae719
fix: increase Docker /dev/shm to 2GB for 9 concurrent Chromium instances
paddymul Mar 4, 2026
45824a0
fix: defer heavy jobs until pw-jupyter finishes
paddymul Mar 4, 2026
8fbc46f
exp: try 1.5s stagger for pw-jupyter Chromium launches
paddymul Mar 4, 2026
5cbd74f
fix: revert stagger to 2s — 1.5s fails on back-to-back runs
paddymul Mar 4, 2026
af66c90
exp: parallelize marimo tests (workers:2) + kill stale ipykernel
paddymul Mar 4, 2026
2729686
fix: set marimo workers:2 unconditionally + add CI=true to docker env
paddymul Mar 4, 2026
7e94041
fix: persist pnpm store as Docker volume
paddymul Mar 4, 2026
d77aff2
fix: revert marimo workers to 1 — workers:2 crashes server
paddymul Mar 4, 2026
1432432
fix: add shamefully-hoist for pnpm storybook compatibility
paddymul Mar 4, 2026
33b31ab
fix: move storybook after build-js to fix pnpm install race
paddymul Mar 4, 2026
f0c3b91
docs: record exp 6/7 results + storybook semver fix
paddymul Mar 4, 2026
49b71ca
perf: overlap pw-marimo with pw-jupyter
paddymul Mar 4, 2026
9e67c37
perf: overlap pw-wasm-marimo + pw-server with pw-jupyter
paddymul Mar 4, 2026
233398a
perf: overlap test-python with pw-jupyter
paddymul Mar 4, 2026
634452d
fix: revert test-python overlap — causes pw-jupyter 120s timeout
paddymul Mar 4, 2026
6f19419
docs: record job overlap experiment results
paddymul Mar 4, 2026
aeb76f7
fix: harden cloud-init for fresh box provisioning
paddymul Mar 4, 2026
82c148b
docs: final experiment results + summary
paddymul Mar 4, 2026
45d10c8
fix: defer Playwright overlap on ≤16 vCPU (VX1 pw-jupyter hangs)
paddymul Mar 4, 2026
c566bf1
docs: VX1 Zen 5 experiment results — 16 vCPU insufficient for P=9
paddymul Mar 4, 2026
2e86252
fix: P=5 pw-jupyter + stale process cleanup for VX1 16 vCPU
paddymul Mar 4, 2026
cd51c9e
fix: upgrade jupyter stack + add version capture for CI reproducibility
paddymul Mar 4, 2026
f65e8de
fix: harden b2b cleanup + add container state snapshots
paddymul Mar 4, 2026
0103187
fix: set JUPYTER_PARALLEL=9 to eliminate batch server reuse
paddymul Mar 4, 2026
f33905c
docs: pw-jupyter batch server reuse root cause + fix
paddymul Mar 4, 2026
5b85d83
feat: restore parallel DAG + cross-size validation results
paddymul Mar 4, 2026
61bf303
exp: try 0s stagger in pw-jupyter (P=9 fix eliminates server reuse)
paddymul Mar 4, 2026
93a425d
perf: speed up jupyter warmup — reuse Docker venv + parallel polling
paddymul Mar 4, 2026
e0e640a
docs: update experiment results — 0s stagger + warmup optimization
paddymul Mar 4, 2026
2f44b86
perf: async build-wheel with renice -10 to unblock critical path
paddymul Mar 4, 2026
3a7697e
perf: ramdisk for CI working directory — all build I/O in RAM
paddymul Mar 4, 2026
44da8cb
fix: ramdisk — use tar pipe instead of rsync, fix hardcoded /repo paths
paddymul Mar 4, 2026
ff6f1b3
fix: ramdisk exec permission + pnpm cross-fs hardlink
paddymul Mar 4, 2026
2b14cdf
fix: copy pnpm store to ramdisk so hardlinks work (same fs)
paddymul Mar 4, 2026
4cd41e3
revert: remove in-container ramdisk approach
paddymul Mar 4, 2026
740273a
perf: host-level tmpfs for repo + pnpm store (same filesystem)
paddymul Mar 4, 2026
5243859
docs: tmpfs ramdisk experiment results — not worth it
paddymul Mar 4, 2026
69e46e0
feat: --fast-fail flag to abort CI after build-js or build-wheel failure
paddymul Mar 4, 2026
3528d5f
fix: skip redundant pnpm install in full_build.sh to prevent race wit…
paddymul Mar 4, 2026
e3b4d31
feat: --only/--skip job filters + 3-minute CI timeout
paddymul Mar 4, 2026
1455934
fix: ci_pkill excludes self PID to prevent cleanup suicide
paddymul Mar 4, 2026
efffe5b
docs: mark Exp 54-56 done, update current best config
paddymul Mar 4, 2026
45590ff
docs: Exp 58 stress test results + update stress-test.sh server IP
paddymul Mar 4, 2026
fa5e5a7
feat: enhanced CI filtering, tuning params, research scripts
paddymul Mar 4, 2026
6949f24
feat: populate NEW_COMMITS with 50 synth merge SHAs
paddymul Mar 4, 2026
e1399f8
docs: update experiment status (Exp 57/59/60 with scripts and results)
paddymul Mar 4, 2026
6ddd4c0
docs: Exp 60 results — renice has no effect on 16C
paddymul Mar 4, 2026
4a3a763
fix: stress test b2b failures — xdist + node_modules
paddymul Mar 4, 2026
ae19ed2
fix: wipe package-level node_modules on lockfile change
paddymul Mar 4, 2026
a8dfb1b
fix: export npm_config_store_dir to prevent pnpm store mismatch
paddymul Mar 5, 2026
8f13a52
fix: skip test-mcp-wheel on commits that predate MCP
paddymul Mar 5, 2026
c8787af
fix: mcp skip check on correct file
paddymul Mar 5, 2026
c85d70c
docs: Exp 63 stress test results — 50 new commits, infra fixes validated
paddymul Mar 5, 2026
7f4bb2a
docs: Exp 57/62/64 results — tuning sweep, pytest workers, tsgo/vitest
paddymul Mar 5, 2026
0ed5754
feat: ci-gantt.py — animated GIF Gantt chart of CI job timings
paddymul Mar 5, 2026
6741cab
fix: ci-gantt — high-contrast colors, full labels, gate lines
paddymul Mar 5, 2026
6d60b39
fix: ci-gantt — vertical stacking, SHA:label syntax, aligned x axes
paddymul Mar 5, 2026
fcdc9ec
feat: ci-gantt static JPEG output with compact wide layout
paddymul Mar 5, 2026
0bfc00a
feat: sort gantt rows by actual start time (data-driven wave grouping)
paddymul Mar 5, 2026
804e4b9
feat: red y-axis labels for failing jobs in gantt chart
paddymul Mar 5, 2026
055c1a1
feat: cpu-fine.log append per-run + iowait column; gantt iowait overlay
paddymul Mar 5, 2026
220053c
fix: eliminate sqlite 'database is locked' under parallel pytest
paddymul Mar 5, 2026
698a77e
feat: timing_dependent pytest mark + high-priority split invocation
paddymul Mar 5, 2026
3a49251
fix: kill buckaroo.server process and port 8701 in pre-run cleanup
paddymul Mar 5, 2026
35d8a78
fix: ruff E702 semicolon style in ci-gantt.py
paddymul Mar 5, 2026
71b36ff
perf: run playwright-marimo and playwright-server in parallel (no sta…
paddymul Mar 5, 2026
58c745e
perf: cache mcp-test venv by wheel hash in playwright-server
paddymul Mar 5, 2026
4dc2f0b
perf: run marimo playwright tests with 2 workers (fullyParallel)
paddymul Mar 5, 2026
4fc5ffe
perf: start wheel-only jobs immediately after wheel build, not after …
paddymul Mar 5, 2026
81b9fca
perf: move test-js and test-python-3.11 to t0; defer 3.12+3.14 10s af…
paddymul Mar 5, 2026
f5db2b2
perf: overlap jupyter wheel install with warmup server polling
paddymul Mar 5, 2026
f72cb8a
feat: add --local flag to stress-test.sh for server-side execution
paddymul Mar 5, 2026
458c944
fix: pytest exit code 5 = pass; replace fuser with /proc/net/tcp port…
paddymul Mar 5, 2026
031c787
fix: add port 8765 to cleanup; overlay playwright configs in synth co…
paddymul Mar 5, 2026
669af98
chore: update TEST_SHA to 031c787e for synth commit regeneration
paddymul Mar 5, 2026
10cab69
chore: re-bake safe set synth commits with playwright configs in overlay
paddymul Mar 5, 2026
359 changes: 359 additions & 0 deletions docs/llm/research/CI-performance.md
@@ -0,0 +1,359 @@
# CI Performance Research

**Date:** 2026-03-01
**Pipeline:** `.github/workflows/checks.yml`
**Runner provider:** [Depot](https://depot.dev)

## Summary

Buckaroo's CI runs 22 jobs across two waves on Depot GitHub Actions runners. The pipeline is **I/O-bound** — faster CPUs provide no measurable speedup. The blocking critical path is **~3.5 minutes**. Total cost is **~$0.18/run** on 2-CPU runners.

Compared 4-CPU and 8-CPU Depot runners against the 2-CPU baseline (2 runs per larger tier, 4 test PRs total). Bigger runners cost 1.6–2.7x more with no consistent improvement. Run-to-run variance (about ±20s) exceeds the differences between tiers.

---

## Pipeline Structure

Two waves of parallel jobs:

**Wave 1** (no dependencies, all start immediately):
- LintPython, TestJS, BuildWheel, CheckDocs, StylingScreenshots
- TestPython × 4 versions (3.11, 3.12, 3.13, 3.14)
- TestPythonMaxVersions × 4 versions
- TestPythonWindows

**Wave 2** (`needs: [BuildWheel]`):
- TestStorybook, TestServer, TestJupyterLab, TestMarimo, TestWASMMarimo
- TestMCPWheel, SmokeTestExtras, PublishTestPyPI

---

## Latency: Commit to Code Running

Measured via live experiment — created a PR and timed each phase with second-level precision.

```
git commit              T+0s
git push complete       T+6s    (local git + SSH)
GH webhook fires        T+10s   (run created)
Job assigned            T+12s   (queued to runner)
"Set up job" starts     T+30s   (Linux) / T+97s (Windows)
```

| Phase | Latency |
|-------|---------|
| Commit → push complete | 6s |
| Push → GitHub run created | 4s |
| Run created → job assigned | 2s |
| **Depot Linux provisioning** | **18s** |
| **Depot Windows provisioning** | **85s** |

**GitHub adds ~6s.** The rest is Depot runner boot time.
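
As a quick cross-check of the table, the per-phase latencies can be derived from the timeline marks. This is only a sketch: the mark names and the `phase_latencies` helper are illustrative, and the numbers are copied from the single measurement above.

```python
# Timeline marks in seconds after `git commit`, copied from the measurement above.
marks = [
    ("git commit", 0),
    ("git push complete", 6),
    ("GitHub run created", 10),
    ("job assigned", 12),
    ("Linux 'Set up job' starts", 30),
]
WINDOWS_SETUP_START = 97  # T+97s for the Windows runner


def phase_latencies(marks):
    """Return (from, to, seconds) for each consecutive pair of marks."""
    return [(a, b, tb - ta) for (a, ta), (b, tb) in zip(marks, marks[1:])]


phases = phase_latencies(marks)
for start, end, secs in phases:
    print(f"{start} -> {end}: {secs}s")

# GitHub's share: push complete -> run created -> job assigned.
github_overhead = phases[1][2] + phases[2][2]
# Provisioning is measured from job assignment to "Set up job".
linux_provisioning = phases[3][2]
windows_provisioning = WINDOWS_SETUP_START - marks[3][1]
```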

### BuildWheel → Wave 2 Gap

Measured across 5 runs:

| Phase | Time |
|-------|------|
| BuildWheel completes → jobs queued | 2–16s |
| Jobs queued → "Set up job" starts | 17–25s |
| **Total dead time on critical path** | **20–35s** |

This provisioning gap is paid at every wave transition and is outside our control.

---

## Runner Tier Comparison

### Methodology

Created test branches changing `depot-ubuntu-latest` → `depot-ubuntu-latest-{4,8}` and `depot-windows-2025` → `depot-windows-2025-{4,8}`. Ran each tier twice via separate PRs.

### Per-Job Results (format: run 1 / run 2)

| Job | 2-CPU (baseline) | 4-CPU | 8-CPU |
|-----|---------|-------|-------|
| Python / Lint | 0:32 | 0:31 / 0:32 | 0:33 / 0:29 |
| JS / Build + Test | 0:53 | 0:51 / 0:49 | 0:50 / 0:48 |
| Build JS + Python Wheel | 0:59 | 1:00 / 0:52 | 0:50 / 1:36 |
| Docs / Build + Check Links | 1:05 | 1:05 / 1:00 | 0:54 / 1:00 |
| Python / Test (3.11) | 1:45 | 1:43 / 2:01 | 1:40 / 1:41 |
| Python / Test (3.12) | 1:44 | 1:42 / 1:49 | 1:41 / 1:44 |
| Python / Test (3.13) | 1:39 | 1:44 / 1:36 | 1:36 / 2:00 |
| Python / Test (3.14) | 1:35 | 1:34 / 1:52 | 1:33 / 1:34 |
| MaxVer (3.11) | 1:41 | 1:40 / 1:37 | 1:40 / 1:39 |
| MaxVer (3.12) | 1:41 | 1:41 / 1:39 | 1:40 / 2:03 |
| MaxVer (3.13) | 1:41 | 1:41 / 1:38 | 1:39 / 1:37 |
| MaxVer (3.14) | 1:37 | 1:32 / 1:42 | 1:34 / 1:34 |
| Smoke / Optional Extras | 0:47 | 0:47 / 0:47 | 0:45 / 0:43 |
| MCP / Integration | 0:48 | 0:48 / 1:01 | 0:44 / 0:44 |
| Marimo Playwright | 1:30 | 1:28 / 1:37 | 1:22 / 1:24 |
| WASM Marimo Playwright | 1:40 | 1:05 / 1:10 | 1:16 / 1:12 |
| Server Playwright | 2:05 | 1:35 / 1:38 | 1:34 / 1:37 |
| Storybook Playwright | 1:53 | 1:49 / 1:45 | 2:08 / 2:14 |
| JupyterLab Playwright | 2:03 | 2:08 / 2:01 | 2:34 / 2:18 |
| Windows | 8:02 | 8:20 / 7:14 | 7:54 / 7:24 |

### Observations

- **No consistent speedup.** Variance between runs of the same tier is larger than differences between tiers.
- **Some jobs slower on bigger runners.** Storybook Playwright: 1:53 (2-CPU) → 2:08/2:14 (8-CPU). JupyterLab Playwright: 2:03 → 2:34/2:18.
- **Windows unaffected.** 8:02 → 7:14–8:20 range across all tiers.
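
One way to make the variance observation concrete is to compare, per job, the spread between two runs of the same tier with the shift of that tier's mean from the 2-CPU baseline. The sketch below uses three rows copied from the table; the `summarize` helper and the choice of rows are illustrative, not part of the original analysis.

```python
def to_seconds(mmss):
    """Convert an 'm:ss' string from the results table to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# job -> (2-CPU baseline, (4-CPU run 1, run 2), (8-CPU run 1, run 2))
jobs = {
    "Build JS + Python Wheel": ("0:59", ("1:00", "0:52"), ("0:50", "1:36")),
    "Python / Test (3.11)": ("1:45", ("1:43", "2:01"), ("1:40", "1:41")),
    "Storybook Playwright": ("1:53", ("1:49", "1:45"), ("2:08", "2:14")),
}


def summarize(baseline, runs):
    """Return (spread between the two runs, mean shift vs baseline) in seconds."""
    first, second = (to_seconds(r) for r in runs)
    spread = abs(first - second)
    delta = (first + second) / 2 - to_seconds(baseline)
    return spread, delta


report = {
    name: {"4-CPU": summarize(base, four), "8-CPU": summarize(base, eight)}
    for name, (base, four, eight) in jobs.items()
}

for name, tiers in report.items():
    for tier, (spread, delta) in tiers.items():
        print(f"{name:26s} {tier}: spread {spread:3d}s, mean shift {delta:+.1f}s")
```

With only two runs per tier this is a rough indicator, not a statistical test.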

### Cost

| Tier | Linux $/min | Windows $/min | Cost/run | vs 2-CPU |
|------|-------------|---------------|----------|----------|
| 2-CPU | $0.004 | $0.008 | **$0.18** | 1x |
| 4-CPU | $0.008 | $0.016 | **$0.28** | 1.6x |
| 8-CPU | $0.016 | $0.032 | **$0.49** | 2.7x |

**Verdict: Stay on 2-CPU.** Paying 1.6–2.7x more for no improvement.
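
The "vs 2-CPU" column follows directly from the cost-per-run column. A minimal sketch of that arithmetic, using only the figures in the table:

```python
# Cost per run in dollars, copied from the table above.
cost_per_run = {"2-CPU": 0.18, "4-CPU": 0.28, "8-CPU": 0.49}

baseline = cost_per_run["2-CPU"]
# Ratio of each tier's cost to the 2-CPU baseline, rounded to one decimal.
cost_ratio = {tier: round(cost / baseline, 1) for tier, cost in cost_per_run.items()}

for tier, ratio in cost_ratio.items():
    print(f"{tier}: {ratio}x")
```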

---

## Where Time Goes Inside Jobs

Step-level analysis from a baseline 2-CPU run.

### Typical Linux Job (Python / Test 3.13 — 1m39s total)

```
Set up job            2s
Checkout              5s
Install uv            3s
Setup js files        1s
Install the project   1s
Run tests            62s   ← actual work
Codecov               4s
Post steps            1s
```

12s of setup overhead before the tests start, 62s of useful work (84% efficient). Codecov upload and post steps add another 5s after the tests.
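
The overhead and efficiency figures can be reproduced from the step list. The sketch below assumes "overhead" means the steps that run before the work step, which is the reading that yields 12s and 84%; the `setup_and_work` helper is illustrative.

```python
# Step durations in seconds, copied from the listing above.
steps = [
    ("Set up job", 2),
    ("Checkout", 5),
    ("Install uv", 3),
    ("Setup js files", 1),
    ("Install the project", 1),
    ("Run tests", 62),
    ("Codecov", 4),
    ("Post steps", 1),
]


def setup_and_work(steps, work_step):
    """Return (seconds spent before the work step, seconds in the work step)."""
    names = [name for name, _ in steps]
    index = names.index(work_step)
    setup = sum(seconds for _, seconds in steps[:index])
    return setup, steps[index][1]


setup, work = setup_and_work(steps, "Run tests")
efficiency = work / (setup + work)
print(f"setup {setup}s, work {work}s, efficiency {efficiency:.0%}")
```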

### BuildWheel (0:59 total)

```
Set up job            2s
Checkout              6s
Install uv            5s
Setup pnpm + Node     3s
Install pnpm deps     2s
Install project       2s
Build JS + wheel     16s   ← actual work
Upload artifacts      1s
```

20s of setup overhead before the build, 16s of useful work. The build itself is fast.

### Windows (8:02 total)

```
Set up job            2s
Checkout             43s   ← 9x slower than Linux
Install uv         3m29s   ← 70x slower than Linux
Setup js files        1s
Install project      27s   ← 27x slower than Linux
Run tests          1m52s   ← actual work
Post steps            3s
```

**4m41s of overhead for 1m52s of tests (28% efficient).** The `Install uv` step alone takes about 3.5 minutes. The job already has `continue-on-error: true`, so it does not block the pipeline.
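
The slowdown multipliers in the listing come from dividing each Windows step by the matching Linux step from the Python / Test 3.13 job. A small sketch of that comparison, with step names shortened for illustration:

```python
# Step durations in seconds, copied from the Linux and Windows listings above.
linux = {"Checkout": 5, "Install uv": 3, "Install project": 1}
windows = {"Checkout": 43, "Install uv": 3 * 60 + 29, "Install project": 27}

# Windows time divided by Linux time for each shared step.
slowdown = {step: windows[step] / linux[step] for step in linux}

for step, factor in slowdown.items():
    print(f"{step}: {factor:.1f}x slower on Windows")
```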

### Playwright Jobs (JupyterLab — 2:03 total, longest wave 2 job)

```
Set up job            2s
Checkout              5s
Install uv            3s
Setup pnpm + Node     7s
Install pnpm deps     2s
Download artifacts    0s
Install project       2s
Cache Playwright      2s
Run tests            77s   ← actual work
Post steps            1s
```

23s overhead, 77s useful work. These tests validate the built wheel — they must depend on BuildWheel.

---

## Critical Path Analysis

```
0:00   Wave 1 starts
0:59   BuildWheel completes (16s actual build + 43s overhead)
1:24   Wave 2 starts running (~25s Depot provisioning gap)
3:27   JupyterLab Playwright completes (longest wave 2 job)
```

**Blocking critical path: ~3:27.** The Windows job (8:02) runs in parallel but is non-blocking (`continue-on-error: true`).
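
The timeline above amounts to a two-stage longest-path calculation. The sketch below reproduces it from the figures in this document; the variable names are illustrative, and the longest blocking wave 1 job is taken to be Python / Test (3.11) at 1:45 from the 2-CPU baseline column.

```python
def fmt(seconds):
    """Format seconds as m:ss."""
    return f"{seconds // 60}:{seconds % 60:02d}"


build_wheel = 59                    # BuildWheel job, 0:59
provisioning_gap = 25               # ~25s Depot gap before wave 2 starts
jupyterlab_playwright = 2 * 60 + 3  # 2:03, the wave 2 job used in the timeline
longest_blocking_wave1 = 1 * 60 + 45  # Python / Test (3.11), 1:45 (Windows is non-blocking)

wave2_start = build_wheel + provisioning_gap
wave2_end = wave2_start + jupyterlab_playwright

# The blocking path is whichever finishes last: wave 1 alone, or BuildWheel -> wave 2.
critical_path = max(longest_blocking_wave1, wave2_end)

print("wave 2 starts at", fmt(wave2_start))
print("critical path   ", fmt(critical_path))
```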

---

## Path-Gated Optimizations (PR-only, merge queue runs full CI)

The key insight: **`merge_group` always runs the full pipeline** (current behavior, no changes). The optimizations below only apply to the `pull_request` event, where fast iteration matters more than exhaustive coverage.

### Approach: Two-tier CI

```
pull_request:   Run reduced CI based on what changed
merge_group:    Run full CI (current behavior, unchanged)
push to main:   Run full CI (current behavior, unchanged)
```

This is safe because nothing merges without passing the merge queue.

### How Often Do PRs Touch JS vs Python vs file_cache?

Analysis of the last 20 merged PRs:

| Area | PRs touching it | % of PRs |
|------|----------------|----------|
| `packages/` (JS) | 9 of 20 | 45% |
| `buckaroo/file_cache/` | 0 of 20 | 0% |
| Python only (no JS) | 11 of 20 | 55% |

### Optimization 1: Skip JS-only jobs when `packages/` unchanged

Buckaroo is an integrated system — Python drives JS rendering, so **Playwright integration tests must always run** regardless of what changed. A Python-only change to styling, stats, or column config can break what renders in the browser.

However, when `packages/` hasn't changed, the **JS unit tests** are redundant — they'd be testing the same JS code that already passed on `main`.

When a PR only touches Python code:
- Skip `TestJS` (0:53) — JS unit tests, no Python involvement
- `BuildWheel` uses cached JS build artifacts from `main` instead of rebuilding the JS (still builds the Python wheel around them)

**What still runs on Python-only PRs (everything else):**
- LintPython, CheckDocs, BuildWheel (with cached JS), all Python test matrix entries
- All Playwright integration tests (Storybook, Server, JupyterLab, Marimo, WASM Marimo)
- SmokeTestExtras, TestMCPWheel, StylingScreenshots
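
The gating rule described above can be sketched as a small decision function. This is a hypothetical illustration of the logic only: `jobs_to_skip` is not part of the workflow, and in practice the decision would live in `checks.yml` rather than in Python.

```python
def jobs_to_skip(event, changed_files):
    """Return the set of job names to skip for this event and change set.

    Only `pull_request` runs are reduced; `merge_group` and pushes to main
    always run the full pipeline.
    """
    if event != "pull_request":
        return set()
    touches_js = any(path.startswith("packages/") for path in changed_files)
    # Python-only PRs skip the JS unit tests; every Playwright job still runs.
    return set() if touches_js else {"TestJS"}


# Example change sets (file names are made up for illustration).
python_only = ["buckaroo/some_module.py", "tests/unit/some_test.py"]
with_js = ["packages/some_package/index.ts", "buckaroo/some_module.py"]

print(jobs_to_skip("pull_request", python_only))  # {'TestJS'}
print(jobs_to_skip("pull_request", with_js))      # set()
print(jobs_to_skip("merge_group", python_only))   # set()
```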

**Impact for the 55% of PRs that are Python-only:**
- Saves 1 job (~$0.004) and a small amount of BuildWheel time
- The main win is `BuildWheel` completing faster: the 16s `Build JS + wheel` step shrinks by the share spent on the JS build, so wave 2 Playwright jobs start up to ~16s sooner
- Critical path drops from ~3:27 to as low as ~3:11

This is a modest win. The real value is correctness: by caching known-good JS artifacts, Python-only PRs are tested against the exact JS that's on `main`, not a redundant rebuild of the same source.

### Optimization 2: Skip file_cache tests when `buckaroo/file_cache/` unchanged

The `tests/unit/file_cache/` suite is **74% of total Python test time**:

| Test group | Tests | Time | % of total |
|---|---|---|---|
| `tests/unit/file_cache/` | 51 | **42.8s** | **74%** |
| Everything else | 570 | **14.9s** | **26%** |

This is because `mp_timeout` tests spawn real subprocesses with real timeouts (0.8–1.0s each, some at 3×). Each test that exercises a timeout path waits for the actual timeout to expire.
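
To illustrate why such tests cannot run faster than their timeouts, here is a minimal sketch using the standard library's `subprocess` timeout. It is not the project's `mp_timeout` implementation; the helper name and the 0.8s value are chosen only to mirror the range quoted above.

```python
import subprocess
import sys
import time


def run_with_timeout(timeout_seconds):
    """Start a child that sleeps far longer than the timeout and wait for the timeout to fire.

    Returns (timed_out, elapsed_seconds).
    """
    start = time.monotonic()
    try:
        subprocess.run(
            [sys.executable, "-c", "import time; time.sleep(30)"],
            timeout=timeout_seconds,
            check=False,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        timed_out = True
    return timed_out, time.monotonic() - start


timed_out, elapsed = run_with_timeout(0.8)
print(f"timed out: {timed_out}, wall time: {elapsed:.2f}s")
```

Each test on a timeout path pays at least the configured timeout in wall time, plus process startup, so 51 such tests add up quickly.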

In the last 20 merged PRs, **zero** touched `buckaroo/file_cache/`. When that directory does change, it must be tested thoroughly. But running 43s of subprocess timeout tests on every PR that only changes a formatter or stat function is wasted time.

**Mechanism:** Use `dorny/paths-filter` to detect changes to `buckaroo/file_cache/**` or `tests/unit/file_cache/**`. If unchanged:
- Add `-m "not file_cache"` to the pytest invocation (requires adding a `file_cache` marker to the tests)
- Or simpler: `--ignore=tests/unit/file_cache`
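
As a rough sketch of the simpler `--ignore` variant, the logic below picks the extra pytest arguments from a list of changed paths. It stands in for what `dorny/paths-filter` would decide inside the workflow; the function name `extra_pytest_args` is hypothetical, while the two path prefixes and the `--ignore` flag come from the mechanism described above.

```python
from typing import Iterable, List

# The two path prefixes named in the mechanism above.
FILE_CACHE_PATHS = ("buckaroo/file_cache/", "tests/unit/file_cache/")


def extra_pytest_args(changed_files: Iterable[str]) -> List[str]:
    """Return extra pytest arguments for this set of changed paths.

    Hypothetical helper: in CI the changed/unchanged decision would come
    from the path filter's output rather than a Python list.
    """
    touched = any(path.startswith(FILE_CACHE_PATHS) for path in changed_files)
    # Run the full suite when file_cache changed; otherwise skip its tests.
    return [] if touched else ["--ignore=tests/unit/file_cache"]


print(extra_pytest_args(["buckaroo/formatters.py"]))
print(extra_pytest_args(["buckaroo/file_cache/base.py"]))
```

The `--ignore` form needs no changes to the tests themselves, whereas the marker form requires tagging every test in the suite first.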

**Impact:**
- Python test jobs drop from ~62s to ~15s actual test time
- Total job time drops from ~1:40 to ~0:30 per matrix entry
- 8 matrix entries × ~70s saved = ~9.3 minutes of job-time saved
- At $0.004/min = ~$0.04 saved per run
- **Critical path drops by ~47s** (Python tests are no longer on the critical path at all — BuildWheel→Playwright becomes the bottleneck again, but only when JS changes)
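
The job-time and cost figures above follow from simple arithmetic. A minimal check, using only the numbers quoted in the list (8 matrix entries, ~70s saved each, $0.004/min):

```python
MATRIX_ENTRIES = 8            # Python test matrix size
SECONDS_SAVED_PER_ENTRY = 70  # ~1:40 -> ~0:30 per matrix entry
RATE_PER_MINUTE = 0.004       # 2-CPU Linux runner rate, in dollars

minutes_saved = MATRIX_ENTRIES * SECONDS_SAVED_PER_ENTRY / 60
dollars_saved = minutes_saved * RATE_PER_MINUTE

print(f"{minutes_saved:.1f} min, ${dollars_saved:.3f} per run")
# prints "9.3 min, $0.037 per run"
```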

### Combined Impact

For the **55% of PRs that are Python-only and don't touch file_cache** (the common case):

| | Current | Optimized | Saved |
|--|---------|-----------|-------|
| Jobs run | 22 | 21 | 1 fewer |
| Python test time | ~62s | ~15s | **~47s** |
| Critical path | ~3:27 | ~2:40 | **~47s** |
| Cost/run | ~$0.18 | ~$0.14 | ~$0.04 |

The critical path improvement comes from file_cache skipping — Python tests drop from ~1:40 to ~0:30 per job, so they're no longer close to the Playwright critical path. The JS artifact caching shaves ~16s off BuildWheel, letting wave 2 start slightly sooner.

For the **45% of PRs that touch JS but not file_cache**:

| | Current | With file_cache skip | Saved |
|--|---------|---------------------|-------|
| Jobs run | 22 | 22 | 0 |
| Critical path | ~3:27 | ~2:40 | **~47s** |
| Cost/run | ~$0.18 | ~$0.14 | ~$0.04 |

The merge queue always runs the full 22-job pipeline regardless.

---

## Other Optimization Opportunities

### Move Windows to nightly schedule

| | Current | Proposed |
|--|---------|----------|
| Trigger | Every PR | `schedule` cron + push to main |
| Savings | — | $0.06/run |
| Risk | None | Late detection of Windows-specific bugs |

The Windows job is already marked `continue-on-error: true`, so it cannot block merges. Running it on every PR costs $0.06 and adds 8 minutes of wall-clock noise for a job that, by definition, can't fail the build.

### Reduce Python matrix from 8 → 3 jobs

Currently 4 Python versions × 2 dep strategies = 8 jobs. Proposed:
- Normal deps: 3.11 + 3.14 (oldest + newest)
- Max versions: 3.14 only

Middle versions (3.12, 3.13) rarely catch issues that 3.11 + 3.14 don't. The reduction saves ~$0.03/run and leaves 5 fewer runners to provision. (It could also apply to PRs only, with the merge queue running the full matrix.)
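
To make the job counts concrete, the sketch below enumerates the current matrix and the proposed subset. The version list follows the text (3.11 through 3.14); the strategy labels `"normal"` and `"max"` are shorthand chosen here, not necessarily the names used in the workflow file.

```python
from itertools import product

versions = ["3.11", "3.12", "3.13", "3.14"]
strategies = ["normal", "max"]  # shorthand labels, assumed for illustration

# Current: every version crossed with every dependency strategy.
current = list(product(versions, strategies))

# Proposed: oldest + newest on normal deps, newest only on max versions.
proposed = [("3.11", "normal"), ("3.14", "normal"), ("3.14", "max")]

dropped = [entry for entry in current if entry not in proposed]

print(len(current), len(proposed), len(dropped))
# prints "8 3 5"
```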

### Path-filter Styling Screenshots

Run the Styling Screenshots job only when a PR touches styling-related files (`styling*.py`, `Styler.tsx`, etc.). Most PRs don't touch styling code, and when the job does run it takes 2:10 (two Storybook cold starts).

### Merge small jobs to reduce provisioning overhead

Each job pays ~20s Depot provisioning + ~15s setup. Candidates:
- **MCP + Smoke** → one job (both need just uv + wheel, 48s + 47s actual)
- **Marimo + WASM Marimo** → one job (identical 23s setup each)

No wall-clock improvement (they run in parallel), but reduces total job-minutes and Depot costs.

### Drop codecov from 3 of 4 Python test entries

Only one coverage report is needed. Dropping the other three uploads saves ~12s total.

### What won't help

- **Faster runners**: the 4-run experiment showed they don't help, because the workload is I/O-bound.
- **Removing BuildWheel dependency from Playwright**: those tests validate that the built wheel works as shipped. That's the point.

---

## Depot Pricing Reference

| Plan | Price | Included | Overage |
|------|-------|----------|---------|
| Developer | $20/mo | 2,000 min | — |
| Startup | $200/mo | 20,000 min | $0.004/min |
| Business | Custom | Custom | Custom |

Runner rates (per minute, billed per second, no minimum):

| Size | Linux | Windows |
|------|-------|---------|
| 2 CPU / 8 GB | $0.004 | $0.008 |
| 4 CPU / 16 GB | $0.008 | $0.016 |
| 8 CPU / 32 GB | $0.016 | $0.032 |
| 16 CPU / 64 GB | $0.032 | $0.064 |

### Monthly cost at current usage (2-CPU)

| Runs/month | Total minutes | Cost |
|------------|--------------|------|
| 50 | ~1,900 | ~$9 |
| 100 | ~3,800 | ~$18 |
| 200 | ~7,600 | ~$36 |

Fits within the Developer plan's 2,000 included minutes at ~50 runs/month. At 100 or more runs/month, usage exceeds the included minutes.
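
A quick way to check plan fit for other run counts is sketched below. The ~38 minutes per run is derived from the table above (~1,900 minutes / 50 runs), not measured separately, and the helper names are hypothetical.

```python
# Derived from the table above: ~1,900 minutes / 50 runs = ~38 minutes per run.
MINUTES_PER_RUN = 38
DEVELOPER_INCLUDED_MINUTES = 2_000  # from the pricing table


def monthly_minutes(runs: int) -> int:
    """Estimated runner minutes for a month with the given number of runs."""
    return runs * MINUTES_PER_RUN


def fits_developer_plan(runs: int) -> bool:
    """True when the estimate stays within the Developer plan's included minutes."""
    return monthly_minutes(runs) <= DEVELOPER_INCLUDED_MINUTES


# Largest run count that still fits within the included minutes.
max_runs = DEVELOPER_INCLUDED_MINUTES // MINUTES_PER_RUN

for runs in (50, 100, 200):
    print(runs, monthly_minutes(runs), fits_developer_plan(runs))
print("max runs on Developer plan:", max_runs)
```

Under these assumptions the break-even point is about 52 runs per month.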