[2.0] Add nanowm_rollout_speedup + nanowm_rollout_stability (Nano World Models) by wenhaochai · Pull Request #150 · FrontierCS/Frontier-CS

wenhaochai · 2026-06-13T15:26:15Z

Summary

Two paired Frontier-CS 2.0 tasks built from Nano World Models (arXiv:2605.23993,
simchowitzlabpublic/nano-world-model, MIT) — a frozen NanoWM-L/2 CSGO diffusion
world model. Same shape as vllm_llm_serving_optimization (#145): submit a
Python-only patch to source, the judge applies it and measures a continuous
metric with a guardrail, GPU runs on Modal (CPU judge). No changes to the
shared 2.0 adapter/template — each task is self-contained + per-problem Modal.

The two are a dual pair on the same frozen model:

`nanowm_rollout_speedup` — minimize inference latency at iso-quality

Agent patches the diffusion sampling layer to speed up a fixed 50-step,
50-frame CSGO rollout without losing quality.
score = clip(100·log2(geomean_speedup),0,100) · quality_mult (LPIPS-vs-GT
guardrail ≤3%). The CSGO step↔quality frontier is real (seq@2 = +31% LPIPS vs
seq@50, monotonic). Reference (bf16 autocast) = 1.17× at iso-quality (LPIPS +1.7%, within the 3% guardrail), score 22.3.
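The scoring formula above can be sketched as follows. This is a minimal illustration, not the evaluator's code: the function names are hypothetical, and `quality_mult` is assumed here to be a hard 0/1 gate on the 3% LPIPS guardrail (the real multiplier may be graded differently).

```python
import math


def geomean(values):
    """Geometric mean of positive per-clip speedups."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def speedup_score(per_clip_speedups, lpips_rel_increase, guardrail=0.03):
    """score = clip(100 * log2(geomean_speedup), 0, 100) * quality_mult."""
    # Assumption: quality_mult is 1.0 inside the guardrail and 0.0 outside it.
    quality_mult = 1.0 if lpips_rel_increase <= guardrail else 0.0
    raw = 100.0 * math.log2(geomean(per_clip_speedups))
    return min(max(raw, 0.0), 100.0) * quality_mult
```

Under this formula a 2× geomean speedup saturates the score at 100, and any slowdown clips to 0 regardless of quality.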

`nanowm_rollout_stability` — minimize long-horizon drift at iso-wall-clock

The dual: patch the sampling layer to reduce 80-frame tail drift (mean
LPIPS-vs-GT over frames ≥60) under a fixed wall-clock budget (so drift isn't
bought with compute — that's the speedup axis).
score = clip(100·(base_tail−patched_tail)/base_tail,0,100) · wallclock_mult.
Reference (history-stabilization bump) beats baseline robustly under common-random-numbers
pairing (74% per-clip win; pooled paired t=5.15, p<1e-4, Wilcoxon p<1e-4
over 3 seeds × 22 clips): 6.8% ± 1.2% drift reduction, score 6.8 (iso-wall-clock).
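A minimal sketch of the drift metric and score described above. The function names are hypothetical, frames are assumed to be 0-indexed (so "frames ≥60" is the slice `[60:]`), and `wallclock_mult` is passed in as a plain number because its exact definition is not given here.

```python
def tail_drift(per_frame_lpips, tail_start=60):
    """Mean LPIPS-vs-GT over the tail frames (index >= tail_start)."""
    tail = per_frame_lpips[tail_start:]
    return sum(tail) / len(tail)


def stability_score(base_tail, patched_tail, wallclock_mult=1.0):
    """score = clip(100 * (base - patched) / base, 0, 100) * wallclock_mult."""
    raw = 100.0 * (base_tail - patched_tail) / base_tail
    return min(max(raw, 0.0), 100.0) * wallclock_mult
```

With illustrative (not measured) values, `stability_score(0.5, 0.466)` comes out at about 6.8, and a patch that increases tail drift clips to 0.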

Validation (Della H100)

Both run the real evaluator.py end-to-end via a local-GPU backend (Modal
stand-in): patch-policy validation → patch apply → CSGO rollout → metric →
guardrail → scoring. Canonical, seeded, paired results: speedup 1.17× / score
22.3 (LPIPS +1.7%, region-timed); stability 6.8% ± 1.2% / score 6.8 (3 seeds,
pooled paired t=5.15, p<1e-4). Patch policy accepts the references and rejects
metric edits + env-var leakage; the CPU smoke path (FRONTIER_NWM_SMOKE=1)
validates the policy + passes the empty reference for offline CI.
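For readers checking the paired statistics, the per-clip win rate and the paired t statistic can be computed as in the sketch below. It assumes one tail-drift value per (seed, clip) pair for each arm, pooled into two equal-length lists; the function names are hypothetical, and the judge's actual statistics code is not shown in this PR text.

```python
import math
import statistics


def paired_t(baseline, patched):
    """Paired t statistic on per-pair differences (baseline - patched)."""
    diffs = [b - p for b, p in zip(baseline, patched)]
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(len(diffs)))


def win_rate(baseline, patched):
    """Fraction of pairs where the patched arm has lower drift."""
    return sum(p < b for b, p in zip(baseline, patched)) / len(baseline)
```

Pooling 3 seeds × 22 clips would give 66 pairs; pairing is only meaningful because both arms share the same per-clip noise.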

Adversarial audit & hardening

These tasks were put through a multi-agent adversarial audit (5 independent
auditors + refute-pass verifiers; full report + fix-validation table in each task's
AUDIT_REPORT.md, included in this PR). The strong attacks were refuted on the artifacts: the
fp32 speedup baseline is fair (the native --use_fp16 flag is denylisted, so fp32
is the genuine in-scope default — not a strawman), there is no train/test leakage
(22/22 held-out clips ∈ test split, 0 ∈ train), CSGO's frontier is real, and the
RT-1/Maze rejections hold.

One high-severity issue surfaced and is fixed: the rollout RNG was unseeded
and the baseline was cached rather than paired, so ~2–3% run-to-run noise (≈ the effect sizes)
made a no-op patch score nonzero (gameable). Fix (judge infrastructure, outside the
agent's editable scope; shared template untouched): deterministic clip-keyed
seeding (baseline & patched draw identical per-clip initial noise — common random
numbers) + a real per-region sampling timer. Re-validated: a no-op patch now
scores 0.15 / 0.000 ≈ 0, per-clip LPIPS is bit-identical across repeats, residual
wall-noise dropped to 0.24%, the audit's impossible −4.97% bf16 "improvement"
collapsed to the expected +1.7%, and the stability effect tightened to p<1e-4
across 3 seeds. Lower-severity items (doc number reconciliation, multi-seed stats,
LPIPS-table labeling) are also addressed.
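The clip-keyed seeding idea can be illustrated with a small stdlib-only sketch. The hashing scheme, function names, and clip IDs are assumptions made for illustration; the actual harness patch may derive seeds differently and would draw tensors rather than Python floats.

```python
import hashlib
import random


def clip_seed(clip_id: str, base_seed: int) -> int:
    """Derive a stable per-clip seed from the clip ID and the run seed."""
    digest = hashlib.sha256(f"{base_seed}:{clip_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def initial_noise(clip_id: str, base_seed: int, n: int = 4):
    """Draw the per-clip initial noise from a generator keyed only by (clip, seed)."""
    rng = random.Random(clip_seed(clip_id, base_seed))
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Baseline and patched arms call the same function, so they share noise per clip.
baseline_noise = initial_noise("clip_007", base_seed=0)
patched_noise = initial_noise("clip_007", base_seed=0)
```

Because the seed depends only on the clip and the run seed, not on batch order or process state, a no-op patch reproduces the baseline's noise exactly, and any remaining score difference is attributable to the patch.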

Patch policy

Python-only, ≤256 KB. Allowed: src/diffusion/**.py,
src/sample/sampling_utils.py. Denied: model/VAE/metric/rollout-harness/data/
training; native/build files; benchmark detection, env-var leakage, timing
short-circuits. The rollout invocation is fixed by the judge; the agent changes
sampler internals only.
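A rough sketch of the path and size part of such a policy is shown below. It assumes `src/diffusion/**.py` means any `.py` file at any depth under `src/diffusion/`; the function names and example file names are hypothetical, and the content checks (benchmark detection, env-var leakage, timing short-circuits) are not modeled.

```python
from pathlib import PurePosixPath

MAX_PATCH_BYTES = 256 * 1024
ALLOWED_DIR = PurePosixPath("src/diffusion")
ALLOWED_FILES = {PurePosixPath("src/sample/sampling_utils.py")}


def path_allowed(path: str) -> bool:
    """True if a touched file is a .py file inside the editable scope."""
    p = PurePosixPath(path)
    # Reject absolute paths, parent-directory escapes, and non-Python files.
    if p.is_absolute() or ".." in p.parts or p.suffix != ".py":
        return False
    return p in ALLOWED_FILES or ALLOWED_DIR in p.parents


def patch_allowed(touched_paths, patch_bytes: int) -> bool:
    """Size limit plus per-file scope check over every path the patch touches."""
    return patch_bytes <= MAX_PATCH_BYTES and all(
        path_allowed(t) for t in touched_paths
    )
```

An allow-list keeps the check simple: anything not explicitly in scope (model, VAE, metric, harness, native files) is rejected without having to enumerate it.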

CI status (`validate-benchmark20`)

This check is expected red for these tasks, exactly like #145 — and main has
no required status checks, so it is non-blocking. scripts/validate_problems.py
runs each task's reference inside the task's Docker image; for a Modal-GPU task the
real image bakes the NanoWM checkout + L/2 CSGO ckpt + held-out subset and is built
locally via docker/build_images.sh, so it is not on a public registry CI can
pull. The job therefore fails at docker pull with
pull access denied ... repository does not exist for both tasks — the same point
#145 fails at. There is no green path for a Modal-GPU 2.0 task without either a
GPU + published image + Modal token in CI, or editing the shared validator (which
this PR deliberately does not touch).

Everything up to that point is now well-formed (this was the one fix in the latest
push): language: python resolves, reference.py is found, the evaluator imports
and — on the no-GPU smoke path — returns 1.0 for both tasks locally. Real
correctness is the Della H100 end-to-end runs above; once the images are published
to a registry the check goes green unchanged.

Notes for maintainers

- GPU via Modal (*/modal_app.py, mirrors #145); the judge stays CPU. End-to-end
  Modal deploy is pending Modal credentials; the local-backend H100 runs above are
  the validated proof, and Modal only swaps the GPU location. A turnkey deploy+test
  script is included in the task working repo.
- Hidden assets (L/2 CSGO ckpt + held-out CSGO episode subset + cached baseline)
  are baked into the judge image at build time via each task's docker/build_images.sh
  and are not committed here.
- An RT-1 domain variant was evaluated and rejected: RT-1 is over-provisioned
  (quality peaks at ~5 steps), so speedup is trivial. CSGO is uniquely suited
  (monotonic step↔quality from the complex 16-frame-window model).

🤖 Generated with Claude Code

…SIGN, evaluator (patch policy validated: accepts ref, rejects metric-edit+env-leak), orchestrate+modal_app (Modal GPU, judge CPU), harbor/app, docker agent+judge, infra patch, evaluate.sh. Patch-policy + smoke validated on CPU; GPU runner validating on H100

The CI validator (scripts/validate_problems.py -> get_language_config) only supports {python, cpp, rust}; `language: patch` raised `ValueError: Unsupported language: patch` and crashed the whole validate step with a traceback before any per-problem logic ran.

Mirror vllm_llm_serving_optimization (#145), the canonical Modal-GPU 2.0 task: declare `language: python`, keep the real solution in reference.patch, and make reference.py a docstring-only placeholder. Also align the tag to `systems` (matches #145 and the sibling systems-optimization tasks duckdb_e2e_query_optimization / vector_db_ann).

Submission contract is unchanged: agents still submit /app/solution.patch; the static patch policy + Modal-GPU scoring in {speedup,stability}_eval are untouched. The judge stays CPU-only (no runtime.docker.gpu), exactly like #145.

Verified locally: language resolves to python, reference.py is found, both evaluators return 1.0 on the no-GPU smoke path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

An adversarial audit found the rollout metric was unseeded and the baseline cached-not-paired: ~2-3% run-to-run noise (≈ the effect sizes) let a no-op patch score nonzero (gameable), and the bf16 reference showed an impossible -4.97% LPIPS "improvement" that was pure noise.

Fix (judge infrastructure, outside the agent's editable sampling scope; the shared 2.0 adapter/template is untouched):
- Deterministic clip-keyed seeding in the rollout harness so the baseline (unpatched) and patched arms draw identical per-clip initial noise (common random numbers); batch boundaries are clip-aligned so QUICK is a noise-identical prefix of FINAL. Regenerated into infra_patches/0001-rollout-judge-infra.patch (was chunked-decode only).
- Real per-region sampling timer (writes NWM_TIME_FILE) so the speedup metric isolates the patchable region (was dead code -> full-process wall-clock).
- *_eval: settings add SEED; runner passes --seed; orchestrate baseline cache keyed by (== clips, seed) so a cached baseline is a valid CRN partner.

Re-validated on Della H100 (3 seeds): a no-op patch now scores 0.15 / 0.000 (ungameable), residual wall noise 0.24%, bf16 LPIPS delta +1.7% (expected), stability drift reduction 6.8% +/- 1.2% pooled paired t=5.15 p<1e-4.

Canonical numbers reconciled across DESIGN/PR_SUMMARY (speedup 1.17x/22.3, stability 6.8%/6.8); LPIPS tables labeled. Standing audit conclusions C1-C6 unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds AUDIT_REPORT.md (the multi-agent adversarial audit + the "fixes applied + validated" table) to both task dirs so reviewers can see the provenance: the refuted strong attacks (fair fp32 baseline, no leakage, real frontier), the one HIGH finding (unseeded/cached metric) and its deterministic-seeding fix, and the post-fix validation numbers. Absolute working-repo paths rewritten to relative. PR body reference updated to point here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wenhaochai and others added 10 commits June 13, 2026 11:22

nanowm_rollout_speedup: core — scoring/settings/runner (apply patch -…

7702140

…> CSGO rollout -> speedup + LPIPS guardrail) + bf16 reference.patch

nanowm_rollout_speedup: 2.0/README entry

02a8d6b

nanowm_rollout_speedup: record validated reference result (1.17x bf16…

55940fc

…, score 22.4, qmult 1.0) — end-to-end H100 local backend

nanowm_rollout_stability: drift task (dual of speedup) — minimize 80-…

5065f7c

…frame tail-drift at iso-wall-clock. Reference (stab=0.20) reliably beats baseline (t~2.5/22 clips). Reuses framework; patch policy + smoke validated

nanowm_rollout_stability: validated end-to-end (stab=0.20 ref 5.5% ta…

e5e71a7

…il-drift reduction @iso-wall-clock, score 5.54 > baseline)

nanowm_rollout_speedup: infra patch = frame-chunked decode (24) for l…

e0ec414

…ong rollouts

wenhaochai wants to merge 10 commits into main from problem/nanowm-rollout-tasks
