[2.0] Add nanowm_rollout_speedup + nanowm_rollout_stability (Nano World Models)#150
Open
wenhaochai wants to merge 10 commits into
…> CSGO rollout -> speedup + LPIPS guardrail) + bf16 reference.patch
…SIGN, evaluator (patch policy validated: accepts ref, rejects metric-edit+env-leak), orchestrate+modal_app (Modal GPU, judge CPU), harbor/app, docker agent+judge, infra patch, evaluate.sh. Patch-policy + smoke validated on CPU; GPU runner validating on H100
…, score 22.4, qmult 1.0) — end-to-end H100 local backend
…frame tail-drift at iso-wall-clock. Reference (stab=0.20) reliably beats baseline (t~2.5/22 clips). Reuses framework; patch policy + smoke validated
…il-drift reduction @iso-wall-clock, score 5.54 > baseline)
The CI validator (scripts/validate_problems.py -> get_language_config) only
supports {python, cpp, rust}; `language: patch` raised
`ValueError: Unsupported language: patch` and crashed the whole validate
step with a traceback before any per-problem logic ran.
Mirror vllm_llm_serving_optimization (#145), the canonical Modal-GPU 2.0
task: declare `language: python`, keep the real solution in reference.patch,
and make reference.py a docstring-only placeholder. Also align the tag to
`systems` (matches #145 and the sibling systems-optimization tasks
duckdb_e2e_query_optimization / vector_db_ann).
Submission contract is unchanged: agents still submit /app/solution.patch;
the static patch policy + Modal-GPU scoring in {speedup,stability}_eval are
untouched. The judge stays CPU-only (no runtime.docker.gpu), exactly like
#145. Verified locally: language resolves to python, reference.py is found,
both evaluators return 1.0 on the no-GPU smoke path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
An adversarial audit found the rollout metric was unseeded and the baseline cached-not-paired: ~2-3% run-to-run noise (≈ the effect sizes) let a no-op patch score nonzero (gameable), and the bf16 reference showed an impossible -4.97% LPIPS "improvement" that was pure noise.
Fix (judge infrastructure, outside the agent's editable sampling scope; the shared 2.0 adapter/template is untouched):
- Deterministic clip-keyed seeding in the rollout harness so the baseline (unpatched) and patched arms draw identical per-clip initial noise (common random numbers); batch boundaries are clip-aligned so QUICK is a noise-identical prefix of FINAL. Regenerated into infra_patches/0001-rollout-judge-infra.patch (was chunked-decode only).
- Real per-region sampling timer (writes NWM_TIME_FILE) so the speedup metric isolates the patchable region (was dead code -> full-process wall-clock).
- *_eval: settings add SEED; runner passes --seed; orchestrate baseline cache keyed by (== clips, seed) so a cached baseline is a valid CRN partner.
Re-validated on Della H100 (3 seeds): a no-op patch now scores 0.15 / 0.000 (ungameable), residual wall noise 0.24%, bf16 LPIPS delta +1.7% (expected), stability drift reduction 6.8% +/- 1.2% pooled paired t=5.15 p<1e-4.
Canonical numbers reconciled across DESIGN/PR_SUMMARY (speedup 1.17x/22.3, stability 6.8%/6.8); LPIPS tables labeled. Standing audit conclusions C1-C6 unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
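The clip-keyed seeding described in this commit can be illustrated with a minimal sketch. It is not the harness code: the function names, the clip ids, and the use of a SHA-256 digest are assumptions made for illustration, and the standard-library generator stands in for whatever tensor RNG the rollout harness actually uses. The point is only that the seed depends on the clip and the base seed, never on batch order or on which arm is running.

```python
import hashlib
import random


def clip_seed(clip_id: str, base_seed: int) -> int:
    """Derive a stable 64-bit seed from (base_seed, clip_id).

    ASSUMPTION: a SHA-256 digest is used here only because it is stable
    across processes; the real harness may derive seeds differently.
    """
    digest = hashlib.sha256(f"{base_seed}:{clip_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def initial_noise(clip_id: str, base_seed: int, n: int) -> list[float]:
    """Draw n standard-normal samples that depend only on the clip and the seed."""
    rng = random.Random(clip_seed(clip_id, base_seed))
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Hypothetical clip ids; QUICK is a prefix of FINAL.
final_clips = [f"clip_{i:02d}" for i in range(6)]
quick_clips = final_clips[:2]

baseline_noise = {c: initial_noise(c, 0, 8) for c in final_clips}  # unpatched arm
patched_noise = {c: initial_noise(c, 0, 8) for c in final_clips}   # patched arm
quick_noise = {c: initial_noise(c, 0, 8) for c in quick_clips}
final_noise = baseline_noise
```

Because both arms draw identical noise per clip, the per-clip difference between them reflects the patch rather than sampling noise, and a QUICK run sees the same noise as the first clips of a FINAL run.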
Adds AUDIT_REPORT.md (the multi-agent adversarial audit + the "fixes applied + validated" table) to both task dirs so reviewers can see the provenance: the refuted strong attacks (fair fp32 baseline, no leakage, real frontier), the one HIGH finding (unseeded/cached metric) and its deterministic-seeding fix, and the post-fix validation numbers. Absolute working-repo paths rewritten to relative. PR body reference updated to point here.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Two paired Frontier-CS 2.0 tasks built from Nano World Models (arXiv:2605.23993,
simchowitzlabpublic/nano-world-model, MIT), a frozen NanoWM-L/2 CSGO diffusion world model. Same shape as
vllm_llm_serving_optimization (#145): submit a Python-only patch to source, the judge applies it and measures a continuous
metric with a guardrail, GPU runs on Modal (CPU judge). No changes to the
shared 2.0 adapter/template: each task is self-contained + per-problem Modal.
The two are a dual pair on the same frozen model:
- nanowm_rollout_speedup: minimize inference latency at iso-quality.
  The agent patches the diffusion sampling layer to speed up a fixed 50-step, 50-frame CSGO rollout without losing quality.
  score = clip(100·log2(geomean_speedup), 0, 100) · quality_mult (LPIPS-vs-GT guardrail ≤3%).
  The CSGO step↔quality frontier is real (seq@2 = +31% LPIPS vs seq@50, monotonic). Reference (bf16 autocast) = 1.17× at iso-quality (LPIPS +1.7%, within the 3% guardrail), score 22.3.
- nanowm_rollout_stability: minimize long-horizon drift at iso-wall-clock.
  The dual: patch the sampling layer to reduce 80-frame tail drift (mean LPIPS-vs-GT over frames ≥60) under a fixed wall-clock budget (so drift isn't bought with compute; that's the speedup axis).
  score = clip(100·(base_tail−patched_tail)/base_tail, 0, 100) · wallclock_mult.
  Reference (history-stabilization bump) beats baseline robustly under common-random-numbers pairing (74% per-clip win; pooled paired t=5.15, p<1e-4, Wilcoxon p<1e-4 over 3 seeds × 22 clips) = 6.8% ± 1.2% drift reduction, score 6.8 (iso-wall-clock).
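As a rough illustration of the two scoring formulas above, the sketch below evaluates them in plain Python. The function names are hypothetical. The PR text does not define how quality_mult and wallclock_mult are computed, so the hard 0/1 gate used for quality_mult and the pass-through wallclock_mult argument are assumptions, not the evaluator's actual behaviour.

```python
import math


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def geomean(values: list[float]) -> float:
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def speedup_score(per_clip_speedups: list[float],
                  lpips_rel_increase: float,
                  guardrail: float = 0.03) -> float:
    # ASSUMPTION: quality_mult is a hard gate (1.0 inside the 3% LPIPS
    # guardrail, 0.0 outside). The real multiplier may be graded.
    quality_mult = 1.0 if lpips_rel_increase <= guardrail else 0.0
    raw = 100.0 * math.log2(geomean(per_clip_speedups))
    return clip(raw, 0.0, 100.0) * quality_mult


def stability_score(base_tail: float,
                    patched_tail: float,
                    wallclock_mult: float = 1.0) -> float:
    # wallclock_mult is passed in as-is; how the judge derives it from the
    # wall-clock budget is not specified in the PR text.
    raw = 100.0 * (base_tail - patched_tail) / base_tail
    return clip(raw, 0.0, 100.0) * wallclock_mult
```

With a uniform 1.17× speedup inside the guardrail, speedup_score returns roughly 22.7; the reported 22.3 is consistent with an unrounded geomean slightly below 1.17. A slowdown clips to 0 rather than going negative, and under the assumed hard gate any guardrail violation also yields 0.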
Validation (Della H100)
Both run the real evaluator.py end-to-end via a local-GPU backend (Modal stand-in): patch-policy validation → patch apply → CSGO rollout → metric →
guardrail → scoring. Canonical, seeded, paired results: speedup 1.17× / score
22.3 (LPIPS +1.7%, region-timed); stability 6.8% ± 1.2% / score 6.8 (3 seeds,
pooled paired t=5.15, p<1e-4). Patch policy accepts the references and rejects
metric edits + env-var leakage; the CPU smoke path (FRONTIER_NWM_SMOKE=1) validates the policy + passes the empty reference for offline CI.
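For readers unfamiliar with the paired statistics quoted above, here is a minimal sketch of a paired t statistic and a per-clip win rate over matched baseline/patched tail-drift values. The helper name is hypothetical and the numbers in the usage lines are synthetic placeholders, not measurements from this PR.

```python
import math
import statistics


def paired_t(baseline: list[float], patched: list[float]) -> tuple[float, float]:
    """Return (t statistic, win rate) for matched per-clip measurements.

    A positive difference means the patched arm has lower drift on that clip.
    """
    if len(baseline) != len(patched) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [b - p for b, p in zip(baseline, patched)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    win_rate = sum(d > 0 for d in diffs) / n
    return t_stat, win_rate


# Synthetic per-clip tail LPIPS values, for illustration only.
toy_baseline = [0.50, 0.52, 0.48, 0.51]
toy_patched = [0.47, 0.50, 0.49, 0.47]
t_stat, win_rate = paired_t(toy_baseline, toy_patched)
```

Pairing only makes sense when both arms see the same clips under the same seeds, which is what the common-random-numbers setup provides.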
Adversarial audit & hardening
These tasks were put through a multi-agent adversarial audit (5 independent
auditors + refute-pass verifiers; full report + fix-validation table in each task's
AUDIT_REPORT.md, included in this PR). The strong attacks were refuted on the artifacts: the fp32 speedup baseline is fair (the native
--use_fp16 flag is denylisted, so fp32 is the genuine in-scope default, not a strawman), there is no train/test leakage
(22/22 held-out clips ∈ test split, 0 ∈ train), CSGO's frontier is real, and the
RT-1/Maze rejections hold.
One high-severity issue surfaced and is fixed: the rollout RNG was unseeded
and the baseline cached-not-paired, so ~2–3% run-to-run noise (≈ the effect sizes)
made a no-op patch score nonzero (gameable). Fix (judge infrastructure, outside the
agent's editable scope; shared template untouched): deterministic clip-keyed
seeding (baseline & patched draw identical per-clip initial noise — common random
numbers) + a real per-region sampling timer. Re-validated: a no-op patch now
scores 0.15 / 0.000 ≈ 0, per-clip LPIPS is bit-identical across repeats, residual
wall-noise dropped to 0.24%, the audit's impossible −4.97% bf16 "improvement"
collapsed to the expected +1.7%, and the stability effect tightened to p<1e-4
across 3 seeds. Lower-severity items (doc number reconciliation, multi-seed stats,
LPIPS-table labeling) are also addressed.
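The per-region sampling timer mentioned above could look roughly like the following sketch. Only the NWM_TIME_FILE variable name comes from this PR; the context-manager name, the JSON-lines record format, and the region label are assumptions made for illustration.

```python
import json
import os
import time
from contextlib import contextmanager


@contextmanager
def timed_region(name: str):
    """Time the enclosed block and append the result to $NWM_TIME_FILE.

    ASSUMPTION: one JSON object per line. The real harness may use a
    different record format.
    """
    # On a GPU, pending kernels would need to be flushed before each clock
    # read (for example with torch.cuda.synchronize()); omitted here so the
    # sketch runs on CPU without extra dependencies.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        path = os.environ.get("NWM_TIME_FILE")
        if path:
            with open(path, "a", encoding="utf-8") as fh:
                fh.write(json.dumps({"region": name, "seconds": elapsed}) + "\n")
```

Wrapping only the sampling call in `with timed_region("sampling"):` keeps model loading, data decoding, and metric computation out of the measured time, which is the isolation the speedup metric relies on.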
Patch policy
Python-only, ≤256 KB. Allowed: src/diffusion/**.py, src/sample/sampling_utils.py. Denied: model/VAE/metric/rollout-harness/data/training; native/build files; benchmark detection, env-var leakage, timing
short-circuits. The rollout invocation is fixed by the judge; the agent changes
sampler internals only.
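The path and size part of this policy can be sketched as a small static check. This is not the evaluator's implementation: the function names are hypothetical, 256 KB is read as 256 × 1024 bytes by assumption, and the content-level rules (benchmark detection, env-var leakage, timing short-circuits) are not modelled here.

```python
import posixpath

MAX_PATCH_BYTES = 256 * 1024  # ASSUMPTION: "256 KB" interpreted as KiB
ALLOWED_FILES = {"src/sample/sampling_utils.py"}
ALLOWED_PREFIXES = ("src/diffusion/",)


def path_allowed(path: str) -> bool:
    """True if a touched path falls inside the allowlist."""
    norm = posixpath.normpath(path)
    # Reject anything that is not already in normal form, such as paths
    # containing "..", so a patch cannot escape the allowed directories.
    if norm != path or norm.startswith(("/", "../")):
        return False
    if not norm.endswith(".py"):  # Python-only
        return False
    return norm in ALLOWED_FILES or norm.startswith(ALLOWED_PREFIXES)


def check_patch(touched_paths: list[str], patch_bytes: int) -> list[str]:
    """Return a list of policy violations; an empty list means accepted."""
    errors = []
    if patch_bytes > MAX_PATCH_BYTES:
        errors.append(f"patch too large: {patch_bytes} bytes")
    errors.extend(f"path not allowed: {p}" for p in touched_paths
                  if not path_allowed(p))
    return errors
```

For example, a patch touching a (hypothetical) src/diffusion/sampler.py passes, while one touching a metric file, a non-Python file under src/diffusion/, or a path that uses ".." to leave the allowed directory is rejected.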
CI status (validate-benchmark20)
This check is expected red for these tasks, exactly like #145, and main has no required status checks, so it is non-blocking.
scripts/validate_problems.py runs each task's reference inside the task's Docker image; for a Modal-GPU task the
real image bakes the NanoWM checkout + L/2 CSGO ckpt + held-out subset and is built
locally via docker/build_images.sh, so it is not on a public registry CI can pull. The job therefore fails at
docker pull with "pull access denied ... repository does not exist" for both tasks, the same point #145 fails at. There is no green path for a Modal-GPU 2.0 task without either a
GPU + published image + Modal token in CI, or editing the shared validator (which
this PR deliberately does not touch).
Everything up to that point is now well-formed (this was the one fix in the latest
push): language: python resolves, reference.py is found, the evaluator imports and, on the no-GPU smoke path, returns 1.0 for both tasks locally. Real
correctness is demonstrated by the Della H100 end-to-end runs above; once the images are published
to a registry the check goes green unchanged.
Notes for maintainers
- …*/modal_app.py, mirrors #145); the judge stays CPU. End-to-end Modal deploy is pending Modal credentials; the local-backend H100 runs above are the validated proof, and Modal only swaps the GPU location. A turnkey deploy+test script is included in the task working repo.
- …are baked into the judge image at build time via each task's docker/build_images.sh (not committed here).
- …(quality peaks at ~5 steps), so speedup is trivial. CSGO is uniquely suited (monotonic step↔quality from the complex 16-frame-window model).
🤖 Generated with Claude Code