
[2.0] Add nanowm_rollout_speedup + nanowm_rollout_stability (Nano World Models)#150

Open
wenhaochai wants to merge 10 commits into main from problem/nanowm-rollout-tasks

@wenhaochai wenhaochai commented Jun 13, 2026

Contributor

Summary

Two paired Frontier-CS 2.0 tasks built from Nano World Models (arXiv:2605.23993,
simchowitzlabpublic/nano-world-model, MIT) — a frozen NanoWM-L/2 CSGO diffusion
world model. Same shape as vllm_llm_serving_optimization (#145): the agent submits a
Python-only patch to the source, the judge applies it and measures a continuous
metric with a guardrail, and GPU runs happen on Modal while the judge stays on CPU.
No changes to the shared 2.0 adapter/template: each task is self-contained, with
its own per-problem Modal app.

The two are a dual pair on the same frozen model:

nanowm_rollout_speedup — minimize inference latency at iso-quality

Agent patches the diffusion sampling layer to speed up a fixed 50-step,
50-frame CSGO rollout without losing quality.
score = clip(100·log2(geomean_speedup), 0, 100) · quality_mult, where quality_mult
applies the LPIPS-vs-GT guardrail (≤3% regression). The CSGO step↔quality frontier
is real: seq@2 is +31% LPIPS vs seq@50, and the trend is monotonic. The reference
(bf16 autocast) reaches 1.17× at iso-quality (LPIPS +1.7%, within the 3% guardrail),
for a score of 22.3.
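The scoring formula above can be sketched as follows. This is a minimal illustration, not the evaluator's code: the function name and the choice to pass quality_mult in as a precomputed number are assumptions, since the exact shape of the guardrail multiplier is not stated here, and the geomean is assumed to be taken over per-clip speedups.

```python
import math
from statistics import geometric_mean


def speedup_score(speedups, quality_mult):
    """clip(100 * log2(geomean_speedup), 0, 100) * quality_mult.

    speedups: per-clip speedup ratios, baseline_time / patched_time
              (assumption: the geomean is taken over clips).
    quality_mult: guardrail multiplier, computed elsewhere (assumption).
    """
    geomean = geometric_mean(speedups)
    raw = 100.0 * math.log2(geomean)
    clipped = min(max(raw, 0.0), 100.0)
    return clipped * quality_mult


if __name__ == "__main__":
    # A uniform 2x speedup saturates the score; a slowdown clips to 0.
    print(speedup_score([2.0, 2.0], 1.0))
    print(speedup_score([0.5, 0.5], 1.0))
```

The log2 scale means each doubling of speed is worth 100 raw points, so the score saturates at a 2× geomean speedup and small gains near 1× map almost linearly to points.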

nanowm_rollout_stability — minimize long-horizon drift at iso-wall-clock

The dual: patch the sampling layer to reduce 80-frame tail drift (mean
LPIPS-vs-GT over frames ≥60) under a fixed wall-clock budget (so drift isn't
bought with compute — that's the speedup axis).
score = clip(100·(base_tail−patched_tail)/base_tail, 0, 100) · wallclock_mult.
The reference (history-stabilization bump) beats the baseline robustly under
common-random-numbers pairing (74% per-clip win rate; pooled paired t=5.15, p<1e-4;
Wilcoxon p<1e-4 over 3 seeds × 22 clips), giving a 6.8% ± 1.2% drift reduction and
a score of 6.8 at iso-wall-clock.
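A small sketch of the tail-drift metric and the stability score may help make the formula concrete. The function names are hypothetical, frames are assumed to be 0-indexed (so "frames ≥60" is the slice from index 60), and wallclock_mult is passed in as a precomputed number because its exact form is not given here.

```python
def tail_drift(lpips_per_frame, tail_start=60):
    """Mean LPIPS-vs-GT over frames with index >= tail_start.

    Assumes 0-based frame indices (an assumption, not confirmed by the task).
    """
    tail = lpips_per_frame[tail_start:]
    if not tail:
        raise ValueError("rollout is shorter than tail_start")
    return sum(tail) / len(tail)


def stability_score(base_tail, patched_tail, wallclock_mult):
    """clip(100 * (base_tail - patched_tail) / base_tail, 0, 100) * wallclock_mult."""
    reduction = (base_tail - patched_tail) / base_tail
    clipped = min(max(100.0 * reduction, 0.0), 100.0)
    return clipped * wallclock_mult


if __name__ == "__main__":
    # Synthetic 80-frame curves: identical for 60 frames, then diverging.
    base = [0.10] * 60 + [0.50] * 20
    patched = [0.10] * 60 + [0.40] * 20
    print(stability_score(tail_drift(base), tail_drift(patched), 1.0))
```

With wallclock_mult equal to 1.0 the score is simply the percentage drift reduction, which is why a 6.8% reduction corresponds to a score of 6.8.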

Validation (Della H100)

Both run the real evaluator.py end-to-end via a local-GPU backend (Modal
stand-in): patch-policy validation → patch apply → CSGO rollout → metric →
guardrail → scoring. Canonical, seeded, paired results: speedup 1.17× / score 22.3
(LPIPS +1.7%, region-timed); stability 6.8% ± 1.2% / score 6.8 (3 seeds,
pooled paired t=5.15, p<1e-4). The patch policy accepts the references and rejects
metric edits and env-var leakage; the CPU smoke path (FRONTIER_NWM_SMOKE=1)
validates the policy and passes the empty reference for offline CI.
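"Region-timed" means only the patchable sampling region is on the clock, rather than the whole process. A minimal sketch of such a timer is below; the helper names and the JSON output format are hypothetical, and treating NWM_TIME_FILE as an environment variable holding the output path is an assumption, not the judge's confirmed interface.

```python
import json
import os
import time
from contextlib import contextmanager


@contextmanager
def region_timer(name, totals):
    """Accumulate wall-clock seconds spent inside the with-block under `name`."""
    # On a GPU, the device should be synchronized before each clock read
    # (e.g. torch.cuda.synchronize()), because kernels launch asynchronously.
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[name] = totals.get(name, 0.0) + (time.perf_counter() - start)


def write_times(totals, path=None):
    """Write accumulated timings as JSON; returns the path used, or None."""
    # Assumption: NWM_TIME_FILE is an env var naming the output file.
    path = path or os.environ.get("NWM_TIME_FILE")
    if not path:
        return None
    with open(path, "w") as f:
        json.dump(totals, f)
    return path


if __name__ == "__main__":
    totals = {}
    with region_timer("sampling", totals):
        sum(i * i for i in range(100_000))  # stand-in for the sampling loop
    print(totals)
```

Timing only the sampling region keeps model loading, data decoding, and metric computation out of the speedup ratio, so a patch is credited only for time it can actually influence.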

Adversarial audit & hardening

These tasks were put through a multi-agent adversarial audit (5 independent
auditors + refute-pass verifiers; full report + fix-validation table in each task's
AUDIT_REPORT.md, included in this PR). The strong attacks were refuted on the artifacts: the
fp32 speedup baseline is fair (the native --use_fp16 flag is denylisted, so fp32
is the genuine in-scope default — not a strawman), there is no train/test leakage
(22/22 held-out clips ∈ test split, 0 ∈ train), CSGO's frontier is real, and the
RT-1/Maze rejections hold.

One high-severity issue surfaced and is fixed: the rollout RNG was unseeded
and the baseline was cached rather than paired, so ~2–3% run-to-run noise (≈ the
effect sizes) let a no-op patch score nonzero (gameable). Fix (judge infrastructure,
outside the agent's editable scope; shared template untouched): deterministic
clip-keyed seeding (baseline and patched arms draw identical per-clip initial noise,
i.e. common random numbers) plus a real per-region sampling timer. Re-validated: a
no-op patch now scores 0.15 / 0.000 ≈ 0, per-clip LPIPS is bit-identical across
repeats, residual wall-noise dropped to 0.24%, the audit's impossible −4.97% bf16
"improvement" collapsed to the expected +1.7%, and the stability effect tightened to
p<1e-4 across 3 seeds. Lower-severity items (doc number reconciliation, multi-seed
stats, LPIPS-table labeling) are also addressed.
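The idea behind clip-keyed seeding can be illustrated with a short sketch. It is not the harness code: the function names and the clip identifiers are hypothetical, and the standard-library generator stands in for whatever tensor RNG the real rollout uses.

```python
import hashlib
import random


def clip_seed(base_seed, clip_id):
    """Derive a stable per-clip seed from the run seed and the clip key."""
    # hashlib is used instead of hash() because Python randomizes str hashes
    # per process, which would break cross-run reproducibility.
    digest = hashlib.sha256(f"{base_seed}:{clip_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def initial_noise(base_seed, clip_id, n):
    """Draw n Gaussian samples that depend only on (base_seed, clip_id)."""
    rng = random.Random(clip_seed(base_seed, clip_id))
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    # Baseline and patched arms call this with the same key, so they start
    # from identical noise regardless of batch order or which arm runs first.
    baseline_noise = initial_noise(0, "clip_003", 4)
    patched_noise = initial_noise(0, "clip_003", 4)
    print(baseline_noise == patched_noise)
```

Because the noise depends only on the seed and the clip key, the per-clip difference between the two arms reflects the patch rather than sampling noise, which is what makes the paired statistics meaningful.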

Patch policy

Python-only, ≤256 KB. Allowed: src/diffusion/**.py,
src/sample/sampling_utils.py. Denied: model/VAE/metric/rollout-harness/data/
training; native/build files; benchmark detection, env-var leakage, timing
short-circuits. The rollout invocation is fixed by the judge; the agent changes
sampler internals only.
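As a rough illustration of the path and size part of this policy, the sketch below checks a unified diff against the allowlist. It is an assumption-laden simplification: the function names are hypothetical, src/diffusion/**.py is read as "any .py file under src/diffusion", and the content-level rules (benchmark detection, env-var leakage, timing short-circuits) are not modeled.

```python
import re
from pathlib import PurePosixPath

MAX_PATCH_BYTES = 256 * 1024
ALLOWED_FILE = "src/sample/sampling_utils.py"
ALLOWED_DIR_PARTS = ("src", "diffusion")

_DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$")


def touched_paths(patch_text):
    """Collect file paths named in 'diff --git' headers of a git-style patch."""
    paths = set()
    for line in patch_text.splitlines():
        match = _DIFF_HEADER.match(line)
        if match:
            paths.update(match.groups())
    return paths


def is_allowed(path):
    """True if the path is inside the editable sampling scope."""
    p = PurePosixPath(path)
    if p.is_absolute() or ".." in p.parts:
        return False  # reject attempts to escape the allowed tree
    if path == ALLOWED_FILE:
        return True
    return p.suffix == ".py" and p.parts[:2] == ALLOWED_DIR_PARTS


def check_patch(patch_text):
    """Return a list of human-readable violations (empty list means pass)."""
    violations = []
    if len(patch_text.encode("utf-8")) > MAX_PATCH_BYTES:
        violations.append("patch exceeds 256 KB")
    for path in sorted(touched_paths(patch_text)):
        if not is_allowed(path):
            violations.append(f"path not allowed: {path}")
    return violations
```

An empty patch produces no violations in this sketch, which matches the smoke path accepting the empty reference; a real policy would add the content checks on top of this path filter.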

CI status (validate-benchmark20)

This check is expected to be red for these tasks, exactly like #145, and main has
no required status checks, so it is non-blocking. scripts/validate_problems.py
runs each task's reference inside the task's Docker image; for a Modal-GPU task the
real image bakes in the NanoWM checkout, the L/2 CSGO ckpt, and the held-out subset,
and is built locally via docker/build_images.sh, so it is not on a public registry
that CI can pull from. The job therefore fails at docker pull with
pull access denied ... repository does not exist for both tasks, the same point
where #145 fails. There is no green path for a Modal-GPU 2.0 task without either a
GPU + published image + Modal token in CI, or editing the shared validator (which
this PR deliberately does not touch).

Everything up to that point is now well-formed (this was the one fix in the latest
push): language: python resolves, reference.py is found, the evaluator imports
and — on the no-GPU smoke path — returns 1.0 for both tasks locally. Real
correctness is the Della H100 end-to-end runs above; once the images are published
to a registry the check goes green unchanged.

Notes for maintainers

  • GPU via Modal (*/modal_app.py, mirrors vllm_llm_serving_optimization #145); the judge stays CPU. End-to-end
    Modal deploy is pending Modal credentials; the local-backend H100 runs above are
    the validated proof, and Modal only swaps the GPU location. A turnkey deploy+test
    script is included in the task working repo.
  • Hidden assets (L/2 CSGO ckpt + held-out CSGO episode subset + cached baseline)
    are baked into the judge image at build time via each task's docker/build_images.sh
    — not committed here.
  • An RT-1 domain variant was evaluated and rejected: RT-1 is over-provisioned
    (quality peaks at ~5 steps), so speedup is trivial. CSGO is uniquely suited
    (monotonic step↔quality from the complex 16-frame-window model).

🤖 Generated with Claude Code

wenhaochai and others added 10 commits June 13, 2026 11:22
…> CSGO rollout -> speedup + LPIPS guardrail) + bf16 reference.patch
…SIGN, evaluator (patch policy validated: accepts ref, rejects metric-edit+env-leak), orchestrate+modal_app (Modal GPU, judge CPU), harbor/app, docker agent+judge, infra patch, evaluate.sh. Patch-policy + smoke validated on CPU; GPU runner validating on H100
…, score 22.4, qmult 1.0) — end-to-end H100 local backend
…frame tail-drift at iso-wall-clock. Reference (stab=0.20) reliably beats baseline (t~2.5/22 clips). Reuses framework; patch policy + smoke validated
…il-drift reduction @iso-wall-clock, score 5.54 > baseline)
The CI validator (scripts/validate_problems.py -> get_language_config) only
supports {python, cpp, rust}; `language: patch` raised
`ValueError: Unsupported language: patch` and crashed the whole validate
step with a traceback before any per-problem logic ran.

Mirror vllm_llm_serving_optimization (#145), the canonical Modal-GPU 2.0
task: declare `language: python`, keep the real solution in reference.patch,
and make reference.py a docstring-only placeholder. Also align the tag to
`systems` (matches #145 and the sibling systems-optimization tasks
duckdb_e2e_query_optimization / vector_db_ann).

Submission contract is unchanged: agents still submit /app/solution.patch;
the static patch policy + Modal-GPU scoring in {speedup,stability}_eval are
untouched. The judge stays CPU-only (no runtime.docker.gpu), exactly like
#145. Verified locally: language resolves to python, reference.py is found,
both evaluators return 1.0 on the no-GPU smoke path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
An adversarial audit found the rollout metric was unseeded and the baseline
cached-not-paired: ~2-3% run-to-run noise (≈ the effect sizes) let a no-op
patch score nonzero (gameable), and the bf16 reference showed an impossible
-4.97% LPIPS "improvement" that was pure noise.

Fix (judge infrastructure, outside the agent's editable sampling scope; the
shared 2.0 adapter/template is untouched):
- Deterministic clip-keyed seeding in the rollout harness so the baseline
  (unpatched) and patched arms draw identical per-clip initial noise (common
  random numbers); batch boundaries are clip-aligned so QUICK is a
  noise-identical prefix of FINAL. Regenerated into
  infra_patches/0001-rollout-judge-infra.patch (was chunked-decode only).
- Real per-region sampling timer (writes NWM_TIME_FILE) so the speedup metric
  isolates the patchable region (was dead code -> full-process wall-clock).
- *_eval: settings add SEED; runner passes --seed; orchestrate baseline cache
  keyed by (== clips, seed) so a cached baseline is a valid CRN partner.

Re-validated on Della H100 (3 seeds): a no-op patch now scores 0.15 / 0.000
(ungameable), residual wall noise 0.24%, bf16 LPIPS delta +1.7% (expected),
stability drift reduction 6.8% +/- 1.2% pooled paired t=5.15 p<1e-4. Canonical
numbers reconciled across DESIGN/PR_SUMMARY (speedup 1.17x/22.3, stability
6.8%/6.8); LPIPS tables labeled. Standing audit conclusions C1-C6 unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds AUDIT_REPORT.md (the multi-agent adversarial audit + the "fixes applied
+ validated" table) to both task dirs so reviewers can see the provenance:
the refuted strong attacks (fair fp32 baseline, no leakage, real frontier),
the one HIGH finding (unseeded/cached metric) and its deterministic-seeding
fix, and the post-fix validation numbers. Absolute working-repo paths
rewritten to relative. PR body reference updated to point here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>