FrontierCS · wenhaochai · Jun 12, 2026
diff --git a/2.0/README.md b/2.0/README.md
@@ -81,3 +81,25 @@ single ICCAD2015 design, `superblue1`. Its problem ID is
 `bboplace_direct_iccad2015`. It follows the same JSON interface and single
 design evaluation flow as `bboplace_direct_ispd2005`, with the ICCAD2015
 baseline for `superblue1`.
+
+## NanoWM Rollout Speedup
+
+This systems problem asks agents to speed up diffusion sampling for a frozen
+video world model. Its problem ID is `nanowm_rollout_speedup`. Agents submit a
+Python-only patch to the sampling layer of Nano World Models (arXiv:2605.23993);
+the judge runs a fixed NanoWM-L/2 CSGO 50-frame long-rollout on a Modal GPU and
+scores wall-clock speedup over the unpatched baseline, gated by an LPIPS
+rollout-quality guardrail (so naive step-cutting fails; real fast sampling is
+required). It mirrors `vllm_llm_serving_optimization`: a source patch scored on
+latency under an accuracy guardrail, run on a Modal GPU from a CPU judge.
+
+## NanoWM Rollout Stability
+
+The dual of `nanowm_rollout_speedup`: minimize long-horizon **drift** at fixed
+compute. Its problem ID is `nanowm_rollout_stability`. Agents submit a Python-only
+patch to the NanoWM sampling layer; the judge runs a fixed 80-frame NanoWM-L/2
+CSGO rollout (50 steps) on a Modal GPU and scores the relative reduction in
+tail-frame (≥60) LPIPS-vs-GT over the unpatched baseline, gated by a wall-clock
+guardrail (so drift can't be bought with more compute). A history-stabilization
+reference reliably beats baseline (validated t≈2.5/22 clips); beating it
+substantially is the open challenge.
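As a rough illustration of the stability metric described in the README hunk above, the relative reduction in tail-frame LPIPS could be computed along these lines. The function name, the per-frame list layout, and the reading of "≥60" as a 0-based frame index are assumptions made for this sketch; the actual scorer is not part of this diff.

```python
from statistics import mean


def tail_drift_reduction(baseline_lpips, patched_lpips, tail_start=60):
    """Relative reduction in mean tail-frame LPIPS (positive means less drift).

    Each argument is a list of per-frame LPIPS-vs-GT values for one rollout.
    tail_start=60 follows the README's "tail-frame (>=60)" wording; treating
    it as a 0-based index is an assumption of this sketch.
    """
    base_tail = mean(baseline_lpips[tail_start:])
    patched_tail = mean(patched_lpips[tail_start:])
    return (base_tail - patched_tail) / base_tail


# Synthetic 80-frame example: identical early frames, lower LPIPS in the tail.
baseline = [0.40] * 60 + [0.60] * 20
patched = [0.40] * 60 + [0.54] * 20
print(f"{tail_drift_reduction(baseline, patched):.3f}")
```

Only frames at or beyond `tail_start` enter the ratio, so a patch that improves early frames but not the tail would score zero under this reading.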
diff --git a/2.0/problems/nanowm_rollout_speedup/AUDIT_REPORT.md b/2.0/problems/nanowm_rollout_speedup/AUDIT_REPORT.md
diff --git a/2.0/problems/nanowm_rollout_speedup/DESIGN.md b/2.0/problems/nanowm_rollout_speedup/DESIGN.md
@@ -0,0 +1,97 @@
+# Design — nanowm_rollout_speedup
+
+## 1. Task
+
+Optimize the inference **latency** of a frozen video world model's
+autoregressive long-rollout, at iso-quality. The agent submits a Python-only
+patch to the diffusion **sampling** layer of Nano World Models (arXiv:2605.23993);
+the judge runs a fixed NanoWM-L/2 CSGO 50-frame rollout, scores wall-clock
+speedup over the unpatched baseline, gated by an LPIPS-vs-GT quality guardrail.
+
+Shape and scoring mirror `vllm_llm_serving_optimization` (#145): patch a source
+tree, build/run on a Modal GPU from a CPU judge, score `100·log2(speedup)` with
+an accuracy guardrail. **No change to the general 2.0 adapter/template** — GPU is
+per-problem via Modal.
+
+## 2. Why this is a real task (calibration)
+
+Measured on Della H100, NanoWM-L/2 CSGO, 50-frame rollout, 12 held-out episodes,
+LPIPS vs ground truth (means over **all** frames incl. the 4 context frames — the
+same convention the judge scores on; generated-only means are ~0.045 higher):
+
+| DDIM steps | speedup | LPIPS-vs-GT | Δ vs seq@50 |
+|---|---|---|---|
+| 50 (baseline) | 1× | 0.517 | — |
+| 20 | 2.5× | 0.543 | +5% |
+| 10 | 5× | 0.543 | +5% |
+| 5 | 10× | 0.579 | +12% |
+| 2 | 25× | 0.676 | +31% |
+
+CSGO has a genuine steps↔quality frontier (paper Fig. 6): naive step-cutting
+degrades quality fast (seq@5 = +12%, seq@2 = +31%). With a 3% guardrail, even
+seq@20 (+5%) fails, so a winning patch must reproduce ~50-step quality with less
+compute via real fast-sampling techniques. (Contrast: on the visually-trivial
+PushT domain, seq@2 reproduces seq@250 within the stochastic noise floor — no
+frontier; CSGO was chosen precisely because the frontier is real.)
+
+## 3. Patch policy (validated before running)
+
+Python-only (`.py`/`.pyi`), ≤256 KB, no file deletion, safe paths.
+
+- **Allowed:** `src/diffusion/**.py`, `src/sample/sampling_utils.py` (the sampling
+  layer: scheduling matrices, the DDIM/diffusion-forcing loop, solvers, caches).
+- **Denied:** model (`src/models/**`), VAE (`src/latent_codecs/**`), metric
+  (`src/sample/evaluate_metrics.py`), harness (`src/sample/rollout.py`), data
+  (`src/wm_datasets/**`), training/eval/utils, native/build/deps.
+- **Rejected added-line tokens:** `FRONTIER_*`, `JUDGE_`, `HARBOR_`, `MODAL_`,
+  `HF_TOKEN`, metric/GT/timing identifiers, `os.environ`, `subprocess`, `socket`,
+  `time.sleep`, `while True`, hard-coded judge paths; i.e. no benchmark
+  detection, env-var leakage, output hard-coding, or timing short-circuits.
+
+The rollout invocation (length/context/nominal-steps/scheduling) is fixed by the
+judge; the agent changes only sampler internals (cf. #145 fixing the serving
+config and patching the scheduler).
+
+## 4. GPU on Modal (judge stays CPU)
+
+`speedup_eval/modal_app.py` runs one rollout in a Modal GPU function from baked
+assets (NanoWM checkout + L/2 CSGO ckpt + held-out CSGO episode subset);
+`orchestrate.run_pair` computes the cached vanilla baseline once and the patched
+run per submission. A `local` backend (`orchestrate._run_local`) runs the same
+`speedup_eval.runner` on a directly-visible GPU — this is the path validated on
+Della H100. End-to-end Modal execution awaits maintainer Modal credentials.
+
+## 5. Scoring
+
+`speedup_eval/scoring.py` (shared with the public test): geomean rollout speedup
+→ `clip(100·log2(speedup), 0, 100) · quality_multiplier`; the multiplier is 1.0
+within `quality_tolerance` relative LPIPS rise and decays inverse-proportionally
+beyond it. The clip saturates at 2×; `score_unbounded` keeps rewarding past 2×.
+
+## 6. Reference solution
+
+`reference.patch`: a one-line bf16-autocast wrap of the sampling loop in
+`gaussian_diffusion.dfot_sample_loop` — a quality-preserving speedup that beats
+the fp32 baseline (CI requires reference > baseline). The intended frontier
+(DPM-Solver++, caching, distillation) is left to the agent.
+
+**Validated end-to-end on Della H100 (local backend, 16 CSGO clips, seed 42,
+sampling-region timed):** baseline 1102.1 s / LPIPS 0.523 → bf16-patched 944.0 s /
+LPIPS 0.532 ⇒ **1.17× speedup, LPIPS +1.7% (within the 3% guardrail),
+quality_multiplier 1.0, score 22.3** (reference > baseline ✓). The judge seeds the
+rollout deterministically per clip (common random numbers), so the baseline and
+patched arms share initial noise: a **no-op patch scores 0.15 (≈0, ungameable)**
+and residual wall-clock noise is **0.24%** (the region timer excludes model/VAE/
+dataset load and the VAE decode the patch cannot touch). Patch policy validated
+(accepts reference; rejects metric edits + env-var leakage); smoke path returns
+1.0 with the empty reference on CPU. The frontier (seq@2 = +31% LPIPS) leaves wide
+headroom above the 1.17× reference for real fast-sampling patches. **Pending:**
+end-to-end Modal execution (maintainer credentials).
+
+## 7. Open items for maintainers
+
+- Modal end-to-end run + a deployed `modal_app` app name / GPU type confirmation.
+- Bake-asset provenance for the judge image (ckpt + held-out CSGO subset + cached
+  baseline metrics); the held-out episode ids are the only hidden component.
+- CI smoke (`FRONTIER_NWM_SMOKE=1`, CPU) validates the patch policy + empty
+  reference; confirm this matches the repo's CI expectation for GPU/Modal tasks.
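The scoring rule in DESIGN.md §5 can be sketched as follows. This is a minimal illustration, not the contents of `speedup_eval/scoring.py`: the function names are hypothetical, and the decay form `tolerance / rise` is an assumed reading of "decays inverse-proportionally", since the diff does not show the exact formula.

```python
import math


def geomean(values):
    """Geometric mean of positive per-clip speedups."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def quality_multiplier(baseline_lpips, patched_lpips, tolerance=0.03):
    """1.0 within the tolerated relative LPIPS rise, decaying beyond it."""
    rise = patched_lpips / baseline_lpips - 1.0
    if rise <= tolerance:
        return 1.0
    # Assumed decay: inversely proportional to the relative rise.
    return tolerance / rise


def score(speedups, baseline_lpips, patched_lpips, tolerance=0.03):
    """clip(100 * log2(geomean speedup), 0, 100) * quality_multiplier."""
    raw = 100.0 * math.log2(geomean(speedups))
    bounded = min(max(raw, 0.0), 100.0)
    return bounded * quality_multiplier(baseline_lpips, patched_lpips, tolerance)


# Reference numbers from DESIGN.md section 6: 1102.1 s -> 944.0 s, LPIPS 0.523 -> 0.532.
print(round(score([1102.1 / 944.0], 0.523, 0.532), 1))
```

With the §6 numbers the relative LPIPS rise is about 1.7%, so the multiplier stays at 1.0 and the result matches the reported score of 22.3.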
diff --git a/2.0/problems/nanowm_rollout_speedup/config.yaml b/2.0/problems/nanowm_rollout_speedup/config.yaml
@@ -0,0 +1,64 @@
+tag: systems
+runtime:
+  # Submission is a Python-only source patch (the real reference is
+  # reference.patch). `language: python` keeps the file extension/CLI conventions
+  # standard (mirrors vllm_llm_serving_optimization, #145); there is no separate
+  # "patch" language in the framework.
+  language: python
+  timeout_seconds: 21600
+  environment: >-
+    Python-only patch against a clean NanoWM checkout (Nano World Models,
+    arXiv:2605.23993); Modal GPU runs the NanoWM-L/2 CSGO 50-frame long-rollout;
+    speedup-vs-baseline judge with an LPIPS rollout-quality guardrail
+  apt_packages:
+    - bash
+    - ca-certificates
+    - curl
+    - git
+    - python3
+    - python3-pip
+  judge_apt_packages:
+    - bash
+    - ca-certificates
+    - curl
+    - git
+    - python3
+    - python3-pip
+  judge_pip_packages:
+    - modal
+  docker:
+    # Experimental local images; build with docker/build_images.sh before a local
+    # Harbor trial. Both bake a clean NanoWM checkout + the L/2 CSGO ckpt; the
+    # judge image additionally vendors the held-out CSGO episode subset, the
+    # LPIPS scorer, and the cached vanilla baseline metrics.
+    image: frontiercs/nanowm-rollout-speedup-agent:experimental-v0
+    judge_image: frontiercs/nanowm-rollout-speedup-judge:experimental-v0
+environment:
+  cpus: 8
+  memory_mb: 32768
+  storage_mb: 32768
+  build_timeout_seconds: 5400
+evaluation:
+  # GPU served on Modal (one per environment); judge container is CPU-only.
+  model: nanowm_l2_csgo
+  dataset: game/csgo
+  gpu: L40S
+  # FIXED rollout invocation (the agent's patch changes sampler internals, not these).
+  rollout_length: 50
+  history_length: 4
+  num_steps: 50            # nominal reference DDIM budget
+  scheduling: sequential
+  history_stab: 0.02
+  # Quality guardrail: patched rollout LPIPS-vs-GT may rise at most this
+  # (relative) above the unpatched seq@50 baseline before the score is penalized.
+  # Calibrated: seq@20 is already +5% over seq@50, so a 3% tolerance forces real
+  # fast-sampling work (DPM-Solver++, caching, distillation), not naive step cuts.
+  quality_tolerance: 0.03
+  quick_clips: 4           # iterative (agent-role) public feedback
+  final_clips: 16          # final (verifier-role) evaluation
+  batch_size: 4
+  baseline_cache_path: /opt/nanowm/baseline/baseline_metrics.json
+submission:
+  kind: file
+  path: /app/solution.patch
+  max_queue_size: 2
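To make the `quality_tolerance: 0.03` calibration concrete, the short check below applies the guardrail to the LPIPS values from the DESIGN.md calibration table. The constant and function names are illustrative only; the numbers are copied from that table.

```python
QUALITY_TOLERANCE = 0.03   # evaluation.quality_tolerance in config.yaml
BASELINE_LPIPS = 0.517     # seq@50 row of the calibration table

# DDIM steps -> LPIPS-vs-GT, from the DESIGN.md calibration table.
CALIBRATION = {20: 0.543, 10: 0.543, 5: 0.579, 2: 0.676}


def relative_rise(lpips, baseline=BASELINE_LPIPS):
    """Relative LPIPS increase over the unpatched baseline."""
    return lpips / baseline - 1.0


def passes_guardrail(lpips, tolerance=QUALITY_TOLERANCE):
    """True when the relative rise stays within the tolerance."""
    return relative_rise(lpips) <= tolerance


for steps, lpips in CALIBRATION.items():
    verdict = "pass" if passes_guardrail(lpips) else "fail"
    print(f"seq@{steps}: rise {relative_rise(lpips):+.1%} -> {verdict}")
```

Every naive step-cut row in the table exceeds the 3% tolerance, which is the property the config comment relies on.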
diff --git a/2.0/problems/nanowm_rollout_speedup/docker/agent/Dockerfile b/2.0/problems/nanowm_rollout_speedup/docker/agent/Dockerfile
@@ -0,0 +1,32 @@
+# Agent workspace image for nanowm_rollout_speedup (CPU-only client).
+# The agent edits the NanoWM checkout under /app/nano-world-model and submits a
+# patch; GPU evaluation happens on Modal (judge side). Build via ../build_images.sh.
+FROM python:3.11-slim
+
+ENV DEBIAN_FRONTEND=noninteractive PIP_NO_CACHE_DIR=1
+RUN apt-get update && apt-get install -y --no-install-recommends \
+        git curl ca-certificates bash ripgrep patch && rm -rf /var/lib/apt/lists/*
+
+# Pre-install Claude Code + Codex CLI (parity with other 2.0 agent images).
+RUN curl -fsSL https://claude.ai/install.sh | bash -s -- && \
+    echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
+ENV NVM_DIR="/root/.nvm"
+RUN curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash && \
+    . "$NVM_DIR/nvm.sh" && nvm install 22 && nvm alias default 22 && \
+    npm install -g @openai/codex@latest && ln -sf "$(which node)" /usr/local/bin/node
+
+WORKDIR /app
+# Clean NanoWM checkout the agent patches (NANOWM_COMMIT selects the commit; the default `main` is unpinned; native build not needed).
+ARG NANOWM_COMMIT=main
+RUN git clone https://github.com/simchowitzlabpublic/nano-world-model /app/nano-world-model && \
+    cd /app/nano-world-model && git checkout "$NANOWM_COMMIT"
+# Task-local infra patches the agent builds on (decode-OOM + get_seq_length);
+# applied so the harness rollout works, outside the agent's editable scope.
+COPY task_ctx/infra_patches/ /tmp/infra_patches/
+RUN cd /app/nano-world-model && for p in /tmp/infra_patches/*.patch; do \
+        [ -e "$p" ] && git apply "$p" || true; done && \
+    git add -A && git -c user.email=t@e -c user.name=t commit -q -m base || true
+
+COPY task_ctx/harbor_app/ /app/
+COPY task_ctx/task_pkg/ /app/task/
+RUN chmod +x /app/*.sh 2>/dev/null || true
diff --git a/2.0/problems/nanowm_rollout_speedup/docker/build_images.sh b/2.0/problems/nanowm_rollout_speedup/docker/build_images.sh
@@ -0,0 +1,32 @@
+#!/usr/bin/env bash
+# Build the nanowm_rollout_speedup agent + judge images.
+#   bash 2.0/problems/nanowm_rollout_speedup/docker/build_images.sh [tag]
+# Requires the hidden assets staged under $NWM_ASSETS (default below):
+#   ckpts/nanowm-l2-csgo-100k/model_state_dict.pt
+#   csgo/1-200/*.hdf5             (held-out CSGO episode subset)
+#   csgo_subset/{val_files.txt,val_starts.npy}
+#   baseline/baseline_metrics.json (optional; computed on first trial otherwise)
+set -euo pipefail
+TAG="${1:-experimental-v0}"
+NANOWM_COMMIT="${NANOWM_COMMIT:-main}"
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+PROB=$(cd "$SCRIPT_DIR/.." && pwd)
+NWM_ASSETS="${NWM_ASSETS:-/scratch/gpfs/KARTHIKN/wc9403/project/fcs-2/data}"
+
+CTX=$(mktemp -d); trap 'rm -rf "$CTX"' EXIT
+mkdir -p "$CTX/task_ctx"
+cp -r "$PROB/speedup_eval" "$CTX/task_ctx/task_pkg"
+cp "$PROB/evaluator.py" "$CTX/task_ctx/evaluator.py"
+cp -r "$PROB/harbor/app" "$CTX/task_ctx/harbor_app"
+cp -r "$PROB/infra_patches" "$CTX/task_ctx/infra_patches" 2>/dev/null || mkdir -p "$CTX/task_ctx/infra_patches"
+# judge-only hidden assets
+mkdir -p "$CTX/task_ctx/assets/ckpts" "$CTX/task_ctx/assets/data"
+cp -r "$NWM_ASSETS/ckpts/nanowm-l2-csgo-100k" "$CTX/task_ctx/assets/ckpts/nanowm-l2-csgo" 2>/dev/null || true
+cp -r "$NWM_ASSETS/csgo" "$CTX/task_ctx/assets/data/csgo" 2>/dev/null || true
+cp -r "$NWM_ASSETS/csgo_subset" "$CTX/task_ctx/assets/data/csgo_subset" 2>/dev/null || true
+
+docker build --build-arg NANOWM_COMMIT="$NANOWM_COMMIT" \
+  -t "frontiercs/nanowm-rollout-speedup-agent:$TAG" -f "$SCRIPT_DIR/agent/Dockerfile" "$CTX"
+docker build --build-arg NANOWM_COMMIT="$NANOWM_COMMIT" \
+  -t "frontiercs/nanowm-rollout-speedup-judge:$TAG" -f "$SCRIPT_DIR/judge/Dockerfile" "$CTX"
+echo "Built frontiercs/nanowm-rollout-speedup-{agent,judge}:$TAG"
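Because the `cp` lines in `build_images.sh` end in `2>/dev/null || true`, a missing asset does not stop the build. A small preflight check like the one below could surface that before `docker build` runs. This is a sketch under assumptions: the helper name is hypothetical, and the relative paths are the source paths read by the script's `cp` lines, taken relative to `$NWM_ASSETS`.

```python
import tempfile
from pathlib import Path

# Source paths as read by the cp lines in build_images.sh, relative to $NWM_ASSETS.
REQUIRED_ASSETS = (
    "ckpts/nanowm-l2-csgo-100k",
    "csgo",
    "csgo_subset",
)


def missing_assets(root, required=REQUIRED_ASSETS):
    """Return the required relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).exists()]


# Demo on a throwaway directory that stages only the episode data.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "csgo").mkdir()
    print(missing_assets(tmp))
```

An empty list means every expected source directory is present; anything else names what would otherwise be dropped from the judge image without an error.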
diff --git a/2.0/problems/nanowm_rollout_speedup/docker/judge/Dockerfile b/2.0/problems/nanowm_rollout_speedup/docker/judge/Dockerfile
@@ -0,0 +1,35 @@
+# Judge image for nanowm_rollout_speedup (CPU-only; drives a Modal GPU).
+# Bakes the task package + clean NanoWM checkout + L/2 CSGO ckpt + held-out CSGO
+# episode subset + cached vanilla baseline; the GPU rollout runs on Modal.
+FROM python:3.11-slim
+
+ENV DEBIAN_FRONTEND=noninteractive PIP_NO_CACHE_DIR=1 FRONTIER_NWM_PKG=/opt/nanowm/task
+RUN apt-get update && apt-get install -y --no-install-recommends \
+        git curl ca-certificates bash patch && rm -rf /var/lib/apt/lists/*
+RUN pip install modal omegaconf hydra-core
+
+WORKDIR /opt/nanowm
+ARG NANOWM_COMMIT=main
+RUN git clone https://github.com/simchowitzlabpublic/nano-world-model /opt/nanowm/nano-world-model && \
+    cd /opt/nanowm/nano-world-model && git checkout "$NANOWM_COMMIT"
+COPY task_ctx/infra_patches/ /tmp/infra_patches/
+RUN cd /opt/nanowm/nano-world-model && for p in /tmp/infra_patches/*.patch; do \
+        [ -e "$p" ] && git apply "$p" || true; done
+
+# Hidden assets baked by build_images.sh into the build context:
+#   task_ctx/assets/ckpts/nanowm-l2-csgo/model_state_dict.pt
+#   task_ctx/assets/data/csgo/1-200/*.hdf5       (held-out episode subset)
+#   task_ctx/assets/data/csgo_subset/{val_files.txt,val_starts.npy}
+#   task_ctx/assets/baseline/baseline_metrics.json (optional precomputed)
+COPY task_ctx/assets/ /opt/nanowm/
+COPY task_ctx/task_pkg/ /opt/nanowm/task/
+COPY task_ctx/evaluator.py /opt/nanowm/task/evaluator.py
+
+# Modal builds the GPU image from this same checkout/assets at deploy time.
+ENV FRONTIER_NWM_REPO=/opt/nanowm/nano-world-model \
+    FRONTIER_NWM_CKPT=/opt/nanowm/ckpts/nanowm-l2-csgo/model_state_dict.pt \
+    FRONTIER_NWM_CSGO_DATA=/opt/nanowm/data/csgo \
+    FRONTIER_NWM_VAL_FILES=/opt/nanowm/data/csgo_subset/val_files.txt \
+    FRONTIER_NWM_VAL_STARTS=/opt/nanowm/data/csgo_subset/val_starts.npy \
+    FRONTIER_NWM_BASELINE_CACHE=/opt/nanowm/baseline/baseline_metrics.json \
+    NWM_BAKE_DIR=/opt/nanowm
diff --git a/2.0/problems/nanowm_rollout_speedup/evaluate.sh b/2.0/problems/nanowm_rollout_speedup/evaluate.sh
@@ -0,0 +1,14 @@
+#!/usr/bin/env bash
+# Local CLI evaluation for nanowm_rollout_speedup.
+# Full runs need a GPU (local backend) or Modal credentials, plus the baked
+# NanoWM assets (FRONTIER_NWM_* env, see speedup_eval/settings.py). Without a GPU
+# this falls back to the smoke path (patch-policy validation only), which is what
+# repository CI exercises.
+set -euo pipefail
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+export FRONTIER_NWM_PKG="${FRONTIER_NWM_PKG:-$SCRIPT_DIR}"
+if ! python3 -c 'import torch; assert torch.cuda.is_available()' >/dev/null 2>&1; then
+  export FRONTIER_NWM_SMOKE="${FRONTIER_NWM_SMOKE:-1}"
+fi
+SOLUTION="${1:-$SCRIPT_DIR/reference.patch}"
+exec python3 "$SCRIPT_DIR/evaluator.py" "$SOLUTION"
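The smoke path above exercises only patch-policy validation. A simplified version of such a check, following the rules listed in DESIGN.md §3, might look like the sketch below. It is not the repository's validator: the function name, the error strings, and the token list (which omits the metric/GT/timing identifiers and judge paths not spelled out in the diff) are assumptions, and the file names in the demo are hypothetical.

```python
from fnmatch import fnmatch

MAX_PATCH_BYTES = 256 * 1024
# fnmatch's "*" also matches "/", so the first pattern covers nested modules.
ALLOWED_PATHS = ("src/diffusion/*.py", "src/sample/sampling_utils.py")
DENIED_TOKENS = (
    "FRONTIER_", "JUDGE_", "HARBOR_", "MODAL_", "HF_TOKEN",
    "os.environ", "subprocess", "socket", "time.sleep", "while True",
)


def validate_patch(patch_text):
    """Return a list of policy violations; an empty list means accepted."""
    errors = []
    if len(patch_text.encode("utf-8")) > MAX_PATCH_BYTES:
        errors.append("patch exceeds 256 KB")
    for line in patch_text.splitlines():
        if line.startswith("+++ "):
            target = line[4:].strip()
            if target == "/dev/null":
                errors.append("file deletion is not allowed")
                continue
            path = target[2:] if target.startswith("b/") else target
            if path.startswith("/") or ".." in path.split("/"):
                errors.append(f"unsafe path: {path}")
            elif not any(fnmatch(path, pat) for pat in ALLOWED_PATHS):
                errors.append(f"path not allowed: {path}")
        elif line.startswith("+"):
            # Only added lines are scanned for rejected tokens.
            for token in DENIED_TOKENS:
                if token in line:
                    errors.append(f"rejected token {token!r} in added line")
    return errors


# Hypothetical patch touching an allowed sampler file.
demo = (
    "--- a/src/diffusion/gaussian_diffusion.py\n"
    "+++ b/src/diffusion/gaussian_diffusion.py\n"
    "@@ -1,1 +1,2 @@\n"
    " import torch\n"
    "+import math\n"
)
print(validate_patch(demo))
```

Scanning only added lines keeps existing occurrences of a token in context lines from rejecting an otherwise valid patch.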