22 changes: 22 additions & 0 deletions 2.0/README.md
@@ -81,3 +81,25 @@ single ICCAD2015 design, `superblue1`. Its problem ID is
`bboplace_direct_iccad2015`. It follows the same JSON interface and single
design evaluation flow as `bboplace_direct_ispd2005`, with the ICCAD2015
baseline for `superblue1`.

## NanoWM Rollout Speedup

This systems problem asks agents to speed up diffusion sampling for a frozen
video world model. Its problem ID is `nanowm_rollout_speedup`. Agents submit a
Python-only patch to the sampling layer of Nano World Models (arXiv:2605.23993).
The judge runs a fixed NanoWM-L/2 CSGO 50-frame long rollout on a Modal GPU and
scores wall-clock speedup over the unpatched baseline, gated by an LPIPS
rollout-quality guardrail, so naive step-cutting fails and real fast-sampling
techniques are required. The setup mirrors `vllm_llm_serving_optimization`: a
source patch scored on latency with an accuracy guardrail, evaluated on a Modal
GPU from a CPU-only judge.

## NanoWM Rollout Stability

The dual of `nanowm_rollout_speedup`: minimize long-horizon **drift** at fixed
compute. Its problem ID is `nanowm_rollout_stability`. Agents submit a Python-only
patch to the NanoWM sampling layer. The judge runs a fixed 80-frame NanoWM-L/2
CSGO rollout (50 steps) on a Modal GPU and scores the relative reduction in
tail-frame (frames ≥ 60) LPIPS-vs-GT over the unpatched baseline, gated by a
wall-clock guardrail so that lower drift cannot be bought with more compute. A
history-stabilization reference reliably beats the baseline (validated at t ≈ 2.5
over 22 clips); beating it substantially is the open challenge.
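
The tail-frame metric described above can be sketched as follows. This is a minimal illustration rather than the judge's code: the function names and the per-frame LPIPS list layout are assumptions, and the toy curves are made-up numbers. Only the tail start (frame 60), the 80-frame length, and the "relative reduction over the baseline" definition come from the description.

```python
from statistics import mean

TAIL_START = 60  # tail frames are those with index >= 60 (from the task description)


def tail_lpips(per_frame_lpips):
    """Mean LPIPS-vs-GT over the tail frames of one rollout."""
    tail = per_frame_lpips[TAIL_START:]
    if not tail:
        raise ValueError("rollout is shorter than the tail start")
    return mean(tail)


def relative_drift_reduction(baseline_frames, patched_frames):
    """Relative reduction in tail LPIPS; positive means less drift than baseline."""
    base = tail_lpips(baseline_frames)
    patched = tail_lpips(patched_frames)
    return (base - patched) / base


# Toy 80-frame curves (made-up numbers): LPIPS grows linearly with frame index.
baseline = [0.30 + 0.005 * t for t in range(80)]
patched = [0.30 + 0.004 * t for t in range(80)]
print(round(relative_drift_reduction(baseline, patched), 3))
```

Scoring only the tail frames focuses the metric on accumulated drift, since early frames stay close to the context frames in both arms.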
123 changes: 123 additions & 0 deletions 2.0/problems/nanowm_rollout_speedup/AUDIT_REPORT.md

Large diffs are not rendered by default.

97 changes: 97 additions & 0 deletions 2.0/problems/nanowm_rollout_speedup/DESIGN.md
@@ -0,0 +1,97 @@
# Design — nanowm_rollout_speedup

## 1. Task

Optimize the inference **latency** of a frozen video world model's
autoregressive long-rollout, at iso-quality. The agent submits a Python-only
patch to the diffusion **sampling** layer of Nano World Models (arXiv:2605.23993);
the judge runs a fixed NanoWM-L/2 CSGO 50-frame rollout, scores wall-clock
speedup over the unpatched baseline, gated by an LPIPS-vs-GT quality guardrail.

Shape and scoring mirror `vllm_llm_serving_optimization` (#145): patch a source
tree, build/run on a Modal GPU from a CPU judge, score `100·log2(speedup)` with
an accuracy guardrail. **No change to the general 2.0 adapter/template** — GPU is
per-problem via Modal.

## 2. Why this is a real task (calibration)

Measured on Della H100, NanoWM-L/2 CSGO, 50-frame rollout, 12 held-out episodes,
LPIPS vs ground truth (means over **all** frames incl. the 4 context frames — the
same convention the judge scores on; generated-only means are ~0.045 higher):

| DDIM steps | speedup | LPIPS-vs-GT | Δ vs seq@50 |
|---|---|---|---|
| 50 (baseline) | 1× | 0.517 | — |
| 20 | 2.5× | 0.543 | +5% |
| 10 | 5× | 0.543 | +5% |
| 5 | 10× | 0.579 | +12% |
| 2 | 25× | 0.676 | +31% |

CSGO has a genuine steps↔quality frontier (paper Fig. 6): naive step-cutting
degrades quality fast (seq@5 = +12%, seq@2 = +31%). With a 3% guardrail, even
seq@20 (+5%) fails, so a winning patch must reproduce ~50-step quality with less
compute via real fast-sampling techniques. (Contrast: on the visually-trivial
PushT domain, seq@2 reproduces seq@250 within the stochastic noise floor — no
frontier; CSGO was chosen precisely because the frontier is real.)
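
The guardrail arithmetic behind the table can be checked with a few lines of Python. The LPIPS values and the 3% tolerance are taken from this section; the function names are illustrative only and are not part of the task package.

```python
# LPIPS-vs-GT means from the calibration table (DDIM steps -> LPIPS).
LPIPS_BY_STEPS = {50: 0.517, 20: 0.543, 10: 0.543, 5: 0.579, 2: 0.676}
QUALITY_TOLERANCE = 0.03  # 3% relative LPIPS rise allowed


def relative_rise(lpips, baseline):
    """Relative LPIPS increase over the seq@50 baseline."""
    return lpips / baseline - 1.0


def passes_guardrail(steps):
    """True if naive step-cutting to `steps` stays within the tolerance."""
    rise = relative_rise(LPIPS_BY_STEPS[steps], LPIPS_BY_STEPS[50])
    return rise <= QUALITY_TOLERANCE


for steps in sorted(LPIPS_BY_STEPS, reverse=True):
    rise = relative_rise(LPIPS_BY_STEPS[steps], LPIPS_BY_STEPS[50])
    verdict = "pass" if passes_guardrail(steps) else "fail"
    print(f"seq@{steps}: {rise:+.1%} -> {verdict}")
```

Every row except the 50-step baseline exceeds the tolerance, which is the sense in which naive step-cutting alone cannot score.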

## 3. Patch policy (validated before running)

Python-only (`.py`/`.pyi`), ≤256 KB, no file deletion, safe paths.

- **Allowed:** `src/diffusion/**.py`, `src/sample/sampling_utils.py` (the sampling
layer: scheduling matrices, the DDIM/diffusion-forcing loop, solvers, caches).
- **Denied:** model (`src/models/**`), VAE (`src/latent_codecs/**`), metric
(`src/sample/evaluate_metrics.py`), harness (`src/sample/rollout.py`), data
(`src/wm_datasets/**`), training/eval/utils, native/build/deps.
- **Rejected added-line tokens:** `FRONTIER_*/JUDGE_/HARBOR_/MODAL_/HF_TOKEN`,
metric/GT/timing identifiers, `os.environ`, `subprocess`, `socket`,
`time.sleep`, `while True`, hard-coded judge paths — i.e. no benchmark
detection, env-var leakage, output hard-coding, or timing short-circuits.
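
A validator for the policy above might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the task's actual checker: `check_patch`, the token subset, and the plain-text diff parsing are all hypothetical, and the example paths in the usage lines are placeholders. It treats `src/diffusion/**.py` as "any `.py` file under `src/diffusion/`", relying on `fnmatch` letting `*` match across `/`.

```python
import fnmatch

MAX_PATCH_BYTES = 256 * 1024
ALLOWED_GLOBS = ("src/diffusion/*.py", "src/sample/sampling_utils.py")
# Illustrative subset of the rejected tokens listed above.
REJECTED_TOKENS = ("os.environ", "subprocess", "socket", "time.sleep",
                   "while True", "JUDGE_", "HARBOR_", "MODAL_", "HF_TOKEN")


def check_patch(patch_text):
    """Return a list of policy violations for a unified diff (empty = accepted)."""
    errors = []
    if len(patch_text.encode("utf-8")) > MAX_PATCH_BYTES:
        errors.append("patch exceeds 256 KB")
    for line in patch_text.splitlines():
        if line.startswith("+++ "):
            # File header: the path the hunk writes to.
            target = line[4:].strip()
            if target == "/dev/null":
                errors.append("file deletion is not allowed")
                continue
            path = target[2:] if target.startswith("b/") else target
            if path.startswith("/") or ".." in path.split("/"):
                errors.append(f"unsafe path: {path}")
            elif not any(fnmatch.fnmatch(path, g) for g in ALLOWED_GLOBS):
                errors.append(f"path outside the sampling layer: {path}")
        elif line.startswith("+"):
            # Added line: scan for tokens that suggest benchmark gaming.
            for token in REJECTED_TOKENS:
                if token in line:
                    errors.append(f"rejected token in added line: {token}")
    return errors


# Hypothetical file names, for illustration only.
good = ("--- a/src/diffusion/example.py\n+++ b/src/diffusion/example.py\n"
        "@@ -1 +1 @@\n-x = 1\n+x = 2\n")
bad = ("--- a/src/sample/rollout.py\n+++ b/src/sample/rollout.py\n"
       "@@ -1 +1 @@\n-x = 1\n+import subprocess\n")
print(check_patch(good))
print(check_patch(bad))
```

Scanning only added lines keeps the check from rejecting a patch merely because the unmodified context it touches already mentions a listed token.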

The rollout invocation (length/context/nominal-steps/scheduling) is fixed by the
judge; the agent changes only sampler internals (cf. #145 fixing the serving
config and patching the scheduler).

## 4. GPU on Modal (judge stays CPU)

`speedup_eval/modal_app.py` runs one rollout in a Modal GPU function from baked
assets (NanoWM checkout + L/2 CSGO ckpt + held-out CSGO episode subset);
`orchestrate.run_pair` computes the cached vanilla baseline once and the patched
run per submission. A `local` backend (`orchestrate._run_local`) runs the same
`speedup_eval.runner` on a directly-visible GPU — this is the path validated on
Della H100. End-to-end Modal execution awaits maintainer Modal credentials.
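
The "baseline once, patched run per submission" pattern can be sketched with a small JSON cache. This is an assumed shape, not the code in `orchestrate.run_pair`: `load_or_compute_baseline`, the metric keys, and the stand-in compute function are hypothetical.

```python
import json
import os
import tempfile


def load_or_compute_baseline(cache_path, compute_fn):
    """Return cached baseline metrics, computing and storing them on first use."""
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            return json.load(fh)
    metrics = compute_fn()
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w") as fh:
        json.dump(metrics, fh)
    return metrics


calls = []


def fake_baseline_rollout():
    """Stand-in for the expensive unpatched GPU rollout."""
    calls.append(1)
    return {"seconds": 10.0, "lpips": 0.5}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "baseline", "baseline_metrics.json")
    first = load_or_compute_baseline(path, fake_baseline_rollout)
    second = load_or_compute_baseline(path, fake_baseline_rollout)

print(first == second, len(calls))
```

Caching the vanilla arm means each submission pays only for its own patched rollout.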

## 5. Scoring

`speedup_eval/scoring.py` (shared with the public test): geomean rollout speedup
→ `clip(100·log2(speedup), 0, 100) · quality_multiplier`, where the multiplier is
1.0 while the relative LPIPS rise stays within `quality_tolerance` and decays
inverse-proportionally to the rise beyond it. The bounded score saturates at 2×
speedup; `score_unbounded` keeps rewarding speedup past 2×.
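
A minimal sketch of this scoring rule follows. The clip and `100·log2` terms are from the text above; the exact decay form (`tolerance / rise`), the function signatures, and the treatment of the unbounded variant as "same formula without the upper clip" are assumptions, since `scoring.py` itself is not shown here.

```python
import math


def geomean(values):
    """Geometric mean of per-clip speedups."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def quality_multiplier(baseline_lpips, patched_lpips, tolerance=0.03):
    """1.0 within tolerance; beyond it, ASSUMED to decay as tolerance / rise."""
    rise = patched_lpips / baseline_lpips - 1.0
    if rise <= tolerance:
        return 1.0
    return tolerance / rise


def score(speedups, baseline_lpips, patched_lpips, tolerance=0.03, bounded=True):
    """Bounded score in [0, 100]; bounded=False drops the upper clip (assumption)."""
    raw = max(100.0 * math.log2(geomean(speedups)), 0.0)
    if bounded:
        raw = min(raw, 100.0)
    return raw * quality_multiplier(baseline_lpips, patched_lpips, tolerance)


# Reference-solution numbers from this document: 1102.1 s -> 944.0 s,
# LPIPS 0.523 -> 0.532.
print(round(score([1102.1 / 944.0], 0.523, 0.532), 1))
```

The log scale makes each doubling of speed worth the same number of points, so the bounded score saturates exactly at a 2× speedup.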

## 6. Reference solution

`reference.patch`: a one-line bf16-autocast wrap of the sampling loop in
`gaussian_diffusion.dfot_sample_loop` — a quality-preserving speedup that beats
the fp32 baseline (CI requires reference > baseline). The intended frontier
(DPM-Solver++, caching, distillation) is left to the agent.

**Validated end-to-end on Della H100 (local backend, 16 CSGO clips, seed 42,
sampling-region timed):** baseline 1102.1 s / LPIPS 0.523 → bf16-patched 944.0 s /
LPIPS 0.532 ⇒ **1.17× speedup, LPIPS +1.7% (within the 3% guardrail),
quality_multiplier 1.0, score 22.3** (reference > baseline ✓). The judge seeds the
rollout deterministically per clip (common random numbers), so the baseline and
patched arms share initial noise: a **no-op patch scores 0.15 (≈0, ungameable)**
and residual wall-clock noise is **0.24%** (the region timer excludes model/VAE/
dataset load and the VAE decode the patch cannot touch). Patch policy validated
(accepts reference; rejects metric edits + env-var leakage); smoke path returns
1.0 with the empty reference on CPU. The frontier (seq@2 = +31% LPIPS) leaves wide
headroom above the 1.17× reference for real fast-sampling patches. **Pending:**
end-to-end Modal execution (maintainer credentials).
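
The common-random-numbers idea can be illustrated with the standard library alone. This is a sketch under assumptions: the judge presumably seeds its own tensor RNG, and the seed-derivation scheme and function names here are hypothetical. Only the base seed of 42 and the "same initial noise per clip for both arms" property come from the text above.

```python
import hashlib
import random


def clip_seed(base_seed, clip_index):
    """Derive a deterministic per-clip seed from the run seed and clip index."""
    digest = hashlib.sha256(f"{base_seed}:{clip_index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def initial_noise(base_seed, clip_index, size):
    """Stand-in for the sampler's initial Gaussian noise for one clip."""
    rng = random.Random(clip_seed(base_seed, clip_index))
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


# Baseline and patched arms draw identical noise for the same clip...
baseline_noise = initial_noise(42, clip_index=3, size=4)
patched_noise = initial_noise(42, clip_index=3, size=4)
# ...while a different clip gets different noise.
other_clip_noise = initial_noise(42, clip_index=4, size=4)
print(baseline_noise == patched_noise, baseline_noise == other_clip_noise)
```

Sharing noise across arms removes sampling variance from the LPIPS comparison, which is why a no-op patch lands near zero instead of scattering around it.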

## 7. Open items for maintainers

- Modal end-to-end run + a deployed `modal_app` app name / GPU type confirmation.
- Bake-asset provenance for the judge image (ckpt + held-out CSGO subset + cached
baseline metrics); the held-out episode ids are the only hidden component.
- CI smoke (`FRONTIER_NWM_SMOKE=1`, CPU) validates the patch policy + empty
reference; confirm this matches the repo's CI expectation for GPU/Modal tasks.
64 changes: 64 additions & 0 deletions 2.0/problems/nanowm_rollout_speedup/config.yaml
@@ -0,0 +1,64 @@
tag: systems
runtime:
# Submission is a Python-only source patch (the real reference is
# reference.patch). `language: python` keeps the file extension/CLI conventions
# standard (mirrors vllm_llm_serving_optimization, #145); there is no separate
# "patch" language in the framework.
language: python
timeout_seconds: 21600
environment: >-
Python-only patch against a clean NanoWM checkout (Nano World Models,
arXiv:2605.23993); Modal GPU runs the NanoWM-L/2 CSGO 50-frame long-rollout;
speedup-vs-baseline judge with an LPIPS rollout-quality guardrail
apt_packages:
- bash
- ca-certificates
- curl
- git
- python3
- python3-pip
judge_apt_packages:
- bash
- ca-certificates
- curl
- git
- python3
- python3-pip
judge_pip_packages:
- modal
docker:
# Experimental local images; build with docker/build_images.sh before a local
# Harbor trial. Both bake a clean NanoWM checkout + the L/2 CSGO ckpt; the
# judge image additionally vendors the held-out CSGO episode subset, the
# LPIPS scorer, and the cached vanilla baseline metrics.
image: frontiercs/nanowm-rollout-speedup-agent:experimental-v0
judge_image: frontiercs/nanowm-rollout-speedup-judge:experimental-v0
environment:
cpus: 8
memory_mb: 32768
storage_mb: 32768
build_timeout_seconds: 5400
evaluation:
# GPU served on Modal (one per environment); judge container is CPU-only.
model: nanowm_l2_csgo
dataset: game/csgo
gpu: L40S
# FIXED rollout invocation (the agent's patch changes sampler internals, not these).
rollout_length: 50
history_length: 4
num_steps: 50 # nominal reference DDIM budget
scheduling: sequential
history_stab: 0.02
# Quality guardrail: patched rollout LPIPS-vs-GT may rise at most this
# (relative) above the unpatched seq@50 baseline before the score is penalized.
# Calibrated: seq@20 is already +5% over seq@50, so a 3% tolerance forces real
# fast-sampling work (DPM-Solver++, caching, distillation), not naive step cuts.
quality_tolerance: 0.03
quick_clips: 4 # iterative (agent-role) public feedback
final_clips: 16 # final (verifier-role) evaluation
batch_size: 4
baseline_cache_path: /opt/nanowm/baseline/baseline_metrics.json
submission:
kind: file
path: /app/solution.patch
max_queue_size: 2
32 changes: 32 additions & 0 deletions 2.0/problems/nanowm_rollout_speedup/docker/agent/Dockerfile
@@ -0,0 +1,32 @@
# Agent workspace image for nanowm_rollout_speedup (CPU-only client).
# The agent edits the NanoWM checkout under /app/nano-world-model and submits a
# patch; GPU evaluation happens on Modal (judge side). Build via ../build_images.sh.
FROM python:3.11-slim

ENV DEBIAN_FRONTEND=noninteractive PIP_NO_CACHE_DIR=1
RUN apt-get update && apt-get install -y --no-install-recommends \
git curl ca-certificates bash ripgrep patch && rm -rf /var/lib/apt/lists/*

# Pre-install Claude Code + Codex CLI (parity with other 2.0 agent images).
RUN curl -fsSL https://claude.ai/install.sh | bash -s -- && \
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
ENV NVM_DIR="/root/.nvm"
RUN curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash && \
. "$NVM_DIR/nvm.sh" && nvm install 22 && nvm alias default 22 && \
npm install -g @openai/codex@latest && ln -sf "$(which node)" /usr/local/bin/node

WORKDIR /app
# Clean NanoWM checkout the agent patches (native build not needed). NANOWM_COMMIT
# defaults to `main`; pass a commit SHA via --build-arg for a pinned, reproducible base.
ARG NANOWM_COMMIT=main
RUN git clone https://github.com/simchowitzlabpublic/nano-world-model /app/nano-world-model && \
cd /app/nano-world-model && git checkout "$NANOWM_COMMIT"
# Task-local infra patches the agent builds on (decode-OOM + get_seq_length);
# applied so the harness rollout works, outside the agent's editable scope.
COPY task_ctx/infra_patches/ /tmp/infra_patches/
RUN cd /app/nano-world-model && for p in /tmp/infra_patches/*.patch; do \
[ -e "$p" ] && git apply "$p" || true; done && \
git add -A && git -c user.email=t@e -c user.name=t commit -q -m base || true

COPY task_ctx/harbor_app/ /app/
COPY task_ctx/task_pkg/ /app/task/
RUN chmod +x /app/*.sh 2>/dev/null || true
32 changes: 32 additions & 0 deletions 2.0/problems/nanowm_rollout_speedup/docker/build_images.sh
@@ -0,0 +1,32 @@
#!/usr/bin/env bash
# Build the nanowm_rollout_speedup agent + judge images.
# bash 2.0/problems/nanowm_rollout_speedup/docker/build_images.sh [tag]
# Requires the hidden assets staged under $NWM_ASSETS (default below), at the
# paths this script copies from:
# ckpts/nanowm-l2-csgo-100k/model_state_dict.pt
# csgo/1-200/*.hdf5 (held-out CSGO episode subset)
# csgo_subset/{val_files.txt,val_starts.npy}
# baseline/baseline_metrics.json (optional; computed on first trial otherwise)
set -euo pipefail
TAG="${1:-experimental-v0}"
NANOWM_COMMIT="${NANOWM_COMMIT:-main}"
SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
PROB=$(cd "$SCRIPT_DIR/.." && pwd)
NWM_ASSETS="${NWM_ASSETS:-/scratch/gpfs/KARTHIKN/wc9403/project/fcs-2/data}"

CTX=$(mktemp -d); trap 'rm -rf "$CTX"' EXIT
mkdir -p "$CTX/task_ctx"
cp -r "$PROB/speedup_eval" "$CTX/task_ctx/task_pkg"
cp "$PROB/evaluator.py" "$CTX/task_ctx/evaluator.py"
cp -r "$PROB/harbor/app" "$CTX/task_ctx/harbor_app"
cp -r "$PROB/infra_patches" "$CTX/task_ctx/infra_patches" 2>/dev/null || mkdir -p "$CTX/task_ctx/infra_patches"
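# judge-only hidden assets; create the destination parents first, otherwise the
# copies below fail on a missing directory and the error is swallowed by `|| true`
mkdir -p "$CTX/task_ctx/assets/ckpts" "$CTX/task_ctx/assets/data"
cp -r "$NWM_ASSETS/ckpts/nanowm-l2-csgo-100k" "$CTX/task_ctx/assets/ckpts/nanowm-l2-csgo" 2>/dev/null || true
cp -r "$NWM_ASSETS/csgo" "$CTX/task_ctx/assets/data/csgo" 2>/dev/null || true
cp -r "$NWM_ASSETS/csgo_subset" "$CTX/task_ctx/assets/data/csgo_subset" 2>/dev/null || true
# optional precomputed baseline (listed in the header; computed on first trial if absent)
cp -r "$NWM_ASSETS/baseline" "$CTX/task_ctx/assets/baseline" 2>/dev/null || true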

docker build --build-arg NANOWM_COMMIT="$NANOWM_COMMIT" \
-t "frontiercs/nanowm-rollout-speedup-agent:$TAG" -f "$SCRIPT_DIR/agent/Dockerfile" "$CTX"
docker build --build-arg NANOWM_COMMIT="$NANOWM_COMMIT" \
-t "frontiercs/nanowm-rollout-speedup-judge:$TAG" -f "$SCRIPT_DIR/judge/Dockerfile" "$CTX"
echo "Built frontiercs/nanowm-rollout-speedup-{agent,judge}:$TAG"
35 changes: 35 additions & 0 deletions 2.0/problems/nanowm_rollout_speedup/docker/judge/Dockerfile
@@ -0,0 +1,35 @@
# Judge image for nanowm_rollout_speedup (CPU-only; drives a Modal GPU).
# Bakes the task package + clean NanoWM checkout + L/2 CSGO ckpt + held-out CSGO
# episode subset + cached vanilla baseline; the GPU rollout runs on Modal.
FROM python:3.11-slim

ENV DEBIAN_FRONTEND=noninteractive PIP_NO_CACHE_DIR=1 FRONTIER_NWM_PKG=/opt/nanowm/task
RUN apt-get update && apt-get install -y --no-install-recommends \
git curl ca-certificates bash patch && rm -rf /var/lib/apt/lists/*
RUN pip install modal omegaconf hydra-core

WORKDIR /opt/nanowm
ARG NANOWM_COMMIT=main
RUN git clone https://github.com/simchowitzlabpublic/nano-world-model /opt/nanowm/nano-world-model && \
cd /opt/nanowm/nano-world-model && git checkout "$NANOWM_COMMIT"
COPY task_ctx/infra_patches/ /tmp/infra_patches/
RUN cd /opt/nanowm/nano-world-model && for p in /tmp/infra_patches/*.patch; do \
[ -e "$p" ] && git apply "$p" || true; done

# Hidden assets baked by build_images.sh into the build context (layout must
# match the FRONTIER_NWM_* paths below once copied to /opt/nanowm):
# task_ctx/assets/ckpts/nanowm-l2-csgo/model_state_dict.pt
# task_ctx/assets/data/csgo/1-200/*.hdf5 (held-out episode subset)
# task_ctx/assets/data/csgo_subset/{val_files.txt,val_starts.npy}
# task_ctx/assets/baseline/baseline_metrics.json (optional precomputed)
COPY task_ctx/assets/ /opt/nanowm/
COPY task_ctx/task_pkg/ /opt/nanowm/task/
COPY task_ctx/evaluator.py /opt/nanowm/task/evaluator.py

# Modal builds the GPU image from this same checkout/assets at deploy time.
ENV FRONTIER_NWM_REPO=/opt/nanowm/nano-world-model \
FRONTIER_NWM_CKPT=/opt/nanowm/ckpts/nanowm-l2-csgo/model_state_dict.pt \
FRONTIER_NWM_CSGO_DATA=/opt/nanowm/data/csgo \
FRONTIER_NWM_VAL_FILES=/opt/nanowm/data/csgo_subset/val_files.txt \
FRONTIER_NWM_VAL_STARTS=/opt/nanowm/data/csgo_subset/val_starts.npy \
FRONTIER_NWM_BASELINE_CACHE=/opt/nanowm/baseline/baseline_metrics.json \
NWM_BAKE_DIR=/opt/nanowm
14 changes: 14 additions & 0 deletions 2.0/problems/nanowm_rollout_speedup/evaluate.sh
@@ -0,0 +1,14 @@
#!/usr/bin/env bash
# Local CLI evaluation for nanowm_rollout_speedup.
# Full runs need a GPU (local backend) or Modal credentials, plus the baked
# NanoWM assets (FRONTIER_NWM_* env, see speedup_eval/settings.py). Without a GPU
# this falls back to the smoke path (patch-policy validation only), which is what
# repository CI exercises.
set -euo pipefail
SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
export FRONTIER_NWM_PKG="${FRONTIER_NWM_PKG:-$SCRIPT_DIR}"
if ! python3 -c 'import torch; assert torch.cuda.is_available()' >/dev/null 2>&1; then
export FRONTIER_NWM_SMOKE="${FRONTIER_NWM_SMOKE:-1}"
fi
SOLUTION="${1:-$SCRIPT_DIR/reference.patch}"
exec python3 "$SCRIPT_DIR/evaluator.py" "$SOLUTION"