FrontierCS · momoway · Jun 11, 2026 · Jun 11, 2026
diff --git a/2.0/README.md b/2.0/README.md
@@ -47,6 +47,21 @@ applies the submitted patch to a clean skeleton, runs a hidden arena against
 multiple baseline bot families, and scores by mean baseline win rate with a
 small faster-win tiebreak. The online generals.io service is not used.
 
+## vLLM LLM-Serving Optimization
+
+This systems problem asks agents to patch a clean upstream vLLM checkout to
+reduce the end-to-end latency of an LLM serving system on a multi-turn agentic
+workload, while keeping accuracy near a baseline. Its problem ID is
+`vllm_llm_serving_optimization`. The served model is
+`meta-llama/Llama-3.1-8B-Instruct` on a single Modal L40S, and the workload is a
+mini-swe-agent SWE-bench run. The agent submits a Python-only patch and can run
+an async public test (a subset of the final eval set) that returns real latency
+and accuracy feedback. Scoring is the geometric-mean latency speedup versus a
+vanilla-vLLM baseline, gated by an accuracy guardrail: a relative accuracy drop
+of up to 5% from the baseline does not affect the score, and beyond that the score
+decays inverse-proportionally with the relative drop. Like duckdb-e2e, the agent and
+judge run in separate Docker environments.
+
 ## BBOPlace ISPD2005
 
 This VLSI placement problem asks agents to generate macro placement candidates

diff --git a/2.0/problems/vllm_llm_serving_optimization/.dockerignore b/2.0/problems/vllm_llm_serving_optimization/.dockerignore
@@ -0,0 +1,8 @@
+**/__pycache__
+**/*.pyc
+harbor/app/.public_test
+docs
+docker/README.md
+*.md
+reference.patch
+harbor/app/solution.patch
diff --git a/2.0/problems/vllm_llm_serving_optimization/DESIGN.md b/2.0/problems/vllm_llm_serving_optimization/DESIGN.md
@@ -0,0 +1,250 @@
+# vLLM LLM-Serving Optimization — Design & Operations
+
+A Frontier-CS **2.0** systems task. The agent patches a **clean upstream vLLM
+v0.11.0** checkout (Python-only) to reduce the **end-to-end latency** of an LLM
+serving system on a multi-turn agentic workload, while keeping task-solving
+**accuracy** close to a vanilla-vLLM baseline. The served model is
+`meta-llama/Llama-3.1-8B-Instruct` on a single **NVIDIA L40S** provisioned
+on-demand through [Modal](https://modal.com/docs).
+
+> **Validated end-to-end (2026-06-11):** a full Harbor trial with the `codex`
+> agent (`gpt-5.5`) produced a real **1.79× latency geomean speedup** over the
+> baseline at full eval scale (30 SWE-bench instances), accuracy preserved →
+> **score 83.89 / 100**.
+
+---
+
+## 1. Current Setting
+
+All knobs live in `config.yaml` (`evaluation` block) and are baked into the
+judge/agent images as `task_config.json`.
+
+| Parameter | Value | Notes |
+|---|---|---|
+| Served model | `meta-llama/Llama-3.1-8B-Instruct` | gated; HF token required |
+| Serving GPU | **1× NVIDIA L40S** (via Modal) | one GPU per environment |
+| Workload | mini-swe-agent on `princeton-nlp/SWE-bench_Verified` (split `test`) | multi-turn, shared-prefix conversations |
+| Arrival | Poisson, `jps = 0.5` jobs/s | concurrent in-flight conversations |
+| `public_slice` (agent role) | `0:5` | iterative self-test subset |
+| `eval_slice` (final role) | `0:30` | full verification; superset of public |
+| Decoding | `temperature = 0`, `max_completion_tokens = 2048` | greedy, deterministic |
+| `step_limit` | 50 | per-instance agent steps |
+| Accuracy (agent role) | `patch_validity` | cheap proxy for iterative feedback |
+| Accuracy (final role) | `resolve_rate` | **real** SWE-bench resolved fraction — judge mounts the host Docker socket (DooD) and runs the swebench harness against prebuilt testbed images; falls back to `patch_validity` only if no Docker daemon is reachable |
+| `accuracy_tolerance` | `0.05` | ≤5% relative drop ⇒ no penalty |
+| `correctness_smoke_prompts` | 8 | greedy outputs must match baseline token-for-token |
+| Build timeout / per-instance timeout | 5400 s / 1200 s | |
+| Submission | file `/app/solution.patch` (git diff vs `/app/vllm`), `max_queue_size = 2` | async |
+| Container budget | 8 vCPU, 32 GiB RAM, 64 GiB storage | agent **and** judge; GPU is remote on Modal |
+
+**Two roles, two scales.** *Agent role* (iterative `submit.sh` / `public_test`)
+uses `public_slice` + `patch_validity`; *final role* (the Harbor verifier) uses
+`eval_slice` + `resolve_rate`. The public subset is a strict subset of the final
+set, so the self-test is a fast, faithful proxy.
+
+---
+
+## 2. Scoring
+
+The judge serves **baseline (vanilla vLLM)** and the **patched build** on the
+same L40S, under the same workload and the same arrival schedule, and measures
+per-instance end-to-end latency (arrival of an instance's first request →
+completion of its last response), client-side.
+
+**Hard gates → score 0** (checked before any timing):
+1. **Patch policy** (see §3) — disallowed file, non-Python, secret access, or
+   benchmark hard-coding.
+2. **Build** — the patched source must build on Modal (`VLLM_USE_PRECOMPILED`).
+3. **Server health** — `/v1/models` must come up.
+4. **Correctness** — the patched server's greedy outputs must match the baseline
+   **token-for-token** at `temperature 0` on a small smoke set. An optimization
+   must not change what the model generates.
+
+**Latency score** (primary objective — geometric mean of per-instance speedups):
+```
+per_instance_speedup[i] = baseline_latency[i] / patched_latency[i]   # floored at 0.01
+latency_speedup         = geomean(per_instance_speedup)
+latency_score           = clip(100 * log2(latency_speedup), 0, 100)
+```
+`1.0×` → 0 points, `2.0×` → 100 points, regressions → 0. Geomean rewards broad
+speedups over a single large outlier.
+
+**Accuracy guardrail** (multiplier):
+```
+rel_drop = max(0, (baseline_accuracy - patched_accuracy) / baseline_accuracy)
+acc_mult = 1.0                       if rel_drop <= 0.05      # within 5% → no penalty
+acc_mult = clip(0.05 / rel_drop, 0, 1)  otherwise            # inverse-proportional decay
+```
+
+**Final score**:
+```
+score  = clip(latency_score * acc_mult, 0, 100)
+reward = score / 100        # Harbor reward.txt
+```
+A fast build that degrades task quality is penalized in proportion: a 10% relative
+accuracy drop halves the score (`acc_mult = 0.05 / 0.10 = 0.5`). A build within 5% of baseline accuracy is scored purely on its latency improvement.
+
+Authoritative scorer: `evaluator.py` (`full_evaluation`); `serving_eval/scoring.py`
+mirrors it for the agent-side public test's provisional score. When the serving
+stack is unconfigured (no Modal/clean source, e.g. local CI), the evaluator
+returns a `1.0` smoke score so the empty reference patch passes.
+
+---
+
+## 3. Which vLLM files the model may change (Patch Policy)
+
+The patch is validated **before** building. Build uses `VLLM_USE_PRECOMPILED=1`,
+so **only Python source is allowed** (`.py`, `.pyi`); no CUDA/C++, build-system,
+packaging, or dependency changes. New Python files inside allowed areas are OK.
+
+**Strongly allowed** (core scheduling / batching / KV-cache):
+```
+vllm/v1/core/**
+vllm/v1/core/sched/**
+vllm/v1/core/kv_cache_utils.py
+vllm/config/scheduler.py
+vllm/config/cache.py
+```
+
+**Conditionally allowed** (narrow wiring around the engine / request path):
+```
+vllm/v1/worker/**          vllm/v1/engine/**         vllm/v1/executor/**
+vllm/v1/request.py         vllm/v1/outputs.py        vllm/v1/serial_utils.py
+vllm/entrypoints/openai/protocol.py
+vllm/entrypoints/openai/serving_engine.py
+vllm/entrypoints/openai/serving_chat.py
+vllm/entrypoints/openai/serving_completion.py
+vllm/sampling_params.py
+```
+
+**Denied** (rejected outright):
+```
+csrc/** cmake/** CMakeLists.txt setup.py setup.cfg pyproject.toml
+requirements/** requirements*.txt
+tests/** benchmarks/** docs/** examples/** tools/** .github/** docker/** Dockerfile*
+vllm/model_executor/models/**     vllm/model_executor/model_loader/**
+vllm/transformers_utils/**  vllm/lora/**  vllm/distributed/**
+vllm/entrypoints/llm.py  vllm/entrypoints/api_server.py  vllm/entrypoints/cli/**
+vllm/version.py  vllm/_version.py
+```
+
+**Also rejected:** reading or writing judge/Modal/HF/Frontier/Harbor environment
+variables (`MODAL_TOKEN*`, `HF_TOKEN`, `FRONTIER_*`, `HARBOR_*`, `JUDGE_URL`,
+`RUN_OUTPUT_DIR`), scheduler-timestamp leakage, and hard-coding the benchmark,
+dataset, instance ids, or judge paths (`swebench`, `princeton-nlp`,
+`SWE-bench_Verified`, `minisweagent`, …). The server is launched under a fixed
+config; patches that detect the benchmark, sleep, short-circuit generation, or
+otherwise special-case the evaluation are rejected.
+
+> **In practice:** the intended optimization area is *online serving efficiency*
+> — request scheduling, batching, KV-cache management, prefix/prompt-cache reuse,
+> preemption/admission control, queueing, and closely related scheduler/execution
+> wiring. The validated 1.79× run was a single-file change to
+> `vllm/v1/core/sched/scheduler.py`. (Candidate variants during the run also
+> touched `vllm/v1/core/kv_cache_utils.py`, `vllm/v1/core/kv_cache_manager.py`,
+> and `vllm/config/scheduler.py` — all within the allowlist.)
+
+---
+
+## 4. GPU resource management & scheduling (Modal)
+
+**No local GPU.** The agent and judge containers are CPU-only clients
+(8 vCPU / 32 GiB). The single L40S is provisioned **on-demand on Modal** and is
+the *only* place the model runs. This is what makes the agent/judge split cheap
+to host.
+
+### Image build (per submission)
+`serving_eval/modal_app.py` defines a Modal app parametrized entirely via env
+vars (so the same module serves baseline and patched trees):
+- Base `nvidia/cuda:12.9.0-devel-ubuntu22.04` (+ Python 3.12, `uv`).
+- `add_local_dir(<vllm_src>, /src/vllm, copy=True)` bakes the **target source tree**
+  into the image (`copy=True` is required because the next step installs from it).
+- `VLLM_USE_PRECOMPILED=1 uv pip install --system -e .` reuses vLLM's prebuilt
+  CUDA kernels and rebuilds only the Python layer, so per-submission builds take
+  minutes rather than an hour, and the **Python-only patch policy is enforced by
+  construction**.
+- Pinned for reproducibility on a shallow/patched tree:
+  `SETUPTOOLS_SCM_PRETEND_VERSION*` (version detection), a pinned
+  `VLLM_PRECOMPILED_WHEEL_LOCATION` (ABI-matched release wheel — the default
+  derivation falls back to an incompatible nightly), `transformers==4.55.2`
+  (the unpinned upper bound otherwise resolves to an incompatible 5.x), and
+  `hf_transfer`.
+
+### Serving
+```python
+@app.function(gpu="L40S", scaledown_window=900, secrets=[modal.Secret.from_name("huggingface-secret")],
+              volumes={"<hf-cache-dir>": hf_cache, "<vllm-cache-dir>": vllm_cache})  # mount paths elided
+@modal.concurrent(max_inputs=64)
+@modal.web_server(port=8000, startup_timeout=...)
+def serve(): subprocess.Popen("vllm serve <model> --host 0.0.0.0 --port 8000 ...", shell=True)
+```
+- `gpu="L40S"` requests exactly one L40S; `@modal.concurrent(max_inputs=64)` lets one
+  warm container handle many in-flight requests (matching the Poisson workload).
+- `@modal.web_server` exposes vLLM's OpenAI endpoint at a stable
+  `https://…modal.run/v1`; Modal cold-starts the container on first request and
+  serves within `startup_timeout`.
+- **Persisted caches:** a `huggingface` Volume (weights downloaded once, reused
+  across cold starts) and a `vllm` cache Volume.
+- `scaledown_window=900` releases the idle GPU after 15 min — you pay for GPU
+  only while serving/measuring.
+
+### Lifecycle & scheduling (`serving_eval/serving.py`)
+```
+deploy_server() → `modal deploy modal_app.py` (env selects src/model/app-name)
+               → Function.from_name(app, "serve").get_web_url()
+wait_healthy() → poll /v1/models until 200
+... run workload ...
+stop_server()  → `modal app stop <app>`
+```
+- **One L40S per environment is honored by serializing:** baseline and patched
+  are **never served concurrently**. The baseline is measured once and cached
+  (`/opt/vllm-baseline/baseline_metrics.json`); the patched build is then served
+  on its own and its greedy outputs are compared against the cached baseline.
+- **Transient-failure retry:** Modal occasionally evicts an image build under
+  load (`Image build terminated due to external shut-down`, `APP_STATE_STOPPED`,
+  gateway timeouts). `deploy_server` retries such transient deploys with backoff
+  (`deploy_retries`, default 3), running `modal app stop` between attempts; a
+  genuine build error in the patch is non-transient and fails fast.
+- Auth inside the containers is env-var based (`MODAL_TOKEN_ID` /
+  `MODAL_TOKEN_SECRET`); gated Llama weights are pulled inside the Modal serving
+  container via the Modal Secret `huggingface-secret` (key `HF_TOKEN`).
+
+### Where Modal is used from
+Both the **agent's async public test** (`harbor/app/public_test.py` →
+`serving_eval.run_public_test`) and the **judge's measurement**
+(`evaluator.py` → `serving_eval.run_measurement`) drive Modal the same way, so
+the iterative feedback the agent sees is the same kind the judge grades on.
+
+### Real resolve-rate (Docker-out-of-Docker) — separate from the GPU
+
+Accuracy is *task-solving* quality, not a GPU concern: the **CPU-side** SWE-bench
+evaluation runs locally, not on Modal. For the final role the judge mounts the
+**host Docker socket** (`/var/run/docker.sock`) so it can run two things against
+real per-instance testbeds:
+- the **workload sandbox** (`serving_eval/sandbox.py` `DockerSandbox`) — the
+  agent's shell commands execute inside `swebench/sweb.eval.x86_64.<instance>`
+  at `/testbed` (network-isolated), instead of the `LocalSandbox` fallback;
+- the **resolve harness** (`serving_eval/accuracy.py` →
+  `swebench.harness.run_evaluation`, `namespace="swebench"`, `modal=False`), which
+  pulls the prebuilt eval image, applies the model's patch, runs the repo's `FAIL_TO_PASS`
+  and `PASS_TO_PASS` tests, and reports the **resolved fraction** (`proxy_used=False`).
+
+These testbed containers run as **siblings on the host daemon**, fully separate
+from the Modal L40S that serves the model. Cost note: each eval image is
+~2–8 GB and a resolve takes ~2 min/instance, so a full `eval_slice 0:30`
+resolve pulls ~100+ GB of images. Without the socket (e.g. local CI) the judge
+auto-degrades to `patch_validity` and flags `proxy_used=True`.
+
+---
+
+## 5. File map
+
+```
+config.yaml          resources, model, L40S, dataset, eval knobs (→ task_config.json)
+readme               public problem statement (no algorithm hints)
+evaluator.py         patch policy + scoring + orchestration (+ local smoke degrade)
+serving_eval/        settings · modal_app · serving · sandbox · agent_runner ·
+                     accuracy · correctness · scoring · measure
+docker/              agent + judge Dockerfiles, build/smoke scripts
+harbor/app/          make_submission.sh, public_test client
+```
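Review note on DESIGN.md §2: the scoring formulas there can be sketched in a few lines of Python, which makes it easy to sanity-check numbers such as the validated run. This is a minimal sketch, not the code in `evaluator.py`: the function names are illustrative, latencies are assumed to be strictly positive, and the handling of a zero baseline accuracy is an assumption because the document does not specify it.

```python
import math


def latency_score(baseline, patched, floor=0.01):
    """Geomean of per-instance speedups, mapped to 0..100 via log2."""
    # Assumes strictly positive latencies; each speedup is floored at `floor`.
    speedups = [max(b / p, floor) for b, p in zip(baseline, patched)]
    geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    return min(max(100.0 * math.log2(geomean), 0.0), 100.0)


def accuracy_multiplier(baseline_acc, patched_acc, tolerance=0.05):
    """1.0 within the tolerance, then inverse-proportional decay."""
    if baseline_acc <= 0:
        # Assumption: with a zero baseline there is no accuracy to lose.
        return 1.0
    rel_drop = max(0.0, (baseline_acc - patched_acc) / baseline_acc)
    if rel_drop <= tolerance:
        return 1.0
    return min(max(tolerance / rel_drop, 0.0), 1.0)


def final_score(baseline, patched, baseline_acc, patched_acc):
    """Latency score gated by the accuracy multiplier, clipped to 0..100."""
    score = latency_score(baseline, patched) * accuracy_multiplier(
        baseline_acc, patched_acc
    )
    return min(max(score, 0.0), 100.0)
```

Under this sketch a uniform 2× speedup with a 10% relative accuracy drop scores about 50, and any net latency regression scores 0 regardless of accuracy.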
diff --git a/2.0/problems/vllm_llm_serving_optimization/config.yaml b/2.0/problems/vllm_llm_serving_optimization/config.yaml
@@ -0,0 +1,79 @@
+tag: systems
+runtime:
+  language: python
+  timeout_seconds: 21600
+  environment: "Patched vLLM (v0.11.0) source; Modal L40S GPU serving Llama-3.1-8B-Instruct; mini-swe-agent SWE-bench workload; latency-primary judge with accuracy guardrail"
+  apt_packages:
+    - bash
+    - ca-certificates
+    - curl
+    - git
+    - python3
+    - python3-pip
+  judge_apt_packages:
+    - bash
+    - ca-certificates
+    - curl
+    - git
+    - python3
+    - python3-pip
+  judge_pip_packages:
+    - modal
+    - openai
+    - datasets
+    - huggingface-hub
+  docker:
+    # Experimental local images. Build them with
+    # 2.0/problems/vllm_llm_serving_optimization/docker/build_images.sh before running a
+    # local Harbor trial. Both images need a clean upstream vLLM v0.11.0 checkout
+    # (NOT the continuum fork). The judge image additionally vendors the
+    # mini-swe-agent harness and the latency/accuracy scorer.
+    image: frontiercs/vllm-serving-optimization-agent:experimental-v0.11.0
+    judge_image: frontiercs/vllm-serving-optimization-judge:experimental-v0.11.0
+environment:
+  cpus: 8
+  memory_mb: 32768
+  storage_mb: 65536
+  build_timeout_seconds: 7200
+evaluation:
+  # Model + accelerator served on Modal (one L40S per environment).
+  model: meta-llama/Llama-3.1-8B-Instruct
+  gpu: L40S
+  # Workload: mini-swe-agent on SWE-bench Verified (split test).
+  dataset: princeton-nlp/SWE-bench_Verified
+  dataset_split: test
+  # Iterative (agent-role) public test: a strict subset of the final eval set.
+  public_slice: "0:5"
+  # Final (verifier-role) evaluation: superset of the public slice.
+  eval_slice: "0:30"
+  # Poisson arrival workload (jobs/second). Mirrors a realistic serving load.
+  arrival_mode: jps
+  jps: 0.5
+  workers: 8
+  step_limit: 50
+  temperature: 0.0
+  max_completion_tokens: 2048
+  # Latency aggregation + scoring.
+  latency_metric: mean_e2e_seconds
+  # Accuracy guardrail. Within `accuracy_tolerance` relative drop of baseline =>
+  # no penalty; beyond it the score decays inverse-proportionally.
+  accuracy_tolerance: 0.05
+  agent_accuracy_mode: patch_validity
+  final_accuracy_mode: resolve_rate
+  # Greedy-output correctness smoke (a handful of fixed prompts must match the
+  # baseline token-for-token at temperature 0 before timing is considered).
+  correctness_smoke_prompts: 8
+  # Modal serving knobs.
+  modal_scaledown_seconds: 900
+  modal_startup_timeout_seconds: 1200
+  server_health_timeout_seconds: 1800
+  # Per-phase wall-clock budgets (seconds).
+  build_timeout_seconds: 5400
+  instance_timeout_seconds: 1200
+  # Use a baseline (vanilla vLLM) cached in the judge image when available,
+  # otherwise the judge serves vanilla once and caches it for the trial.
+  baseline_cache_path: /opt/vllm-baseline/baseline_metrics.json
+submission:
+  kind: file
+  path: /app/solution.patch
+  max_queue_size: 2
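Review note on config.yaml: `arrival_mode: jps` with `jps: 0.5` describes Poisson arrivals at 0.5 jobs per second, and `public_slice` / `eval_slice` are `start:stop` strings. A minimal sketch of how such settings could be turned into instance subsets and a fixed arrival schedule, assuming exponential inter-arrival gaps and a seeded RNG so that baseline and patched runs see the same schedule. `parse_slice`, `arrival_offsets`, the seed, and the placeholder instance ids are hypothetical and are not taken from `serving_eval`.

```python
import random


def parse_slice(spec):
    """Turn a 'start:stop' string such as '0:30' into a slice object."""
    start, stop = spec.split(":")
    return slice(int(start) if start else None, int(stop) if stop else None)


def arrival_offsets(n_jobs, jps, seed=0):
    """Cumulative arrival times in seconds for a Poisson process at rate jps."""
    rng = random.Random(seed)  # fixed seed: the same schedule on every run
    t, offsets = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(jps)  # exponential gap with mean 1 / jps seconds
        offsets.append(t)
    return offsets


instances = [f"instance-{i}" for i in range(40)]  # placeholder ids
eval_set = instances[parse_slice("0:30")]
public_set = instances[parse_slice("0:5")]
schedule = arrival_offsets(len(eval_set), jps=0.5)
```

At `jps = 0.5` the mean gap between arrivals is 2 s, so with multi-turn conversations many instances are in flight at once.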