diff --git a/2.0/README.md b/2.0/README.md
index b414435d..ea2deeed 100644
--- a/2.0/README.md
+++ b/2.0/README.md
@@ -47,6 +47,21 @@ applies the submitted patch to a clean skeleton, runs a hidden arena against
 multiple baseline bot families, and scores by mean baseline win rate with a
 small faster-win tiebreak. The online generals.io service is not used.
 
+## vLLM LLM-Serving Optimization
+
+This systems problem asks agents to patch a clean upstream vLLM checkout to
+reduce the end-to-end latency of an LLM serving system on a multi-turn agentic
+workload, while keeping accuracy near a baseline. Its problem ID is
+`vllm_llm_serving_optimization`. The served model is
+`meta-llama/Llama-3.1-8B-Instruct` on a single Modal L40S, and the workload is a
+mini-swe-agent SWE-bench run. The agent submits a Python-only patch and can run
+an async public test (a subset of the final eval set) that returns real latency
+and accuracy feedback. Scoring is the geometric-mean latency speedup versus a
+vanilla-vLLM baseline, gated by an accuracy guardrail: accuracy within 5% of the
+baseline does not affect the score, and beyond that the score decays
+inverse-proportionally with the accuracy drop. Like duckdb-e2e, the agent and
+judge run in separate Docker environments.
+
 ## BBOPlace ISPD2005
 
 This VLSI placement problem asks agents to generate macro placement candidates
diff --git a/2.0/problems/vllm_llm_serving_optimization/.dockerignore b/2.0/problems/vllm_llm_serving_optimization/.dockerignore
new file mode 100644
index 00000000..eedfcafa
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/.dockerignore
@@ -0,0 +1,8 @@
+**/__pycache__
+**/*.pyc
+harbor/app/.public_test
+docs
+docker/README.md
+*.md
+reference.patch
+harbor/app/solution.patch
diff --git a/2.0/problems/vllm_llm_serving_optimization/DESIGN.md b/2.0/problems/vllm_llm_serving_optimization/DESIGN.md
new file mode 100644
index 00000000..1711269c
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/DESIGN.md
@@ -0,0 +1,250 @@
+# vLLM LLM-Serving Optimization — Design & Operations
+
+A Frontier-CS **2.0** systems task. The agent patches a **clean upstream vLLM
+v0.11.0** checkout (Python-only) to reduce the **end-to-end latency** of an LLM
+serving system on a multi-turn agentic workload, while keeping task-solving
+**accuracy** close to a vanilla-vLLM baseline. The served model is
+`meta-llama/Llama-3.1-8B-Instruct` on a single **NVIDIA L40S** provisioned
+on-demand through [Modal](https://modal.com/docs).
+
+> **Validated end-to-end (2026-06-11):** a full Harbor trial with the `codex`
+> agent (`gpt-5.5`) produced a real **1.79× latency geomean speedup** over the
+> baseline at full eval scale (30 SWE-bench instances), accuracy preserved →
+> **score 83.89 / 100**.
+
+---
+
+## 1. Current Setting
+
+All knobs live in `config.yaml` (`evaluation` block) and are baked into the
+judge/agent images as `task_config.json`.
+
+| Parameter | Value | Notes |
+|---|---|---|
+| Served model | `meta-llama/Llama-3.1-8B-Instruct` | gated; HF token required |
+| Serving GPU | **1× NVIDIA L40S** (via Modal) | one GPU per environment |
+| Workload | mini-swe-agent on `princeton-nlp/SWE-bench_Verified` (split `test`) | multi-turn, shared-prefix conversations |
+| Arrival | Poisson, `jps = 0.5` jobs/s | concurrent in-flight conversations |
+| `public_slice` (agent role) | `0:5` | iterative self-test subset |
+| `eval_slice` (final role) | `0:30` | full verification; superset of public |
+| Decoding | `temperature = 0`, `max_completion_tokens = 2048` | greedy, deterministic |
+| `step_limit` | 50 | per-instance agent steps |
+| Accuracy (agent role) | `patch_validity` | cheap proxy for iterative feedback |
+| Accuracy (final role) | `resolve_rate` | **real** SWE-bench resolved fraction — judge mounts the host Docker socket (DooD) and runs the swebench harness against prebuilt testbed images; falls back to `patch_validity` only if no Docker daemon is reachable |
+| `accuracy_tolerance` | `0.05` | ≤5% relative drop ⇒ no penalty |
+| `correctness_smoke_prompts` | 8 | greedy outputs must match baseline token-for-token |
+| Build timeout / per-instance timeout | 5400 s / 1200 s | |
+| Submission | file `/app/solution.patch` (git diff vs `/app/vllm`), `max_queue_size = 2` | async |
+| Container budget | 8 vCPU, 32 GiB RAM, 64 GiB storage | agent **and** judge; GPU is remote on Modal |
+
+**Two roles, two scales.** *Agent role* (iterative `submit.sh` / `public_test`)
+uses `public_slice` + `patch_validity`; *final role* (the Harbor verifier) uses
+`eval_slice` + `resolve_rate`. The public subset is a strict subset of the final
+set, so the self-test is a fast, faithful proxy.
+
+---
+
+## 2. Scoring
+
+The judge serves **baseline (vanilla vLLM)** and the **patched build** on the
+same L40S, under the same workload and the same arrival schedule, and measures
+per-instance end-to-end latency (arrival of an instance's first request →
+completion of its last response), client-side.
+
+**Hard gates → score 0** (checked before any timing):
+1. **Patch policy** (see §3) — disallowed file, non-Python, secret access, or
+   benchmark hard-coding.
+2. **Build** — the patched source must build on Modal (`VLLM_USE_PRECOMPILED`).
+3. **Server health** — `/v1/models` must come up.
+4. **Correctness** — the patched server's greedy outputs must match the baseline
+   **token-for-token** at `temperature 0` on a small smoke set. An optimization
+   must not change what the model generates.
+
+**Latency score** (primary objective — geometric mean of per-instance speedups):
+```
+per_instance_speedup[i] = baseline_latency[i] / patched_latency[i]   # floored at 0.01
+latency_speedup         = geomean(per_instance_speedup)
+latency_score           = clip(100 * log2(latency_speedup), 0, 100)
+```
+`1.0×` → 0 points, `2.0×` or more → 100 points (capped), regressions → 0. Geomean
+rewards broad speedups over a single large outlier.
+
+**Accuracy guardrail** (multiplier):
+```
+rel_drop = max(0, (baseline_accuracy - patched_accuracy) / baseline_accuracy)
+acc_mult = 1.0                       if rel_drop <= 0.05      # within 5% → no penalty
+acc_mult = clip(0.05 / rel_drop, 0, 1)  otherwise            # inverse-proportional decay
+```
+
+**Final score**:
+```
+score  = clip(latency_score * acc_mult, 0, 100)
+reward = score / 100        # Harbor reward.txt
+```
+A build that trades accuracy for speed pays in proportion: a 10% relative drop halves the score, and a 25%
+drop keeps a fifth of it. A build within 5% of baseline accuracy is scored purely on its latency improvement.
+
+Authoritative scorer: `evaluator.py` (`full_evaluation`); `serving_eval/scoring.py`
+mirrors it for the agent-side public test's provisional score. When the serving
+stack is unconfigured (no Modal/clean source, e.g. local CI), the evaluator
+returns a `1.0` smoke score so the empty reference patch passes.
+
+---
+
+## 3. Which vLLM files the model may change (Patch Policy)
+
+The patch is validated **before** building. Build uses `VLLM_USE_PRECOMPILED=1`,
+so **only Python source is allowed** (`.py`, `.pyi`); no CUDA/C++, build-system,
+packaging, or dependency changes. New Python files inside allowed areas are OK.
+
+**Strongly allowed** (core scheduling / batching / KV-cache):
+```
+vllm/v1/core/**
+vllm/v1/core/sched/**
+vllm/v1/core/kv_cache_utils.py
+vllm/config/scheduler.py
+vllm/config/cache.py
+```
+
+**Conditionally allowed** (narrow wiring around the engine / request path):
+```
+vllm/v1/worker/**          vllm/v1/engine/**         vllm/v1/executor/**
+vllm/v1/request.py         vllm/v1/outputs.py        vllm/v1/serial_utils.py
+vllm/entrypoints/openai/protocol.py
+vllm/entrypoints/openai/serving_engine.py
+vllm/entrypoints/openai/serving_chat.py
+vllm/entrypoints/openai/serving_completion.py
+vllm/sampling_params.py
+```
+
+**Denied** (rejected outright):
+```
+csrc/** cmake/** CMakeLists.txt setup.py setup.cfg pyproject.toml
+requirements/** requirements*.txt
+tests/** benchmarks/** benchmark/** docs/** examples/** tools/** .buildkite/** .github/** docker/** Dockerfile* .dockerignore
+vllm/model_executor/models/**     vllm/model_executor/model_loader/**
+vllm/transformers_utils/**  vllm/lora/**  vllm/distributed/**
+vllm/entrypoints/llm.py  vllm/entrypoints/api_server.py  vllm/entrypoints/cli/**
+vllm/version.py  vllm/_version.py
+```
+
+**Also rejected:** reading/writing judge/Modal/HF/Frontier/Harbor environment
+variables (`MODAL_TOKEN*`, `HF_TOKEN`, `FRONTIER_*`, `HARBOR_*`, `JUDGE_URL`,
+`RUN_OUTPUT_DIR`, scheduler-timestamp leakage), and hard-coding the benchmark /
+dataset / instance ids / judge paths (`swebench`, `princeton-nlp`,
+`SWE-bench_Verified`, `minisweagent`, …). The server is launched under a fixed
+config; patches that detect the benchmark, sleep, short-circuit generation, or
+otherwise special-case the evaluation are rejected.
+
+> **In practice:** the intended optimization area is *online serving efficiency*
+> — request scheduling, batching, KV-cache management, prefix/prompt-cache reuse,
+> preemption/admission control, queueing, and closely related scheduler/execution
+> wiring. The validated 1.79× run was a single-file change to
+> `vllm/v1/core/sched/scheduler.py`. (Candidate variants during the run also
+> touched `vllm/v1/core/kv_cache_utils.py`, `vllm/v1/core/kv_cache_manager.py`,
+> and `vllm/config/scheduler.py` — all within the allowlist.)
+
+---
+
+## 4. GPU resource management & scheduling (Modal)
+
+**No local GPU.** The agent and judge containers are CPU-only clients
+(8 vCPU / 32 GiB). The single L40S is provisioned **on-demand on Modal** and is
+the *only* place the model runs. This is what makes the agent/judge split cheap
+to host.
+
+### Image build (per submission)
+`serving_eval/modal_app.py` defines a Modal app parametrized entirely via env
+vars (so the same module serves baseline and patched trees):
+- Base `nvidia/cuda:12.9.0-devel-ubuntu22.04` (+ Python 3.12, `uv`).
+- `add_local_dir(<vllm_src>, /src/vllm, copy=True)` bakes the **target source tree**
+  into the image (`copy=True` is required because the next step installs from it).
+- `VLLM_USE_PRECOMPILED=1 uv pip install --system -e .` — reuses vLLM's prebuilt
+  CUDA kernels and rebuilds only the Python layer ⇒ per-submission builds are
+  minutes, not an hour, and the **Python-only patch policy is enforced by
+  construction**.
+- Pinned for reproducibility on a shallow/patched tree:
+  `SETUPTOOLS_SCM_PRETEND_VERSION*` (version detection), a pinned
+  `VLLM_PRECOMPILED_WHEEL_LOCATION` (ABI-matched release wheel — the default
+  derivation falls back to an incompatible nightly), `transformers==4.55.2`
+  (the unpinned upper bound otherwise resolves to an incompatible 5.x), and
+  `hf_transfer`.
+
+### Serving
+```python
+@app.function(gpu="L40S", scaledown_window=900, secrets=[modal.Secret.from_name("huggingface-secret")],
+              volumes={"<hf_dir>": hf_cache, "<vllm_dir>": vllm_cache})  # mount paths elided
+@modal.concurrent(max_inputs=64)
+@modal.web_server(port=8000, startup_timeout=...)
+def serve(): subprocess.Popen("vllm serve <model> --host 0.0.0.0 --port 8000 ...", shell=True)
+```
+- `gpu="L40S"` requests exactly one L40S; `@modal.concurrent(max_inputs=64)` lets
+  one warm container handle many in-flight requests (matching the Poisson workload).
+- `@modal.web_server` exposes vLLM's OpenAI endpoint at a stable
+  `https://…modal.run/v1`; Modal cold-starts the container on first request and
+  serves within `startup_timeout`.
+- **Persisted caches:** a `huggingface` Volume (weights downloaded once, reused
+  across cold starts) and a `vllm` cache Volume.
+- `scaledown_window=900` releases the idle GPU after 15 min — you pay for GPU
+  only while serving/measuring.
+
+### Lifecycle & scheduling (`serving_eval/serving.py`)
+```
+deploy_server() → `modal deploy modal_app.py` (env selects src/model/app-name)
+               → Function.from_name(app, "serve").get_web_url()
+wait_healthy() → poll /v1/models until 200
+... run workload ...
+stop_server()  → `modal app stop <app>`
+```
+- **One L40S per environment is honored by serializing:** baseline and patched
+  are **never served concurrently**. The baseline is measured once and cached
+  (`/opt/vllm-baseline/baseline_metrics.json`); the patched build is then served
+  on its own and its greedy outputs are compared against the cached baseline.
+- **Transient-failure retry:** Modal occasionally evicts an image build under
+  load (`Image build terminated due to external shut-down`, `APP_STATE_STOPPED`,
+  gateway timeouts). `deploy_server` retries such transient deploys with backoff
+  (`deploy_retries`, default 3), running `modal app stop` between attempts; a
+  genuine build error in the patch is non-transient and fails fast.
+- Auth inside the containers is env-var based (`MODAL_TOKEN_ID` /
+  `MODAL_TOKEN_SECRET`); gated Llama weights are pulled inside the Modal serving
+  container via the Modal Secret `huggingface-secret` (key `HF_TOKEN`).
+
+### Where Modal is used from
+Both the **agent's async public test** (`harbor/app/public_test.py` →
+`serving_eval.run_public_test`) and the **judge's measurement**
+(`evaluator.py` → `serving_eval.run_measurement`) drive Modal the same way, so
+the agent's iterative feedback comes from the same serving path the judge grades on.
+
+### Real resolve-rate (Docker-out-of-Docker) — separate from the GPU
+
+Accuracy is *task-solving* quality, not a GPU concern: the **CPU-side** SWE-bench
+evaluation runs locally, not on Modal. For the final role the judge mounts the
+**host Docker socket** (`/var/run/docker.sock`) so it can run two things against
+real per-instance testbeds:
+- the **workload sandbox** (`serving_eval/sandbox.py` `DockerSandbox`) — the
+  agent's shell commands execute inside `swebench/sweb.eval.x86_64.<instance>`
+  at `/testbed` (network-isolated), instead of the `LocalSandbox` fallback;
+- the **resolve harness** (`serving_eval/accuracy.py` →
+  `swebench.harness.run_evaluation`, `namespace="swebench"`, `modal=False`): pulls the
+  prebuilt eval image, applies the model's patch, runs the instance's `FAIL_TO_PASS` and
+  `PASS_TO_PASS` tests, and reports the **resolved fraction** (`proxy_used=False`).
+
+These testbed containers run as **siblings on the host daemon**, fully separate
+from the Modal L40S that serves the model. Cost note: each eval image is
+~2–8 GB and a resolve takes ~2 min/instance, so a full `eval_slice 0:30`
+resolve pulls ~100+ GB of images. Without the socket (e.g. local CI) the judge
+auto-degrades to `patch_validity` and flags `proxy_used=True`.
+
+---
+
+## 5. File map
+
+```
+config.yaml          resources, model, L40S, dataset, eval knobs (→ task_config.json)
+readme               public problem statement (no algorithm hints)
+evaluator.py         patch policy + scoring + orchestration (+ local smoke degrade)
+serving_eval/        settings · modal_app · serving · sandbox · agent_runner ·
+                     accuracy · correctness · scoring · measure
+docker/              agent + judge Dockerfiles, build/smoke scripts
+harbor/app/          make_submission.sh, public_test client
+```
diff --git a/2.0/problems/vllm_llm_serving_optimization/config.yaml b/2.0/problems/vllm_llm_serving_optimization/config.yaml
new file mode 100644
index 00000000..88a65a10
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/config.yaml
@@ -0,0 +1,79 @@
+tag: systems
+runtime:
+  language: python
+  timeout_seconds: 21600
+  environment: "Patched vLLM (v0.11.0) source; Modal L40S GPU serving Llama-3.1-8B-Instruct; mini-swe-agent SWE-bench workload; latency-primary judge with accuracy guardrail"
+  apt_packages:
+    - bash
+    - ca-certificates
+    - curl
+    - git
+    - python3
+    - python3-pip
+  judge_apt_packages:
+    - bash
+    - ca-certificates
+    - curl
+    - git
+    - python3
+    - python3-pip
+  judge_pip_packages:
+    - modal
+    - openai
+    - datasets
+    - huggingface-hub
+  docker:
+    # Experimental local images. Build them with
+    # 2.0/problems/vllm_llm_serving_optimization/docker/build_images.sh before running a
+    # local Harbor trial. Both images need a clean upstream vLLM v0.11.0 checkout
+    # (NOT the continuum fork). The judge image additionally vendors the
+    # mini-swe-agent harness and the latency/accuracy scorer.
+    image: frontiercs/vllm-serving-optimization-agent:experimental-v0.11.0
+    judge_image: frontiercs/vllm-serving-optimization-judge:experimental-v0.11.0
+environment:
+  cpus: 8
+  memory_mb: 32768
+  storage_mb: 65536
+  build_timeout_seconds: 7200
+evaluation:
+  # Model + accelerator served on Modal (one L40S per environment).
+  model: meta-llama/Llama-3.1-8B-Instruct
+  gpu: L40S
+  # Workload: mini-swe-agent on SWE-bench Verified (split test).
+  dataset: princeton-nlp/SWE-bench_Verified
+  dataset_split: test
+  # Iterative (agent-role) public test: a strict subset of the final eval set.
+  public_slice: "0:5"
+  # Final (verifier-role) evaluation: superset of the public slice.
+  eval_slice: "0:30"
+  # Poisson arrival workload (jobs/second). Mirrors a realistic serving load.
+  arrival_mode: jps
+  jps: 0.5
+  workers: 8
+  step_limit: 50
+  temperature: 0.0
+  max_completion_tokens: 2048
+  # Latency aggregation + scoring.
+  latency_metric: mean_e2e_seconds
+  # Accuracy guardrail. Within `accuracy_tolerance` relative drop of baseline =>
+  # no penalty; beyond it the score decays inverse-proportionally.
+  accuracy_tolerance: 0.05
+  agent_accuracy_mode: patch_validity
+  final_accuracy_mode: resolve_rate
+  # Greedy-output correctness smoke (a handful of fixed prompts must match the
+  # baseline token-for-token at temperature 0 before timing is considered).
+  correctness_smoke_prompts: 8
+  # Modal serving knobs.
+  modal_scaledown_seconds: 900
+  modal_startup_timeout_seconds: 1200
+  server_health_timeout_seconds: 1800
+  # Per-phase wall-clock budgets (seconds).
+  build_timeout_seconds: 5400
+  instance_timeout_seconds: 1200
+  # Use a baseline (vanilla vLLM) cached in the judge image when available,
+  # otherwise the judge serves vanilla once and caches it for the trial.
+  baseline_cache_path: /opt/vllm-baseline/baseline_metrics.json
+submission:
+  kind: file
+  path: /app/solution.patch
+  max_queue_size: 2
diff --git a/2.0/problems/vllm_llm_serving_optimization/docker/README.md b/2.0/problems/vllm_llm_serving_optimization/docker/README.md
new file mode 100644
index 00000000..1de1534b
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/docker/README.md
@@ -0,0 +1,74 @@
+# Experimental vLLM Serving-Optimization Images
+
+This task needs two images, mirroring the duckdb-e2e split: a public **agent**
+image and a private **judge** image. Both bundle a clean upstream vLLM checkout
+and the shared `serving_eval` harness. Build them before running a local Harbor
+trial:
+
+```bash
+bash 2.0/problems/vllm_llm_serving_optimization/docker/build_images.sh
+```
+
+Defaults:
+
+```text
+VLLM_REF=v0.11.0
+AGENT_TAG=frontiercs/vllm-serving-optimization-agent:experimental-v0.11.0
+JUDGE_TAG=frontiercs/vllm-serving-optimization-judge:experimental-v0.11.0
+```
+
+The agent image contains:
+
+```text
+/app/vllm                 # clean upstream vLLM (no continuum, no reference fix)
+/opt/serving_eval         # shared harness, used by the async public test
+/opt/vllm-baseline        # optional precomputed baseline cache
+```
+
+The judge image contains:
+
+```text
+/opt/vllm-clean           # clean upstream vLLM (build + baseline reference)
+/opt/serving_eval         # shared harness, used by the evaluator
+/opt/vllm-baseline        # baseline-metrics cache (filled on first measurement)
+```
+
+## Runtime requirements (important)
+
+Unlike duckdb-e2e, this task does **not** run the model inside the container.
+Both the agent public test and the judge serve `meta-llama/Llama-3.1-8B-Instruct`
+on a **Modal L40S** built from the (patched) vLLM source. The containers
+therefore need:
+
+- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` in the environment (Modal auth). Use a
+  Modal service-user token for unattended runs.
+- A Modal Secret named `huggingface-secret` containing `HF_TOKEN` with access to
+  the gated Llama-3.1 weights (`modal secret create huggingface-secret HF_TOKEN=...`).
+  The container also reads `HF_TOKEN` for the `datasets` download.
+- The judge additionally needs a reachable Docker daemon (mounted socket or
+  DinD) to run the SWE-bench per-instance testbeds for the final resolve-rate.
+  When no daemon is reachable, the harness falls back to a local sandbox and the
+  patch-validity accuracy proxy.
+
+The Modal image build uses `VLLM_USE_PRECOMPILED=1`, so only vLLM's Python layer
+is rebuilt from the submitted source (minutes, not a full CUDA compile). This is
+why the patch policy is Python-only.
+
+## Baseline cache
+
+The judge measures the vanilla (clean-tree) baseline once per role and caches it
+at `/opt/vllm-baseline/baseline_metrics.json`, keyed by role (`agent` / `final`).
+Baseline and patched builds are never served simultaneously, so a single L40S is
+sufficient per environment. To precompute and bake the baseline into the image
+(recommended for faster trials), run the harness against the clean tree offline
+and copy the resulting `baseline_metrics.json` into the image at that path.
+
+## Smoke test
+
+```bash
+bash 2.0/problems/vllm_llm_serving_optimization/docker/smoke_images.sh
+```
+
+This checks that the clean vLLM checkout, the `serving_eval` package, and the
+Modal/OpenAI/datasets clients (plus `swebench` for the judge) are importable, and
+that the judge has the `docker` CLI on its PATH. It does not exercise Modal or a GPU.
diff --git a/2.0/problems/vllm_llm_serving_optimization/docker/agent/Dockerfile b/2.0/problems/vllm_llm_serving_optimization/docker/agent/Dockerfile
new file mode 100644
index 00000000..d06522f5
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/docker/agent/Dockerfile
@@ -0,0 +1,49 @@
+# Agent base image for the vLLM LLM-serving optimization task.
+#
+# Provides a CLEAN upstream vLLM checkout (no continuum / no reference solution)
+# for the agent to modify, the shared serving/eval harness used by the async
+# public test, and the Modal + OpenAI + datasets clients. The Frontier-CS 2.0
+# adapter builds the final agent image ON TOP of this one and copies the harbor
+# app scripts (make_submission.sh, public_test.sh, ...) into /app.
+#
+# Build context must be the task directory (so `serving_eval` is visible):
+#   docker build -f docker/agent/Dockerfile -t <agent_tag> .
+FROM ubuntu:24.04
+
+ARG VLLM_REF=v0.11.0
+ARG DEBIAN_FRONTEND=noninteractive
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        bash \
+        build-essential \
+        ca-certificates \
+        curl \
+        git \
+        python3 \
+        python3-pip \
+        python3-venv \
+        ripgrep && \
+    rm -rf /var/lib/apt/lists/*
+
+RUN pip3 install --break-system-packages --no-cache-dir \
+        modal \
+        openai \
+        "datasets>=2.19" \
+        huggingface-hub
+
+# Clean upstream vLLM source for the agent to modify. This is intentionally the
+# vanilla vLLM project at a pinned tag: the optimization must be re-derived by
+# the agent, not copied from any reference implementation.
+RUN git clone --branch "${VLLM_REF}" --depth 1 https://github.com/vllm-project/vllm.git /app/vllm
+
+# Shared serving/eval harness (importable as `serving_eval` from /opt) used by
+# the async public test client.
+COPY serving_eval /opt/serving_eval
+
+# Optional baseline-metrics cache. When present (precomputed on the clean tree),
+# the public test reports a provisional speedup/score; otherwise it reports raw
+# latency and accuracy only.
+RUN mkdir -p /opt/vllm-baseline
+
+WORKDIR /app
diff --git a/2.0/problems/vllm_llm_serving_optimization/docker/build_images.sh b/2.0/problems/vllm_llm_serving_optimization/docker/build_images.sh
new file mode 100755
index 00000000..a097e0e0
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/docker/build_images.sh
@@ -0,0 +1,26 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+TASK_DIR=$(cd "$SCRIPT_DIR/.." && pwd)
+
+VLLM_REF="${VLLM_REF:-v0.11.0}"
+AGENT_TAG="${AGENT_TAG:-frontiercs/vllm-serving-optimization-agent:experimental-v0.11.0}"
+JUDGE_TAG="${JUDGE_TAG:-frontiercs/vllm-serving-optimization-judge:experimental-v0.11.0}"
+
+# Build context is the task directory so the Dockerfiles can COPY serving_eval.
+docker build \
+  --build-arg "VLLM_REF=$VLLM_REF" \
+  -f "$TASK_DIR/docker/agent/Dockerfile" \
+  -t "$AGENT_TAG" \
+  "$TASK_DIR"
+
+docker build \
+  --build-arg "VLLM_REF=$VLLM_REF" \
+  -f "$TASK_DIR/docker/judge/Dockerfile" \
+  -t "$JUDGE_TAG" \
+  "$TASK_DIR"
+
+echo "Built:"
+echo "  $AGENT_TAG"
+echo "  $JUDGE_TAG"
diff --git a/2.0/problems/vllm_llm_serving_optimization/docker/judge/Dockerfile b/2.0/problems/vllm_llm_serving_optimization/docker/judge/Dockerfile
new file mode 100644
index 00000000..d7e702b7
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/docker/judge/Dockerfile
@@ -0,0 +1,49 @@
+# Judge base image for the vLLM LLM-serving optimization task.
+#
+# Contains the clean upstream vLLM source (the build/baseline reference), the
+# shared serving/eval harness, the SWE-bench resolve-rate harness, the Modal +
+# OpenAI + datasets clients, and the Docker CLI for per-instance SWE-bench
+# testbeds. The Frontier-CS 2.0 adapter builds the final judge image ON TOP of
+# this one (copying judge_server.py + problem_evaluator.py into /judge).
+#
+# Build context must be the task directory (so `serving_eval` is visible):
+#   docker build -f docker/judge/Dockerfile -t <judge_tag> .
+FROM ubuntu:24.04
+
+ARG VLLM_REF=v0.11.0
+ARG DEBIAN_FRONTEND=noninteractive
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        bash \
+        build-essential \
+        ca-certificates \
+        curl \
+        git \
+        python3 \
+        python3-pip \
+        python3-venv \
+        ripgrep \
+        docker.io && \
+    rm -rf /var/lib/apt/lists/*
+
+RUN pip3 install --break-system-packages --no-cache-dir \
+        modal \
+        openai \
+        "datasets>=2.19" \
+        huggingface-hub \
+        "swebench>=3.0"
+
+# Clean upstream vLLM source: the build/serve baseline and the tree the patch is
+# applied to. Must match the agent image's pinned tag.
+RUN git clone --branch "${VLLM_REF}" --depth 1 https://github.com/vllm-project/vllm.git /opt/vllm-clean
+
+# Shared serving/eval harness (importable as `serving_eval` from /opt).
+COPY serving_eval /opt/serving_eval
+
+# Baseline-metrics cache directory. The judge measures the vanilla baseline once
+# (or reads a precomputed cache here) and never serves baseline + patched at the
+# same time, honouring the one-L40S-per-environment budget.
+RUN mkdir -p /opt/vllm-baseline
+
+WORKDIR /judge
diff --git a/2.0/problems/vllm_llm_serving_optimization/docker/smoke_images.sh b/2.0/problems/vllm_llm_serving_optimization/docker/smoke_images.sh
new file mode 100755
index 00000000..6718dbc0
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/docker/smoke_images.sh
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+AGENT_TAG="${AGENT_TAG:-frontiercs/vllm-serving-optimization-agent:experimental-v0.11.0}"
+JUDGE_TAG="${JUDGE_TAG:-frontiercs/vllm-serving-optimization-judge:experimental-v0.11.0}"
+
+echo "[agent] checking $AGENT_TAG"
+docker run --rm "$AGENT_TAG" sh -lc 'set -e
+  test -d /app/vllm/.git
+  git -C /app/vllm rev-parse HEAD >/dev/null
+  test -f /opt/serving_eval/__init__.py
+  python3 -c "import sys; sys.path.insert(0, \"/opt\"); import serving_eval; print(serving_eval.__version__)"
+  python3 -c "import modal, openai, datasets"
+'
+
+echo "[judge] checking $JUDGE_TAG"
+docker run --rm "$JUDGE_TAG" sh -lc 'set -e
+  test -d /opt/vllm-clean/.git
+  git -C /opt/vllm-clean rev-parse HEAD >/dev/null
+  test -f /opt/serving_eval/__init__.py
+  python3 -c "import sys; sys.path.insert(0, \"/opt\"); import serving_eval; print(serving_eval.__version__)"
+  python3 -c "import modal, openai, datasets, swebench"
+  command -v docker >/dev/null
+'
diff --git a/2.0/problems/vllm_llm_serving_optimization/evaluate.sh b/2.0/problems/vllm_llm_serving_optimization/evaluate.sh
new file mode 100755
index 00000000..6f518849
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/evaluate.sh
@@ -0,0 +1,16 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+
+if [[ $# -gt 0 ]]; then
+  exec python3 "$SCRIPT_DIR/evaluator.py" "$@"
+fi
+
+SOLUTION="/work/execution_env/solution_env/solution.patch"
+if [[ ! -f "$SOLUTION" ]]; then
+  echo "Error: Missing $SOLUTION" >&2
+  exit 1
+fi
+
+python3 "$SCRIPT_DIR/evaluator.py" "$SOLUTION"
diff --git a/2.0/problems/vllm_llm_serving_optimization/evaluator.py b/2.0/problems/vllm_llm_serving_optimization/evaluator.py
new file mode 100644
index 00000000..4efdd4d3
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/evaluator.py
@@ -0,0 +1,508 @@
+"""Evaluator for the experimental vLLM LLM-serving latency optimization task.
+
+The agent submits a Python-only patch against a clean upstream vLLM v0.11.0
+checkout. The judge applies the patch, builds and serves the patched vLLM on a
+Modal L40S (``meta-llama/Llama-3.1-8B-Instruct``), runs a mini-swe-agent
+SWE-bench workload, and scores latency speedup vs a vanilla-vLLM baseline gated
+by an accuracy guardrail.
+
+This file contains the self-contained static patch policy and the scoring math.
+The heavy orchestration (Modal deploy, workload run, baseline caching) lives in
+the ``serving_eval`` package that is baked into the judge image. When that
+harness or the serving credentials are not available (for example a local CI
+smoke test), the evaluator validates the patch policy and returns a smoke score
+so the empty reference patch still passes.
+"""
+
+from __future__ import annotations
+
+import fnmatch
+import hashlib
+import json
+import math
+import os
+import re
+import sys
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+MAX_PATCH_BYTES = 1_500_000
+MAX_CHANGED_FILES = 80
+TASK_CONFIG_PATH = Path("/judge/task_config.json")
+SERVING_EVAL_ROOT = "/opt"
+DEFAULT_CLEAN_SOURCE = Path("/opt/vllm-clean")
+
+
+def _load_task_config() -> dict[str, Any]:
+    try:
+        payload = json.loads(TASK_CONFIG_PATH.read_text(encoding="utf-8"))
+    except Exception:
+        return {}
+    return payload if isinstance(payload, dict) else {}
+
+
+TASK_CONFIG = _load_task_config()
+EVALUATION_CONFIG = (
+    TASK_CONFIG.get("evaluation", {}) if isinstance(TASK_CONFIG.get("evaluation"), dict) else {}
+)
+
+
+def _config_value(name: str, default: Any) -> Any:
+    return EVALUATION_CONFIG.get(name, default)
+
+
+def _config_int(name: str, default: int) -> int:
+    try:
+        return int(EVALUATION_CONFIG.get(name, default))
+    except Exception:
+        return default
+
+
+def _config_float(name: str, default: float) -> float:
+    try:
+        return float(EVALUATION_CONFIG.get(name, default))
+    except Exception:
+        return default
+
+
+def _config_str(name: str, default: str) -> str:
+    raw = EVALUATION_CONFIG.get(name, default)
+    return str(raw)
+
+
+ACCURACY_TOLERANCE = _config_float("accuracy_tolerance", 0.05)
+BASELINE_CACHE_PATH = Path(_config_str("baseline_cache_path", "/opt/vllm-baseline/baseline_metrics.json"))
+
+# --------------------------------------------------------------------------- #
+# Patch policy
+# --------------------------------------------------------------------------- #
+
+STRONGLY_ALLOWED_PATTERNS = (
+    "vllm/v1/core/**",
+    "vllm/v1/core/sched/**",
+    "vllm/v1/core/kv_cache_utils.py",
+    "vllm/config/scheduler.py",
+    "vllm/config/cache.py",
+)
+
+CONDITIONALLY_ALLOWED_PATTERNS = (
+    "vllm/v1/worker/**",
+    "vllm/v1/engine/**",
+    "vllm/v1/executor/**",
+    "vllm/v1/request.py",
+    "vllm/v1/outputs.py",
+    "vllm/v1/serial_utils.py",
+    "vllm/entrypoints/openai/protocol.py",
+    "vllm/entrypoints/openai/serving_engine.py",
+    "vllm/entrypoints/openai/serving_chat.py",
+    "vllm/entrypoints/openai/serving_completion.py",
+    "vllm/sampling_params.py",
+)
+
+DENIED_PATTERNS = (
+    "csrc/**",
+    "cmake/**",
+    "CMakeLists.txt",
+    "setup.py",
+    "setup.cfg",
+    "pyproject.toml",
+    "requirements/**",
+    "requirements*.txt",
+    "tests/**",
+    "benchmarks/**",
+    "benchmark/**",
+    "docs/**",
+    "examples/**",
+    "tools/**",
+    ".buildkite/**",
+    ".github/**",
+    "docker/**",
+    "Dockerfile*",
+    ".dockerignore",
+    "vllm/model_executor/models/**",
+    "vllm/model_executor/model_loader/**",
+    "vllm/transformers_utils/**",
+    "vllm/lora/**",
+    "vllm/distributed/**",
+    "vllm/entrypoints/llm.py",
+    "vllm/entrypoints/api_server.py",
+    "vllm/entrypoints/cli/**",
+    "vllm/version.py",
+    "vllm/_version.py",
+)
+
+# Source-line tokens that are forbidden in added code: benchmark/dataset names
+# (anti hard-coding) and secret/judge environment access (anti exfiltration and
+# anti benchmark-detection).
+HARD_CODE_TOKENS = (
+    "swebench",
+    "swe-bench",
+    "swe_bench",
+    "sweb.eval",
+    "minisweagent",
+    "mini-swe",
+    "princeton-nlp",
+    "SWE-bench_Verified",
+)
+
+SECRET_TOKENS = (
+    "MODAL_TOKEN",
+    "MODAL_TOKEN_ID",
+    "MODAL_TOKEN_SECRET",
+    "HF_TOKEN",
+    "HUGGING_FACE_HUB_TOKEN",
+    "HUGGINGFACEHUB",
+    "FRONTIER_",
+    "HARBOR_",
+    "JUDGE_URL",
+    "RUN_OUTPUT_DIR",
+    "scheduler_timestamps",
+)
+
+HARD_CODE_TOKEN_RE = re.compile(
+    "|".join(re.escape(token) for token in HARD_CODE_TOKENS),
+    re.IGNORECASE,
+)
+SECRET_TOKEN_RE = re.compile("|".join(re.escape(token) for token in SECRET_TOKENS))
+
+# Patches must stay Python-only because the Modal build uses VLLM_USE_PRECOMPILED
+# (prebuilt CUDA kernels, Python layer rebuilt). New .py/.pyi files are allowed.
+ALLOWED_SOURCE_EXTENSIONS = (".py", ".pyi")
+
+
+@dataclass(frozen=True)
+class PatchFile:
+    old_path: str
+    new_path: str
+    added_lines: tuple[str, ...]
+    removed_lines: tuple[str, ...]
+
+    @property
+    def path(self) -> str:
+        return self.new_path if self.new_path != "/dev/null" else self.old_path
+
+
+def _match(path: str, patterns: tuple[str, ...]) -> bool:
+    return any(fnmatch.fnmatch(path, pattern) for pattern in patterns)
+
+
+def _is_allowed_source_path(path: str) -> bool:
+    return _match(path, STRONGLY_ALLOWED_PATTERNS) or _match(path, CONDITIONALLY_ALLOWED_PATTERNS)
+
+
+def _invalid(message: str, metrics: dict[str, Any] | None = None):
+    payload = metrics or {}
+    payload.setdefault("valid_patch", 0)
+    return 0.0, 0.0, message, payload
+
+
+def _parse_patch(text: str) -> list[PatchFile]:
+    files: list[PatchFile] = []
+    current_old = ""
+    current_new = ""
+    added: list[str] = []
+    removed: list[str] = []
+    in_file = False
+
+    for line in text.splitlines():
+        if line.startswith("diff --git "):
+            if in_file:
+                files.append(PatchFile(current_old, current_new, tuple(added), tuple(removed)))
+            in_file = True
+            current_old = ""
+            current_new = ""
+            added = []
+            removed = []
+            continue
+        if not in_file:
+            continue
+        if line.startswith("--- "):
+            current_old = line[4:].strip()
+            if current_old.startswith("a/"):
+                current_old = current_old[2:]
+            continue
+        if line.startswith("+++ "):
+            current_new = line[4:].strip()
+            if current_new.startswith("b/"):
+                current_new = current_new[2:]
+            continue
+        if line.startswith("+") and not line.startswith("+++ "):
+            added.append(line[1:])
+            continue
+        if line.startswith("-") and not line.startswith("--- "):
+            removed.append(line[1:])
+
+    if in_file:
+        files.append(PatchFile(current_old, current_new, tuple(added), tuple(removed)))
+    return files
+
+
+def _validate_patch_path(path: str, metrics: dict[str, Any]) -> tuple[bool, str]:
+    if not path or path == "/dev/null":
+        return True, ""
+    if path.startswith("/") or ".." in Path(path).parts:
+        return False, f"unsafe patch path: {path}"
+    if _match(path, DENIED_PATTERNS):
+        return False, f"changed file is outside task boundary: {path}"
+    if not _is_allowed_source_path(path):
+        return False, f"changed file is not allowlisted: {path}"
+    if not path.endswith(ALLOWED_SOURCE_EXTENSIONS):
+        return False, f"only Python source changes are allowed (VLLM_USE_PRECOMPILED build): {path}"
+    return True, ""
+
+
+def validate_patch(patch_path: Path) -> tuple[bool, str, dict[str, Any]]:
+    if not patch_path.exists():
+        return False, "solution patch does not exist", {}
+    size = patch_path.stat().st_size
+    if size > MAX_PATCH_BYTES:
+        return False, f"patch is too large ({size} bytes > {MAX_PATCH_BYTES})", {}
+    text = patch_path.read_text(encoding="utf-8", errors="replace")
+    patch_hash = hashlib.sha256(text.encode("utf-8", errors="replace")).hexdigest()
+    files = _parse_patch(text)
+    metrics: dict[str, Any] = {
+        "patch_bytes": size,
+        "patch_sha256": patch_hash,
+        "changed_files": len(files),
+    }
+    if len(files) > MAX_CHANGED_FILES:
+        return False, f"too many changed files ({len(files)} > {MAX_CHANGED_FILES})", metrics
+
+    for patch_file in files:
+        path = patch_file.path
+        if patch_file.new_path == "/dev/null":
+            return False, f"deleting source files is outside task boundary: {patch_file.old_path}", metrics
+        if patch_file.old_path != "/dev/null" and patch_file.old_path != patch_file.new_path:
+            ok, error = _validate_patch_path(patch_file.old_path, metrics)
+            if not ok:
+                return False, f"rename/copy source is outside task boundary: {error}", metrics
+
+        ok, error = _validate_patch_path(path, metrics)
+        if not ok:
+            return False, error, metrics
+        if not path or path == "/dev/null":
+            return False, "could not determine changed path from patch", metrics
+
+        added_text = "\n".join(patch_file.added_lines)
+        secret_match = SECRET_TOKEN_RE.search(added_text)
+        if secret_match:
+            return False, f"{path}: judge/secret environment access is forbidden ({secret_match.group(0)})", metrics
+        hard_code_match = HARD_CODE_TOKEN_RE.search(added_text)
+        if hard_code_match:
+            return False, f"{path}: benchmark-specific token is forbidden ({hard_code_match.group(0)})", metrics
+
+    metrics["valid_patch"] = 1
+    return True, "patch accepted by static policy", metrics
+
+
+def patch_is_empty(metrics: dict[str, Any]) -> bool:
+    return int(metrics.get("changed_files", 0)) == 0
+
+
+# --------------------------------------------------------------------------- #
+# Scoring
+# --------------------------------------------------------------------------- #
+
+def geometric_mean(values: list[float]) -> float:
+    if not values:
+        return 0.0
+    return math.exp(sum(math.log(max(value, 1e-9)) for value in values) / len(values))
+
+
+def score_from_speedup(speedup: float) -> float:
+    if speedup <= 0:
+        return 0.0
+    raw = 100.0 * math.log2(speedup)
+    return max(0.0, min(100.0, raw))
+
+
+def accuracy_multiplier(baseline_accuracy: float, patched_accuracy: float, tolerance: float) -> float:
+    base = max(baseline_accuracy, 1e-9)
+    rel_drop = max(0.0, (baseline_accuracy - patched_accuracy) / base)
+    if rel_drop <= tolerance:
+        return 1.0
+    return max(0.0, min(1.0, tolerance / rel_drop))
+
+
+def paired_speedups(
+    baseline_latency: dict[str, float],
+    patched_latency: dict[str, float],
+) -> list[float]:
+    speedups: list[float] = []
+    for instance_id, patched_value in patched_latency.items():
+        base_value = baseline_latency.get(instance_id)
+        if base_value is None or patched_value <= 0 or base_value <= 0:
+            continue
+        speedups.append(max(base_value / patched_value, 0.01))
+    return speedups
+
+
+# --------------------------------------------------------------------------- #
+# Error sanitization (black-box safety)
+# --------------------------------------------------------------------------- #
+
+def sanitize_error_text(text: str) -> str:
+    text = re.sub(r"/tmp/[A-Za-z0-9_./-]+", "<tmp>", text)
+    text = re.sub(r"https://[A-Za-z0-9_.:/-]+", "<url>", text)
+    for token in SECRET_TOKENS:
+        text = text.replace(token, "<redacted>")
+    text = re.sub(r"hf_[A-Za-z0-9]+", "<redacted>", text)
+    text = re.sub(r"ak-[A-Za-z0-9]+", "<redacted>", text)
+    return text[-800:]
+
+
+# --------------------------------------------------------------------------- #
+# Serving harness integration
+# --------------------------------------------------------------------------- #
+
+def _serving_harness_available() -> bool:
+    if not DEFAULT_CLEAN_SOURCE.exists():
+        return False
+    if not os.environ.get("MODAL_TOKEN_ID") or not os.environ.get("MODAL_TOKEN_SECRET"):
+        return False
+    return Path(SERVING_EVAL_ROOT, "serving_eval", "__init__.py").exists()
+
+
+def _load_serving_eval():
+    if SERVING_EVAL_ROOT not in sys.path:
+        sys.path.insert(0, SERVING_EVAL_ROOT)
+    import serving_eval  # type: ignore
+
+    return serving_eval
+
+
+def is_final_submission_role() -> bool:
+    return os.environ.get("FRONTIER_SUBMISSION_ROLE", "agent") == "final"
+
+
+def full_evaluation(patch_path: Path, metrics: dict[str, Any]):
+    final_role = is_final_submission_role()
+    metrics["submission_role"] = "final" if final_role else "agent"
+
+    if not _serving_harness_available():
+        # Local smoke / CI: the patch policy passed and the serving stack is not
+        # configured here. Return a positive smoke score so the reference patch
+        # and policy checks can be validated without a GPU or Modal.
+        metrics["full_benchmark"] = 0
+        metrics["serving_harness"] = "unconfigured"
+        return (
+            1.0,
+            1.0,
+            "patch policy smoke passed; vLLM serving harness is not configured in this environment",
+            metrics,
+        )
+
+    serving_eval = _load_serving_eval()
+    measurement = serving_eval.run_measurement(
+        patch_path=str(patch_path),
+        role="final" if final_role else "agent",
+        config=EVALUATION_CONFIG,
+        clean_source=str(DEFAULT_CLEAN_SOURCE),
+        baseline_cache_path=str(BASELINE_CACHE_PATH),
+    )
+
+    public_info = measurement.get("info", {}) if isinstance(measurement.get("info"), dict) else {}
+    for key, value in public_info.items():
+        if isinstance(value, (int, float, str, bool)):
+            metrics[f"info_{key}"] = value
+
+    if not measurement.get("ok", False):
+        gate = str(measurement.get("gate") or "serving evaluation gate failed")
+        metrics["gate"] = gate
+        return _invalid(gate, metrics)
+
+    if not measurement.get("correctness_ok", False):
+        metrics["gate"] = "correctness"
+        return _invalid("patched server generations differ from the baseline at temperature 0", metrics)
+
+    patched = measurement.get("patched", {}) or {}
+    baseline = measurement.get("baseline", {}) or {}
+    patched_latency = {str(k): float(v) for k, v in (patched.get("per_instance_latency") or {}).items()}
+    baseline_latency = {str(k): float(v) for k, v in (baseline.get("per_instance_latency") or {}).items()}
+    speedups = paired_speedups(baseline_latency, patched_latency)
+    if not speedups:
+        return _invalid("no paired latency measurements were produced", metrics)
+
+    gm_speedup = geometric_mean(speedups)
+    latency_score = score_from_speedup(gm_speedup)
+
+    patched_accuracy = float(patched.get("accuracy", 0.0))
+    baseline_accuracy = float(baseline.get("accuracy", 0.0))
+    acc_mult = accuracy_multiplier(baseline_accuracy, patched_accuracy, ACCURACY_TOLERANCE)
+    bounded = max(0.0, min(100.0, latency_score * acc_mult))
+
+    metrics.update(
+        {
+            "full_benchmark": 1,
+            "serving_harness": "modal_l40s",
+            "instances_scored": len(speedups),
+            "latency_geomean_speedup": gm_speedup,
+            "latency_score": latency_score,
+            "baseline_accuracy": baseline_accuracy,
+            "patched_accuracy": patched_accuracy,
+            "accuracy_multiplier": acc_mult,
+        }
+    )
+    return (
+        bounded,
+        bounded,
+        (
+            f"latency geomean speedup {gm_speedup:.4f}x over baseline vLLM; "
+            f"accuracy {patched_accuracy:.4f} vs baseline {baseline_accuracy:.4f} "
+            f"(multiplier {acc_mult:.3f})"
+        ),
+        metrics,
+    )
+
+
+def evaluate(solution_path: str) -> tuple[float, float, str, dict[str, Any]]:
+    patch_path = Path(solution_path)
+    ok, message, metrics = validate_patch(patch_path)
+    if not ok:
+        return _invalid(message, metrics)
+    try:
+        return full_evaluation(patch_path, metrics)
+    except Exception as exc:  # noqa: BLE001 - black-box: never surface raw internals
+        metrics["error_type"] = type(exc).__name__
+        metrics["error_detail"] = sanitize_error_text(str(exc))
+        return _invalid("serving evaluation failed", metrics)
+
+
+def prepare() -> dict[str, Any]:
+    """Report judge readiness without leaking secret values."""
+    return {
+        "task": "vllm_llm_serving_optimization",
+        "serving_harness_available": _serving_harness_available(),
+        "clean_source_present": DEFAULT_CLEAN_SOURCE.exists(),
+        "modal_credentials_present": bool(
+            os.environ.get("MODAL_TOKEN_ID") and os.environ.get("MODAL_TOKEN_SECRET")
+        ),
+        "hf_credentials_present": bool(os.environ.get("HF_TOKEN")),
+        "baseline_cache_present": BASELINE_CACHE_PATH.exists(),
+        "accuracy_tolerance": ACCURACY_TOLERANCE,
+    }
+
+
+def main(argv: list[str]) -> int:
+    if len(argv) != 2:
+        print("Usage: evaluator.py SOLUTION_PATCH", file=sys.stderr)
+        return 2
+    score, score_unbounded, message, metrics = evaluate(argv[1])
+    print(
+        json.dumps(
+            {
+                "score": score,
+                "score_unbounded": score_unbounded,
+                "message": message,
+                "metrics": metrics,
+            },
+            indent=2,
+            sort_keys=True,
+        )
+    )
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main(sys.argv))
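The scoring path in the evaluator above (`geometric_mean`, `score_from_speedup`, `accuracy_multiplier`) can be checked by hand with a small worked example. The sketch below re-implements those three functions in condensed form (`latency_score` stands in for `score_from_speedup`). The instance IDs, latencies, and accuracies are made-up illustrative values, not measured results.

```python
import math


def geometric_mean(values):
    """Geometric mean with the same 1e-9 floor the evaluator applies."""
    if not values:
        return 0.0
    return math.exp(sum(math.log(max(v, 1e-9)) for v in values) / len(values))


def latency_score(speedup):
    """clip(100 * log2(speedup), 0, 100), so a 2x geomean speedup reaches 100."""
    if speedup <= 0:
        return 0.0
    return max(0.0, min(100.0, 100.0 * math.log2(speedup)))


def accuracy_multiplier(baseline_acc, patched_acc, tolerance=0.05):
    """1.0 inside the tolerance band, tolerance / rel_drop outside it."""
    rel_drop = max(0.0, (baseline_acc - patched_acc) / max(baseline_acc, 1e-9))
    return 1.0 if rel_drop <= tolerance else min(1.0, tolerance / rel_drop)


# Hypothetical per-instance latencies in seconds (illustrative values only).
baseline = {"task-a": 120.0, "task-b": 80.0, "task-c": 200.0}
patched = {"task-a": 100.0, "task-b": 80.0, "task-c": 125.0}

# Pair instances by ID, as the evaluator does, then aggregate.
speedups = [baseline[k] / patched[k] for k in patched if k in baseline]
gm = geometric_mean(speedups)
score = latency_score(gm)

# Accuracy falling from 0.40 to 0.36 is a 10% relative drop, which is outside
# the 5% band, so the multiplier is roughly 0.05 / 0.10.
final = score * accuracy_multiplier(0.40, 0.36)
print(f"geomean speedup {gm:.4f}x, latency score {score:.2f}, final {final:.2f}")
```

With a relative accuracy drop twice the tolerance, roughly half of the latency score is lost, which is why a fast but lossy patch scores poorly.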
diff --git a/2.0/problems/vllm_llm_serving_optimization/harbor/app/README.md b/2.0/problems/vllm_llm_serving_optimization/harbor/app/README.md
new file mode 100644
index 00000000..f03b20dc
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/harbor/app/README.md
@@ -0,0 +1,68 @@
+# vLLM LLM-Serving Optimization Starter
+
+The workspace contains a clean upstream vLLM checkout at:
+
+```text
+/app/vllm
+```
+
+Modify vLLM source code to reduce end-to-end serving latency on the agentic
+SWE-bench workload while preserving the model's task-solving accuracy. Only
+Python changes in the allowlisted scheduler/execution/serving areas are
+valid (see the task statement for the exact patch policy). The model is served
+on a Modal L40S with `VLLM_USE_PRECOMPILED`, so CUDA/C++ kernel changes are out
+of scope.
+
+## Submit
+
+```bash
+bash /app/make_submission.sh
+bash /app/submit.sh
+```
+
+`make_submission.sh` stages your changes in `/app/vllm` and writes
+`/app/solution.patch`. `submit.sh` enqueues that patch for the same black-box
+judge used by the final verifier. Submissions are asynchronous — submit early,
+then keep iterating. Use `bash /app/submissions.sh` and
+`bash /app/wait_submission.sh <uuid>` to inspect judge results.
+
+## Public test (local, async, real metrics)
+
+Before (or instead of) submitting, evaluate your working tree yourself:
+
+```bash
+bash /app/public_test.sh launch        # deploys /app/vllm to a Modal L40S, async
+bash /app/public_test.sh status <id>   # latency + accuracy + provisional score
+bash /app/public_test.sh run           # synchronous variant
+```
+
+The public test deploys your patched vLLM to a Modal L40S, serves
+`meta-llama/Llama-3.1-8B-Instruct`, runs the **public instance subset** (a strict
+subset of the final eval set) under the same Poisson arrival workload the judge
+uses, and returns:
+
+- per-instance and mean end-to-end latency,
+- an accuracy signal (patch-validity rate during iterative feedback),
+- a provisional speedup and score versus the baseline (when a baseline cache is
+  available in the image).
+
+This is real serving feedback — latency and accuracy — not a build/compile flag.
+Drive your loop with it: edit vLLM, run the public test, read the returned
+latency/accuracy, adjust.
+
+## What the judge measures
+
+The judge applies `/app/solution.patch` to a clean pinned vLLM tree, builds and
+serves it the same way, runs the workload, and scores **latency speedup vs the
+baseline**, gated by an **accuracy guardrail**: accuracy within 5% of the
+baseline does not affect the score; beyond that the score decays
+inverse-proportionally with the accuracy drop. The patched server must also
+reproduce the baseline's greedy (temperature 0) generations before any timing is
+considered.
+
+## Credentials
+
+Serving the model requires `MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET`, and an
+`HF_TOKEN` (gated Llama-3.1 access), provided to the workspace. Do not read,
+print, or exfiltrate them, and do not reference them from patched vLLM source —
+the patch policy rejects that.
diff --git a/2.0/problems/vllm_llm_serving_optimization/harbor/app/make_submission.sh b/2.0/problems/vllm_llm_serving_optimization/harbor/app/make_submission.sh
new file mode 100755
index 00000000..647e346d
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/harbor/app/make_submission.sh
@@ -0,0 +1,17 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+VLLM_DIR="${VLLM_DIR:-/app/vllm}"
+OUT="${1:-/app/solution.patch}"
+
+if [[ ! -d "$VLLM_DIR/.git" ]]; then
+  echo "vLLM checkout not found at $VLLM_DIR" >&2
+  exit 2
+fi
+
+# Stage everything (including new .py files) and diff the index against HEAD so
+# the patch captures all staged changes. Changes already committed are not included.
+git -C "$VLLM_DIR" add -A
+git -C "$VLLM_DIR" diff --cached --binary > "$OUT"
+bytes=$(wc -c < "$OUT" | tr -d ' ')
+echo "Wrote $OUT ($bytes bytes)"
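Because `make_submission.sh` stages every change with `git add -A`, stray files such as notes or logs end up in the patch and can trip the static patch policy. A minimal sketch of a local pre-check follows. It assumes abbreviated copies of the evaluator's pattern lists, and `precheck` is a hypothetical helper that is not part of the shipped tooling.

```python
import fnmatch
from pathlib import PurePosixPath

# Abbreviated, illustrative copies of the evaluator's pattern lists. The
# authoritative lists live in the evaluator and are longer.
ALLOWED = ("vllm/v1/core/**", "vllm/config/scheduler.py", "vllm/v1/worker/**")
DENIED = ("csrc/**", "tests/**", "vllm/model_executor/models/**")


def precheck(path):
    """Return a rejection reason for a changed path, or None if it looks acceptable."""
    # Absolute paths and parent-directory escapes are rejected outright.
    if path.startswith("/") or ".." in PurePosixPath(path).parts:
        return "unsafe path"
    # fnmatch's `*` also matches `/`, so `dir/**` covers nested files.
    if any(fnmatch.fnmatch(path, pattern) for pattern in DENIED):
        return "denied area"
    if not any(fnmatch.fnmatch(path, pattern) for pattern in ALLOWED):
        return "not allowlisted"
    if not path.endswith((".py", ".pyi")):
        return "not Python source"
    return None
```

Feeding it the output of `git -C /app/vllm status --porcelain` (paths only) before generating the patch gives an early warning. The judge's own check remains the authority.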
diff --git a/2.0/problems/vllm_llm_serving_optimization/harbor/app/public_test.py b/2.0/problems/vllm_llm_serving_optimization/harbor/app/public_test.py
new file mode 100755
index 00000000..c6d15975
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/harbor/app/public_test.py
@@ -0,0 +1,134 @@
+#!/usr/bin/env python3
+"""Async public-test client for the vLLM serving optimization task.
+
+Deploys the current `/app/vllm` working tree to a Modal L40S, runs the public
+instance subset (a strict subset of the final eval set), and reports per-instance
+and aggregate end-to-end latency plus an accuracy signal versus the baseline.
+The returned feedback is the same kind the judge uses — never just a compile/OK
+flag — so it can drive the optimization loop.
+
+Usage:
+    python3 /app/public_test.py launch        # start async run -> prints run id
+    python3 /app/public_test.py status <id>   # poll for the result
+    python3 /app/public_test.py run           # run synchronously
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import subprocess
+import sys
+import uuid
+from pathlib import Path
+
+SERVING_EVAL_ROOT = "/opt"
+TASK_CONFIG_PATH = Path("/app/task_config.json")
+RESULTS_DIR = Path("/app/.public_test")
+VLLM_SRC = os.environ.get("VLLM_DIR", "/app/vllm")
+DEFAULT_BASELINE_CACHE = "/opt/vllm-baseline/baseline_metrics.json"
+
+
+def load_eval_config() -> dict:
+    try:
+        payload = json.loads(TASK_CONFIG_PATH.read_text(encoding="utf-8"))
+    except Exception:
+        return {}
+    evaluation = payload.get("evaluation") if isinstance(payload, dict) else {}
+    return evaluation if isinstance(evaluation, dict) else {}
+
+
+def baseline_cache_path(config: dict) -> str:
+    return str(config.get("baseline_cache_path", DEFAULT_BASELINE_CACHE))
+
+
+def run_sync() -> dict:
+    if SERVING_EVAL_ROOT not in sys.path:
+        sys.path.insert(0, SERVING_EVAL_ROOT)
+    if not os.environ.get("MODAL_TOKEN_ID") or not os.environ.get("MODAL_TOKEN_SECRET"):
+        return {"ok": False, "error": "Modal credentials are not configured (MODAL_TOKEN_ID/SECRET)"}
+    try:
+        import serving_eval  # type: ignore
+    except Exception as exc:  # noqa: BLE001
+        return {"ok": False, "error": f"serving harness unavailable: {type(exc).__name__}"}
+
+    config = load_eval_config()
+    try:
+        return serving_eval.run_public_test(
+            src=VLLM_SRC,
+            config=config,
+            baseline_cache_path=baseline_cache_path(config),
+        )
+    except Exception as exc:  # noqa: BLE001
+        return {"ok": False, "error": f"public test failed: {type(exc).__name__}"}
+
+
+def cmd_run() -> int:
+    print(json.dumps(run_sync(), indent=2, sort_keys=True))
+    return 0
+
+
+def cmd_launch() -> int:
+    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
+    run_id = uuid.uuid4().hex[:12]
+    (RESULTS_DIR / f"{run_id}.json").write_text(json.dumps({"status": "running"}), encoding="utf-8")
+    subprocess.Popen(
+        [sys.executable, str(Path(__file__).resolve()), "_worker", run_id],
+        stdin=subprocess.DEVNULL,
+        stdout=subprocess.DEVNULL,
+        stderr=subprocess.DEVNULL,
+        start_new_session=True,
+    )
+    print(json.dumps({"status": "launched", "run_id": run_id}, indent=2))
+    print(f"poll with: bash /app/public_test.sh status {run_id}")
+    return 0
+
+
+def cmd_worker(run_id: str) -> int:
+    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
+    out = RESULTS_DIR / f"{run_id}.json"
+    try:
+        result = run_sync()
+        out.write_text(json.dumps({"status": "done", "result": result}), encoding="utf-8")
+    except Exception as exc:  # noqa: BLE001
+        out.write_text(json.dumps({"status": "error", "error": type(exc).__name__}), encoding="utf-8")
+    return 0
+
+
+def cmd_status(run_id: str) -> int:
+    out = RESULTS_DIR / f"{run_id}.json"
+    if not out.exists():
+        print(json.dumps({"status": "unknown", "run_id": run_id}, indent=2))
+        return 0
+    try:
+        payload = json.loads(out.read_text(encoding="utf-8"))
+    except Exception:
+        payload = {"status": "running", "run_id": run_id}
+    print(json.dumps(payload, indent=2, sort_keys=True))
+    return 0
+
+
+def main(argv: list[str]) -> int:
+    if len(argv) < 2:
+        print(__doc__)
+        return 2
+    command = argv[1]
+    if command == "run":
+        return cmd_run()
+    if command == "launch":
+        return cmd_launch()
+    if command == "status":
+        if len(argv) < 3:
+            print("Usage: public_test.py status <run_id>", file=sys.stderr)
+            return 2
+        return cmd_status(argv[2])
+    if command == "_worker":
+        if len(argv) < 3:
+            return 2
+        return cmd_worker(argv[2])
+    print(f"unknown command: {command}", file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    raise SystemExit(main(sys.argv))
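The `launch`/`status` pair above leaves polling to the caller. A minimal polling sketch follows, assuming the status payload shape written by `cmd_worker` (`running`, `done`, `error`) and reported by `cmd_status` (`unknown`). `wait_for_result` and `status_via_cli` are hypothetical helper names, and the interval and timeout defaults are arbitrary.

```python
import json
import subprocess
import time


def status_via_cli(run_id):
    """Read one status payload by invoking the public-test client."""
    completed = subprocess.run(
        ["bash", "/app/public_test.sh", "status", run_id],
        capture_output=True,
        text=True,
        check=False,
    )
    try:
        return json.loads(completed.stdout)
    except json.JSONDecodeError:
        # Treat unreadable output as "still running" and retry on the next poll.
        return {"status": "running"}


def wait_for_result(read_status, interval_s=30.0, timeout_s=3600.0, sleep=time.sleep):
    """Poll read_status() until the run leaves 'running' or the timeout elapses."""
    waited = 0.0
    while True:
        payload = read_status()
        if payload.get("status") != "running":
            return payload
        if waited >= timeout_s:
            return {"status": "timeout"}
        sleep(interval_s)
        waited += interval_s
```

Typical use would be `wait_for_result(lambda: status_via_cli(run_id))`. Taking the reader and `sleep` as parameters keeps the loop testable without Modal access.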
diff --git a/2.0/problems/vllm_llm_serving_optimization/harbor/app/public_test.sh b/2.0/problems/vllm_llm_serving_optimization/harbor/app/public_test.sh
new file mode 100755
index 00000000..b94eaf9e
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/harbor/app/public_test.sh
@@ -0,0 +1,10 @@
+#!/usr/bin/env bash
+# Async public-test client. Deploys the current /app/vllm working tree to a Modal
+# L40S, runs the public instance subset, and reports latency + accuracy feedback
+# (not merely whether the build succeeded).
+#
+#   bash /app/public_test.sh launch        # start an async run, prints a run id
+#   bash /app/public_test.sh status <id>   # poll for latency/accuracy result
+#   bash /app/public_test.sh run           # run synchronously and print result
+set -euo pipefail
+exec python3 /app/public_test.py "$@"
diff --git a/2.0/problems/vllm_llm_serving_optimization/harbor/app/solution.patch b/2.0/problems/vllm_llm_serving_optimization/harbor/app/solution.patch
new file mode 100644
index 00000000..e69de29b
diff --git a/2.0/problems/vllm_llm_serving_optimization/readme b/2.0/problems/vllm_llm_serving_optimization/readme
new file mode 100644
index 00000000..3bfa3b59
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/readme
@@ -0,0 +1,238 @@
+# vLLM LLM-Serving Latency Optimization
+
+## Problem
+
+This is an experimental systems task. You are given a pinned, clean checkout of
+[vLLM](https://github.com/vllm-project/vllm) in the Harbor workspace and may
+modify vLLM itself. Your goal is to reduce the **end-to-end latency** of an LLM
+serving system on a realistic multi-turn agentic workload while preserving the
+**accuracy** (task-solving quality) of the served model.
+
+The serving target is a single-GPU deployment of
+`meta-llama/Llama-3.1-8B-Instruct` running on one NVIDIA **L40S**, exposed
+through vLLM's OpenAI-compatible HTTP API. The workload is an agentic
+code-editing benchmark (see *Workload* below) whose requests are long,
+multi-turn conversations that arrive over time as a Poisson process.
+
+The intended optimization area is **online serving efficiency**: request
+scheduling, batching, KV-cache management, prefix/prompt cache reuse,
+preemption and admission control, queueing, and closely related
+scheduler/execution wiring. Strong submissions improve the workload's latency
+distribution without changing what the model actually generates and without
+hard-coding the benchmark, dataset, queries, or judge details.
+
+## Serving Stack (Modal + L40S)
+
+Both your local public test and the hidden judge serve the patched vLLM the
+same way:
+
+- A [Modal](https://modal.com/docs) app builds an image from **your patched
+  vLLM source** and serves `meta-llama/Llama-3.1-8B-Instruct` on one **L40S**
+  through the OpenAI-compatible endpoint (`<url>/v1`).
+- The image is built with `VLLM_USE_PRECOMPILED=1`, which reuses vLLM's
+  prebuilt CUDA kernels and rebuilds only the Python layer. **Your patch must
+  therefore be Python-only** — changes that require recompiling CUDA/C++
+  kernels are out of scope and rejected by the patch policy.
+- The serving runtime (model, GPU, tensor-parallel size, max model length,
+  dtype, and OpenAI server flags) is fixed and identical for the baseline and
+  your patched build. You may not change how the server is launched; you may
+  only change vLLM's internal behavior through allowlisted source files.
+
+Running the model requires Modal and Hugging Face credentials configured in the
+environment (`MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET`, and an `HF_TOKEN` with
+access to the gated Llama-3.1 weights). These are provided to the workspace and
+the judge; do not attempt to read, print, or exfiltrate them.
+
+## Workload
+
+The workload is a [mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent)
+SWE-bench run: each benchmark instance is one agentic task in which the agent
+holds a multi-turn conversation with the served model, issuing shell commands
+in a sandboxed repository between turns. Every turn re-sends the growing
+conversation, so consecutive requests for the same task share a long common
+prefix. Instances arrive over time (Poisson arrivals), so many conversations
+are in flight at once and compete for GPU and KV-cache capacity.
+
+The dataset is the public `princeton-nlp/SWE-bench_Verified` set (split
+`test`), and the agent loop, step limit, and decoding settings
+(temperature `0`) are fixed. Treat this as a representative agentic serving
+workload, not a set of strings to recognize. The hidden judge may include
+additional non-public instance groups and may vary instance order, arrival
+timing, and the number of repetitions. Submissions should implement general
+serving optimizations rather than benchmark-specific special cases.
+
+## Submission
+
+The submitted artifact is a patch file:
+
+```text
+/app/solution.patch
+```
+
+The agent workspace contains a clean vLLM checkout at:
+
+```text
+/app/vllm
+```
+
+After modifying vLLM, generate and submit a patch:
+
+```bash
+bash /app/make_submission.sh
+bash /app/submit.sh
+```
+
+Submissions are asynchronous. Submit a small, plausible first patch as soon as
+you have one, then keep iterating while the judge works. The judge applies
+your patch to a clean pinned vLLM source tree, builds it on Modal, serves it,
+runs the workload, and scores latency and accuracy from the judge side.
+Submitted binaries, build artifacts, generated benchmark files, and local
+timing logs are ignored.
+
+## Public Test (async, latency + accuracy feedback)
+
+You can evaluate your current working tree yourself, without going through the
+judge queue, using the public test client:
+
+```bash
+# Launch an async public-test run (deploys your patched vLLM to Modal L40S,
+# runs the public instance subset, returns a run id):
+bash /app/public_test.sh launch
+
+# Poll for the result (latency + accuracy, not just whether it compiled):
+bash /app/public_test.sh status <run_id>
+
+# Or run synchronously:
+bash /app/public_test.sh run
+```
+
+The public test reports the **same kind of feedback the judge uses**: per-instance
+and aggregate end-to-end latency, an accuracy signal versus the baseline, and a
+provisional score — not merely whether the build succeeded. The public instance
+subset is a strict subset of the final evaluation set, so it is a fast, faithful
+proxy. Use it to drive your optimization loop: change vLLM, rerun the public
+test, read the returned latency/accuracy, and adjust.
+
+## Correctness
+
+Correctness is a gate. The patched server must produce the **same generations**
+as the baseline server on the evaluated workload at temperature `0`. Before any
+timing is considered, the judge runs a small greedy-decoding smoke set and
+requires the patched build's outputs to match the baseline token-for-token.
+Build failures, patch-policy violations, server start-up failures, generation
+mismatches, crashes, timeouts, and out-of-memory failures are penalized before
+performance is considered.
+
+During iterative asynchronous submissions, the judge keeps feedback focused on
+the public instance subset so you can submit early and continue working while
+evaluation runs. During final verification, the judge uses the broader hidden
+instance set and a stricter accuracy measurement.
+
+## Scoring
+
+Valid submissions are scored by **latency speedup relative to the baseline**
+(vanilla vLLM serving the same model on the same L40S, same workload, same
+arrival schedule, same resource limits), gated by an **accuracy guardrail**.
+
+Latency is the end-to-end completion time per benchmark instance (arrival of the
+instance's first request to completion of its last response), measured
+client-side. For each instance a per-instance speedup is computed against the
+baseline, and the primary objective is the **geometric mean** of those
+per-instance speedups:
+
+```text
+per_instance_speedup = baseline_latency[i] / patched_latency[i]
+latency_speedup      = geomean(per_instance_speedup)
+latency_score        = clip(100 * log2(latency_speedup), 0, 100)
+```
+
+A `1.0x` result earns `0` points, regressions also earn `0`, and the score
+saturates at `100` for a `2.0x` geomean speedup. Using the geometric mean favors
+broad speedups across instances over a single large outlier.
+
+Accuracy is the workload's task-solving rate (SWE-bench resolve rate at final
+verification; a patch-validity proxy during iterative feedback). Let
+
+```text
+rel_drop = max(0, (baseline_accuracy - patched_accuracy) / baseline_accuracy)
+```
+
+If `rel_drop <= 0.05` (a relative drop of at most 5%) there is no penalty.
+Otherwise the score decays in inverse proportion to the relative accuracy drop:
+
+```text
+accuracy_multiplier = 1.0                       if rel_drop <= 0.05
+accuracy_multiplier = 0.05 / rel_drop           otherwise
+final_score         = latency_score * accuracy_multiplier
+```
+
+So a fast build that meaningfully degrades task quality loses most of its score,
+while a build that keeps accuracy within 5% of the baseline is scored purely on
+its latency improvement. The raw latency speedup, accuracy, and the multiplier
+are reported in evaluator metrics.
+
+## Patch Policy
+
+The evaluator validates the patch before building. The policy is intentionally
+strict because this task is graded by hidden benchmarks.
+
+Allowed serving/scheduler/execution areas:
+
+```text
+vllm/v1/core/**
+vllm/v1/core/sched/**
+vllm/v1/core/kv_cache_utils.py
+vllm/config/scheduler.py
+vllm/config/cache.py
+```
+
+Conditionally allowed narrow wiring areas:
+
+```text
+vllm/v1/worker/**
+vllm/v1/engine/**
+vllm/v1/executor/**
+vllm/v1/request.py
+vllm/v1/outputs.py
+vllm/v1/serial_utils.py
+vllm/entrypoints/openai/protocol.py
+vllm/entrypoints/openai/serving_engine.py
+vllm/entrypoints/openai/serving_chat.py
+vllm/entrypoints/openai/serving_completion.py
+vllm/sampling_params.py
+```
+
+New Python files are allowed in these areas. The build uses `VLLM_USE_PRECOMPILED`,
+so no build-system, CUDA/C++, packaging, or dependency changes are permitted.
+
+Forbidden areas include CUDA/C++ kernels and build files (`csrc/**`, `cmake/**`,
+`CMakeLists.txt`, `setup.py`, `pyproject.toml`, `requirements/**`), tests,
+benchmarks, docs, examples, CI files, model definitions
+(`vllm/model_executor/models/**`), tokenizer/loader internals, the workload
+harness, and any timing or scoring code.
+
+Patches may not add reads or writes of judge, Modal, Hugging Face, Frontier, or
+Harbor environment variables, and may not hard-code the benchmark name, dataset
+name, instance identifiers, or judge paths in scheduler/execution code. The
+server is launched under a fixed configuration; patches that detect the
+benchmark, sleep, short-circuit generation, or otherwise special-case the
+evaluation are rejected.
+
+## Resource Budget
+
+The experimental Harbor budget is:
+
+```text
+agent/judge container vCPUs: 8
+agent/judge container memory: 32 GiB
+storage: 64 GiB
+served model: meta-llama/Llama-3.1-8B-Instruct
+serving GPU: 1x NVIDIA L40S (via Modal)
+build timeout: 7200 seconds
+per-instance timeout: 1200 seconds
+decoding: temperature 0, fixed max tokens
+```
+
+The judge builds and serves both baseline and patched vLLM under the same fixed
+Modal configuration and the same OpenAI server flags, then runs the workload
+under the same arrival schedule before measuring latency and accuracy.
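
The scoring rules in the problem statement above can be sketched in a few lines of Python. This is an illustrative sketch, not the `scoring.py` module imported elsewhere in this PR. The function names are hypothetical, and two behaviors are assumptions the statement does not specify: only instances present in both runs are compared, and a non-positive baseline accuracy yields a multiplier of `1.0`.

```python
from __future__ import annotations

import math


def latency_score(baseline: dict[str, float], patched: dict[str, float]) -> float:
    """clip(100 * log2(geomean of per-instance speedups), 0, 100)."""
    # Assumption: instances missing from either run are ignored.
    shared = [k for k in baseline if k in patched and baseline[k] > 0 and patched[k] > 0]
    if not shared:
        return 0.0
    # log2 of a geometric mean equals the arithmetic mean of the log2 values.
    mean_log2 = sum(math.log2(baseline[k] / patched[k]) for k in shared) / len(shared)
    return min(100.0, max(0.0, 100.0 * mean_log2))


def accuracy_multiplier(baseline_acc: float, patched_acc: float, tolerance: float = 0.05) -> float:
    """1.0 inside the tolerance band, tolerance / rel_drop outside it."""
    if baseline_acc <= 0:
        return 1.0  # Assumption: no penalty when the baseline itself scores zero.
    rel_drop = max(0.0, (baseline_acc - patched_acc) / baseline_acc)
    return 1.0 if rel_drop <= tolerance else tolerance / rel_drop


def final_score(
    baseline: dict[str, float],
    patched: dict[str, float],
    baseline_acc: float,
    patched_acc: float,
) -> float:
    return latency_score(baseline, patched) * accuracy_multiplier(baseline_acc, patched_acc)
```

With baseline latencies of 10 s and 20 s and patched latencies of 5 s and 20 s, the per-instance speedups are 2.0x and 1.0x, the geometric mean is about 1.41x, and the latency score is 50. If accuracy falls from 0.50 to 0.40 (a 20% relative drop), the multiplier is 0.05 / 0.20 = 0.25 and the final score is about 12.5.
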
diff --git a/2.0/problems/vllm_llm_serving_optimization/reference.patch b/2.0/problems/vllm_llm_serving_optimization/reference.patch
new file mode 100644
index 00000000..e69de29b
diff --git a/2.0/problems/vllm_llm_serving_optimization/reference.py b/2.0/problems/vllm_llm_serving_optimization/reference.py
new file mode 100644
index 00000000..b34d70eb
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/reference.py
@@ -0,0 +1,7 @@
+"""Reference placeholder for the experimental vLLM LLM-serving optimization task.
+
+The Harbor task submits /app/solution.patch. This Python file exists so the
+Frontier-CS 2.0 task layout remains conventional; the valid baseline patch is
+stored in reference.patch (an empty patch, i.e. unmodified vLLM, which is the
+serving baseline).
+"""
diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/__init__.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/__init__.py
new file mode 100644
index 00000000..4bf1a618
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/__init__.py
@@ -0,0 +1,13 @@
+"""vLLM serving evaluation harness (shared by the judge and the public test).
+
+Public API:
+    run_measurement(...)  -> dict   # judge-side: baseline vs patched, gated
+    run_public_test(...)  -> dict   # agent-side: serve working tree, feedback
+"""
+
+from __future__ import annotations
+
+from .measure import run_measurement, run_public_test
+
+__all__ = ["run_measurement", "run_public_test"]
+__version__ = "0.1.0"
diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/accuracy.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/accuracy.py
new file mode 100644
index 00000000..11a91adb
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/accuracy.py
@@ -0,0 +1,170 @@
+"""Accuracy signals for the workload.
+
+Two modes, both comparable across baseline and patched runs:
+
+* ``patch_validity`` (cheap, used for iterative public feedback): the fraction
+  of instances that produced a non-empty, syntactically valid unified diff and
+  reached a submit/limit-with-patch terminal state. For a pure serving/scheduler
+  optimization this should be identical to the baseline; it cheaply catches
+  patches that corrupt generation or truncate context.
+* ``resolve_rate`` (faithful, used for final verification): the SWE-bench
+  resolved fraction computed locally by the ``swebench`` harness (per-instance
+  Docker test execution). Falls back to ``patch_validity`` if the harness is
+  unavailable, flagging that the proxy was used.
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import tempfile
+from pathlib import Path
+from typing import Any
+
+from .agent_runner import InstanceResult
+from .settings import EvalSettings
+
+_TERMINAL_WITH_WORK = {"submitted", "limit_with_patch"}
+
+
+def _looks_like_diff(patch: str) -> bool:
+    if not patch or not patch.strip():
+        return False
+    return ("diff --git" in patch) or ("--- " in patch and "+++ " in patch)
+
+
+def patch_validity_rate(results: list[InstanceResult]) -> float:
+    if not results:
+        return 0.0
+    valid = sum(
+        1
+        for result in results
+        if result.exit_status in _TERMINAL_WITH_WORK and _looks_like_diff(result.patch)
+    )
+    return valid / len(results)
+
+
+def build_predictions(results: list[InstanceResult], model: str) -> dict[str, dict[str, str]]:
+    return {
+        result.instance_id: {
+            "model_name_or_path": model,
+            "instance_id": result.instance_id,
+            "model_patch": result.patch or "",
+        }
+        for result in results
+    }
+
+
+def resolve_rate(
+    results: list[InstanceResult],
+    *,
+    settings: EvalSettings,
+    run_id: str,
+) -> tuple[float, bool]:
+    """Return (accuracy, proxy_used). proxy_used=True if harness unavailable."""
+    try:
+        import swebench  # noqa: F401
+    except Exception:
+        return patch_validity_rate(results), True
+
+    predictions = build_predictions(results, settings.model)
+    instance_ids = [r.instance_id for r in results]
+    with tempfile.TemporaryDirectory(prefix="vllm-serving-opt-acc-") as tmp:
+        preds_path = Path(tmp) / "preds.json"
+        preds_path.write_text(json.dumps(predictions), encoding="utf-8")
+        try:
+            import inspect
+
+            from swebench.harness.run_evaluation import main as run_eval_main  # type: ignore
+        except Exception:
+            return patch_validity_rate(results), True
+
+        # Pass values for every kwarg the installed harness accepts; swebench has
+        # added required params over releases (namespace/modal/rewrite_reports in
+        # 4.x). namespace pulls prebuilt eval images from Docker Hub (no local
+        # image build) and modal=False runs them via the local Docker daemon.
+        all_kwargs: dict[str, Any] = {
+            "dataset_name": settings.dataset,
+            "split": settings.dataset_split,
+            "instance_ids": instance_ids,
+            "predictions_path": str(preds_path),
+            "max_workers": max(1, settings.workers),
+            "run_id": run_id,
+            "timeout": settings.instance_timeout_seconds,
+            "cache_level": "env",
+            "clean": False,
+            "force_rebuild": False,
+            "open_file_limit": 4096,
+            "report_dir": tmp,
+            "namespace": settings.swebench_namespace,
+            "rewrite_reports": False,
+            "modal": False,
+            "instance_image_tag": "latest",
+            "env_image_tag": "latest",
+        }
+        try:
+            params = inspect.signature(run_eval_main).parameters
+        except (TypeError, ValueError):
+            return patch_validity_rate(results), True
+        call_kwargs = {k: v for k, v in all_kwargs.items() if k in params}
+        # If the harness declares a required param we do not recognise, fall back
+        # rather than risk a misleading score from a signature mismatch.
+        missing_required = [
+            name for name, p in params.items()
+            if p.default is inspect.Parameter.empty and name not in call_kwargs
+            and p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)  # *args/**kwargs are optional
+        ]
+        if missing_required:
+            return patch_validity_rate(results), True
+
+        # swebench writes its summary report (<model>.<run_id>.json) and logs to
+        # the process CWD, so run it with CWD pinned to our temp dir to collect
+        # everything in one place for _read_resolved_count.
+        prev_cwd = os.getcwd()
+        try:
+            os.chdir(tmp)
+            run_eval_main(**call_kwargs)
+        except Exception:
+            return patch_validity_rate(results), True
+        finally:
+            try:
+                os.chdir(prev_cwd)
+            except Exception:
+                pass
+
+        resolved = _read_resolved_count(Path(tmp), settings.model)
+        if resolved is None:
+            return patch_validity_rate(results), True
+        total = max(1, len(results))
+        return resolved / total, False
+
+
+def _read_resolved_count(report_dir: Path, model: str) -> int | None:
+    candidates = list(report_dir.glob("*.json")) + list(report_dir.glob("**/*report*.json"))
+    for path in candidates:
+        try:
+            payload = json.loads(path.read_text(encoding="utf-8"))
+        except Exception:
+            continue
+        if not isinstance(payload, dict):
+            continue
+        for key in ("resolved_instances", "resolved"):
+            value = payload.get(key)
+            if isinstance(value, int):
+                return value
+            if isinstance(value, list):
+                return len(value)
+    return None
+
+
+def compute_accuracy(
+    results: list[InstanceResult],
+    *,
+    settings: EvalSettings,
+    mode: str,
+    run_id: str,
+) -> dict[str, Any]:
+    if mode == "resolve_rate":
+        accuracy, proxy_used = resolve_rate(results, settings=settings, run_id=run_id)
+        return {"accuracy": accuracy, "mode": "resolve_rate", "proxy_used": proxy_used}
+    return {"accuracy": patch_validity_rate(results), "mode": "patch_validity", "proxy_used": False}
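
The signature-filtering step in `resolve_rate` above can be exercised on its own. The following is a minimal sketch under stated assumptions: `filter_kwargs` is a hypothetical helper, not part of this PR or of `swebench`. It keeps only the keyword arguments the callee declares and reports required parameters that were not supplied, skipping `*args` and `**kwargs` because those never need a value.

```python
from __future__ import annotations

import inspect
from typing import Any, Callable


def filter_kwargs(
    func: Callable[..., Any], candidates: dict[str, Any]
) -> tuple[dict[str, Any], list[str]]:
    """Return (accepted kwargs, names of required params left unfilled)."""
    params = inspect.signature(func).parameters
    accepted = {k: v for k, v in candidates.items() if k in params}
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    missing = [
        name
        for name, p in params.items()
        if p.default is inspect.Parameter.empty  # no default declared
        and p.kind not in variadic               # *args / **kwargs are optional
        and name not in accepted
    ]
    return accepted, missing
```

A caller would fall back to the proxy metric whenever the second element is non-empty, as `resolve_rate` does.
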
diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/agent_runner.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/agent_runner.py
new file mode 100644
index 00000000..a22e83f8
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/agent_runner.py
@@ -0,0 +1,217 @@
+"""A compact, mini-swe-agent-style agentic workload runner.
+
+This drives the served model through multi-turn, tool-using conversations over a
+slice of SWE-bench instances, exactly the long shared-prefix workload the task
+optimizes. It is intentionally a small, self-contained re-implementation of the
+mini-swe-agent loop so that per-instance end-to-end latency can be measured
+cleanly on the client side (no dependence on any server-side instrumentation).
+
+For each instance:
+  * messages start with a system prompt + the problem statement,
+  * the model replies with one ```bash``` action,
+  * the action runs in the instance sandbox, and its output is fed back,
+  * the loop ends when the model submits (a sentinel command) or hits a limit.
+
+Instances arrive over time as a Poisson process (``jps``) or under fixed
+concurrency (``workers``), so many conversations are in flight at once.
+"""
+
+from __future__ import annotations
+
+import re
+import threading
+import time
+from concurrent.futures import ThreadPoolExecutor
+from dataclasses import dataclass, field
+from typing import Any
+
+from .settings import EvalSettings, parse_slice
+from .sandbox import make_sandbox
+
+SUBMIT_SENTINEL = "VLLM_SERVING_OPT_SUBMIT"
+
+SYSTEM_PROMPT = (
+    "You are a software engineering agent fixing a bug in a code repository.\n"
+    "Your working directory is the repository root.\n"
+    "At each step, reply with exactly ONE shell command inside a single fenced\n"
+    "```bash\n...\n``` block. Do not include any other text.\n"
+    "Inspect files, make edits, and run any checks you need.\n"
+    f"When the fix is complete, run: echo {SUBMIT_SENTINEL}\n"
+    "and nothing else, to submit your changes."
+)
+
+INSTANCE_TEMPLATE = (
+    "Resolve the following issue in the repository.\n\n"
+    "<issue>\n{problem_statement}\n</issue>\n\n"
+    "Begin by exploring the repository. Respond with one ```bash``` command."
+)
+
+BASH_BLOCK_RE = re.compile(r"```bash\s*\n(.*?)\n```", re.DOTALL)
+
+
+@dataclass
+class InstanceResult:
+    instance_id: str
+    latency_seconds: float
+    n_calls: int
+    exit_status: str
+    patch: str = ""
+    error: str = ""
+    per_call_seconds: list[float] = field(default_factory=list)
+
+
+def load_instances(settings: EvalSettings, role: str) -> list[dict[str, Any]]:
+    from datasets import load_dataset
+
+    dataset = load_dataset(settings.dataset, split=settings.dataset_split)
+    ids = sorted(range(len(dataset)), key=lambda i: dataset[i]["instance_id"])
+    chosen = list(parse_slice(settings.slice_for_role(role), len(ids)))
+    instances: list[dict[str, Any]] = []
+    for index in chosen:
+        row = dataset[ids[index]]
+        instances.append(
+            {
+                "instance_id": row["instance_id"],
+                "problem_statement": row.get("problem_statement", ""),
+            }
+        )
+    return instances
+
+
+def _openai_client(base_url: str):
+    from openai import OpenAI
+
+    return OpenAI(base_url=base_url, api_key="EMPTY", timeout=900.0)
+
+
+def _parse_action(text: str) -> str | None:
+    match = BASH_BLOCK_RE.search(text or "")
+    if not match:
+        return None
+    return match.group(1).strip()
+
+
+def run_instance(
+    instance: dict[str, Any],
+    *,
+    base_url: str,
+    settings: EvalSettings,
+    prefer_docker: bool,
+) -> InstanceResult:
+    instance_id = instance["instance_id"]
+    client = _openai_client(base_url)
+    messages = [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": INSTANCE_TEMPLATE.format(problem_statement=instance["problem_statement"])},
+    ]
+    per_call: list[float] = []
+    exit_status = "incomplete"
+    sandbox = None
+    started = time.perf_counter()
+    try:
+        sandbox = make_sandbox(instance_id, prefer_docker=prefer_docker)
+        for _ in range(max(1, settings.step_limit)):
+            call_start = time.perf_counter()
+            try:
+                completion = client.chat.completions.create(
+                    model=settings.model,
+                    messages=messages,
+                    temperature=settings.temperature,
+                    max_tokens=settings.max_completion_tokens,
+                )
+            except Exception as exc:  # noqa: BLE001
+                exit_status = "api_error"
+                return InstanceResult(
+                    instance_id=instance_id,
+                    latency_seconds=time.perf_counter() - started,
+                    n_calls=len(per_call),
+                    exit_status=exit_status,
+                    error=type(exc).__name__,
+                    per_call_seconds=per_call,
+                )
+            per_call.append(time.perf_counter() - call_start)
+            content = completion.choices[0].message.content or ""
+            messages.append({"role": "assistant", "content": content})
+
+            action = _parse_action(content)
+            if action is None:
+                messages.append(
+                    {
+                        "role": "user",
+                        "content": "Reply with exactly one ```bash``` command block.",
+                    }
+                )
+                continue
+            if SUBMIT_SENTINEL in action:
+                exit_status = "submitted"
+                break
+
+            code, output = sandbox.run(action, timeout=60)
+            output = output[-4000:]
+            messages.append(
+                {"role": "user", "content": f"(exit={code})\n{output}"}
+            )
+        patch = sandbox.read_patch() if sandbox is not None else ""
+        if exit_status == "incomplete" and patch.strip():
+            exit_status = "limit_with_patch"
+        return InstanceResult(
+            instance_id=instance_id,
+            latency_seconds=time.perf_counter() - started,
+            n_calls=len(per_call),
+            exit_status=exit_status,
+            patch=patch,
+            per_call_seconds=per_call,
+        )
+    finally:
+        if sandbox is not None:
+            sandbox.close()
+
+
+def run_workload(
+    *,
+    base_url: str,
+    settings: EvalSettings,
+    role: str,
+    prefer_docker: bool,
+) -> list[InstanceResult]:
+    instances = load_instances(settings, role)
+    results: list[InstanceResult] = []
+    results_lock = threading.Lock()
+
+    def _record(result: InstanceResult) -> None:
+        with results_lock:
+            results.append(result)
+
+    def _run(instance: dict[str, Any]) -> None:
+        _record(run_instance(instance, base_url=base_url, settings=settings, prefer_docker=prefer_docker))
+
+    if settings.arrival_mode == "jps" and settings.jps > 0:
+        # Deterministic Poisson schedule (seeded) so arrivals are reproducible.
+        import random
+
+        rng = random.Random(20260604)
+        schedule: list[float] = []
+        clock = 0.0
+        for _ in instances:
+            clock += rng.expovariate(settings.jps)
+            schedule.append(clock)
+        threads: list[threading.Thread] = []
+        origin = time.perf_counter()
+
+        def _delayed(instance: dict[str, Any], when: float) -> None:
+            delay = when - (time.perf_counter() - origin)
+            if delay > 0:
+                time.sleep(delay)
+            _run(instance)
+
+        for instance, when in zip(instances, schedule):
+            thread = threading.Thread(target=_delayed, args=(instance, when), daemon=True)
+            thread.start()
+            threads.append(thread)
+        for thread in threads:
+            thread.join()
+    else:
+        with ThreadPoolExecutor(max_workers=max(1, settings.workers)) as pool:
+            list(pool.map(_run, instances))
+
+    return results
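
The seeded arrival schedule used in the `jps` branch of `run_workload` above can be reproduced in isolation. This is a minimal sketch: `poisson_schedule` is a hypothetical helper name, and the default seed simply mirrors the constant in the diff. Inter-arrival gaps are drawn from an exponential distribution with rate `jps` and accumulated into absolute offsets in seconds.

```python
from __future__ import annotations

import random


def poisson_schedule(n: int, jps: float, seed: int = 20260604) -> list[float]:
    """Arrival offsets (seconds from start) for n instances at rate jps per second."""
    if jps <= 0:
        raise ValueError("jps must be positive")
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    clock = 0.0
    schedule: list[float] = []
    for _ in range(n):
        clock += rng.expovariate(jps)  # exponential gap with mean 1 / jps
        schedule.append(clock)
    return schedule
```

Because the generator is seeded, the baseline and patched runs see the same arrival offsets, which is what makes per-instance latencies comparable across the two servers.
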
diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/correctness.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/correctness.py
new file mode 100644
index 00000000..e122466b
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/correctness.py
@@ -0,0 +1,57 @@
+"""Greedy-decoding correctness gate.
+
+A serving/scheduler optimization must not change what the model generates. At
+temperature 0 the patched server must reproduce the baseline's outputs exactly
+(compared as decoded text) on a small fixed prompt set. The baseline outputs are
+collected once (and cached alongside the baseline metrics); the patched outputs
+are then compared against that reference.
+
+The prompts are generic and benchmark-agnostic on purpose.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+from .settings import EvalSettings
+
+SMOKE_PROMPTS = (
+    "Write a Python function that returns the n-th Fibonacci number.",
+    "Explain what a hash map is in two sentences.",
+    "Reverse the string 'serving' and return only the result.",
+    "What is the time complexity of binary search? Answer in one line.",
+    "Write a one-line shell command to count lines in a file named data.txt.",
+    "Summarize the difference between a list and a tuple in Python.",
+    "Given the list [3,1,2], return it sorted ascending.",
+    "Write a regular expression that matches an IPv4 address.",
+    "Convert the decimal number 42 to binary.",
+    "Name three common HTTP status codes and what they mean.",
+    "Write a SQL query selecting all rows from a table named users.",
+    "What does the 'git rebase' command do? One sentence.",
+)
+
+
+def collect_greedy_outputs(base_url: str, *, settings: EvalSettings, n: int) -> dict[str, str]:
+    from openai import OpenAI
+
+    client = OpenAI(base_url=base_url, api_key="EMPTY", timeout=300.0)
+    prompts = list(SMOKE_PROMPTS)[: max(1, n)]
+    outputs: dict[str, str] = {}
+    for prompt in prompts:
+        completion = client.chat.completions.create(
+            model=settings.model,
+            messages=[{"role": "user", "content": prompt}],
+            temperature=0.0,
+            max_tokens=256,
+            seed=0,
+        )
+        outputs[prompt] = completion.choices[0].message.content or ""
+    return outputs
+
+
+def compare_outputs(reference: dict[str, str], candidate: dict[str, str]) -> tuple[bool, int]:
+    mismatches = 0
+    for prompt, reference_text in reference.items():
+        if candidate.get(prompt, "") != reference_text:
+            mismatches += 1
+    return mismatches == 0, mismatches
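
One edge case in `compare_outputs` above: when `reference` is empty the loop body never runs, so the function returns `(True, 0)` and the gate passes without comparing anything. Whether the baseline cache can ever lack greedy outputs depends on how it is produced, which this diff does not show. The sketch below is a stricter variant under that assumption. `compare_outputs_strict` is a hypothetical name, not part of this PR.

```python
from __future__ import annotations


def compare_outputs_strict(
    reference: dict[str, str], candidate: dict[str, str]
) -> tuple[bool, int]:
    """Like compare_outputs, but an empty reference fails instead of passing."""
    if not reference:
        # Nothing to compare against, so the gate cannot be considered passed.
        return False, 0
    mismatches = sum(
        1
        for prompt, reference_text in reference.items()
        # A prompt missing from the candidate counts as a mismatch, even when
        # the reference text is the empty string.
        if candidate.get(prompt) != reference_text
    )
    return mismatches == 0, mismatches
```
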
diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/measure.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/measure.py
new file mode 100644
index 00000000..1287b2f4
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/measure.py
@@ -0,0 +1,310 @@
+"""Orchestration: build, serve, and measure baseline vs patched vLLM.
+
+To honour the one-L40S-per-environment budget, the baseline (vanilla vLLM) and
+the patched build are never served at the same time. The baseline is read from a
+cache baked into the judge image when available; otherwise it is measured once
+(serving the clean tree on its own) and cached. The patched build is then served
+on its own, gated on greedy-output correctness against the cached baseline
+outputs, and measured under the identical workload and arrival schedule.
+"""
+
+from __future__ import annotations
+
+import json
+import shutil
+import subprocess
+import tempfile
+import uuid
+from pathlib import Path
+from typing import Any
+
+from .accuracy import compute_accuracy
+from .agent_runner import InstanceResult, run_workload
+from .correctness import collect_greedy_outputs, compare_outputs
+from .scoring import provisional_score
+from .sandbox import docker_available
+from .serving import ServerHandle, ServingError, deploy_server, stop_server, wait_healthy
+from .settings import EvalSettings
+
+
+def _short_id() -> str:
+    return uuid.uuid4().hex[:10]
+
+
+def _latency_map(results: list[InstanceResult]) -> dict[str, float]:
+    return {result.instance_id: result.latency_seconds for result in results}
+
+
+def _apply_patch(clean_source: str, patch_path: str) -> Path:
+    tmp_root = Path(tempfile.mkdtemp(prefix="vllm-serving-opt-src-"))
+    patched = tmp_root / "vllm"
+    # Keep .git: vLLM's build uses setuptools_scm for versioning, and applying the
+    # patch to a tracked tree leaves the changes in the working tree (an editable
+    # install then picks them up). The patch is applied with `git apply` below.
+    shutil.copytree(clean_source, patched, dirs_exist_ok=False)
+    patch_text = Path(patch_path).read_text(encoding="utf-8", errors="replace")
+    if patch_text.strip():
+        check = subprocess.run(
+            ["git", "apply", "--check", patch_path],
+            cwd=str(patched),
+            capture_output=True,
+            text=True,
+        )
+        if check.returncode != 0:
+            shutil.rmtree(tmp_root, ignore_errors=True)
+            raise ServingError("patch does not apply cleanly to the pinned vLLM source")
+        subprocess.run(["git", "apply", patch_path], cwd=str(patched), check=True, capture_output=True, text=True)
+    return patched
+
+
+def _serve(src: str, settings: EvalSettings, *, app_name: str, label: str) -> ServerHandle:
+    handle = deploy_server(
+        src_path=src,
+        model=settings.model,
+        gpu=settings.gpu,
+        app_name=app_name,
+        label=label,
+        scaledown_seconds=settings.modal_scaledown_seconds,
+        startup_timeout_seconds=settings.modal_startup_timeout_seconds,
+        build_timeout_seconds=settings.build_timeout_seconds,
+        deploy_retries=settings.modal_deploy_retries,
+    )
+    wait_healthy(handle, model=settings.model, timeout_seconds=settings.server_health_timeout_seconds)
+    return handle
+
+
+def _measure_server(
+    handle: ServerHandle,
+    settings: EvalSettings,
+    *,
+    role: str,
+    accuracy_mode: str,
+    prefer_docker: bool,
+    greedy_n: int,
+    run_id: str,
+) -> dict[str, Any]:
+    greedy = collect_greedy_outputs(handle.base_url, settings=settings, n=greedy_n)
+    results = run_workload(
+        base_url=handle.base_url,
+        settings=settings,
+        role=role,
+        prefer_docker=prefer_docker,
+    )
+    accuracy = compute_accuracy(results, settings=settings, mode=accuracy_mode, run_id=run_id)
+    return {
+        "per_instance_latency": _latency_map(results),
+        "accuracy": float(accuracy["accuracy"]),
+        "accuracy_mode": accuracy["mode"],
+        "accuracy_proxy_used": bool(accuracy["proxy_used"]),
+        "greedy_outputs": greedy,
+        "n_instances": len(results),
+    }
+
+
+def _load_baseline_cache(path: str, role: str) -> dict[str, Any] | None:
+    cache_path = Path(path)
+    if not cache_path.exists():
+        return None
+    try:
+        payload = json.loads(cache_path.read_text(encoding="utf-8"))
+    except Exception:
+        return None
+    if not isinstance(payload, dict):
+        return None
+    entry = payload.get(role)
+    return entry if isinstance(entry, dict) else None
+
+
+def _store_baseline_cache(path: str, role: str, entry: dict[str, Any]) -> None:
+    cache_path = Path(path)
+    try:
+        cache_path.parent.mkdir(parents=True, exist_ok=True)
+        payload: dict[str, Any] = {}
+        if cache_path.exists():
+            existing = json.loads(cache_path.read_text(encoding="utf-8"))
+            if isinstance(existing, dict):
+                payload = existing
+        payload[role] = entry
+        cache_path.write_text(json.dumps(payload), encoding="utf-8")
+    except Exception:
+        pass
+
+
+def _get_baseline(
+    clean_source: str,
+    settings: EvalSettings,
+    *,
+    role: str,
+    accuracy_mode: str,
+    baseline_cache_path: str,
+    prefer_docker: bool,
+) -> dict[str, Any]:
+    cached = _load_baseline_cache(baseline_cache_path, role)
+    if cached and cached.get("per_instance_latency"):
+        return cached
+
+    app_name = f"vllm-serv-opt-base-{role}-{_short_id()}"
+    handle = _serve(clean_source, settings, app_name=app_name, label=app_name)
+    try:
+        measured = _measure_server(
+            handle,
+            settings,
+            role=role,
+            accuracy_mode=accuracy_mode,
+            prefer_docker=prefer_docker,
+            greedy_n=settings.correctness_smoke_prompts,
+            run_id=f"baseline-{role}-{_short_id()}",
+        )
+    finally:
+        stop_server(app_name)
+    _store_baseline_cache(baseline_cache_path, role, measured)
+    return measured
+
+
+def run_measurement(
+    *,
+    patch_path: str,
+    role: str,
+    config: dict[str, Any] | None,
+    clean_source: str,
+    baseline_cache_path: str,
+) -> dict[str, Any]:
+    settings = EvalSettings.from_config(config)
+    accuracy_mode = settings.accuracy_mode_for_role(role)
+    prefer_docker = (role == "final") and docker_available()
+    info: dict[str, Any] = {"role": role, "accuracy_mode": accuracy_mode, "prefer_docker": prefer_docker}
+
+    try:
+        baseline = _get_baseline(
+            clean_source,
+            settings,
+            role=role,
+            accuracy_mode=accuracy_mode,
+            baseline_cache_path=baseline_cache_path,
+            prefer_docker=prefer_docker,
+        )
+    except ServingError as exc:
+        return {"ok": False, "gate": f"baseline serving failed: {exc}", "info": info}
+
+    patched_src: Path | None = None
+    app_name = f"vllm-serv-opt-patch-{role}-{_short_id()}"
+    try:
+        patched_src = _apply_patch(clean_source, patch_path)
+    except ServingError as exc:
+        return {"ok": False, "gate": str(exc), "info": info}
+
+    handle: ServerHandle | None = None
+    try:
+        try:
+            handle = _serve(str(patched_src), settings, app_name=app_name, label=app_name)
+        except ServingError as exc:
+            return {"ok": False, "gate": f"patched build/serve failed: {exc}", "info": info}
+
+        patched_greedy = collect_greedy_outputs(
+            handle.base_url, settings=settings, n=settings.correctness_smoke_prompts
+        )
+        correctness_ok, mismatches = compare_outputs(
+            baseline.get("greedy_outputs", {}), patched_greedy
+        )
+        info["greedy_mismatches"] = mismatches
+        if not correctness_ok:
+            return {
+                "ok": True,
+                "correctness_ok": False,
+                "info": info,
+                "baseline": {
+                    "per_instance_latency": baseline.get("per_instance_latency", {}),
+                    "accuracy": baseline.get("accuracy", 0.0),
+                },
+                "patched": {"per_instance_latency": {}, "accuracy": 0.0},
+            }
+
+        results = run_workload(
+            base_url=handle.base_url,
+            settings=settings,
+            role=role,
+            prefer_docker=prefer_docker,
+        )
+        patched_accuracy = compute_accuracy(
+            results, settings=settings, mode=accuracy_mode, run_id=f"patched-{role}-{_short_id()}"
+        )
+        info["baseline_accuracy_mode"] = baseline.get("accuracy_mode")
+        info["patched_accuracy_proxy_used"] = bool(patched_accuracy["proxy_used"])
+        info["instances"] = len(results)
+        return {
+            "ok": True,
+            "correctness_ok": True,
+            "info": info,
+            "baseline": {
+                "per_instance_latency": baseline.get("per_instance_latency", {}),
+                "accuracy": float(baseline.get("accuracy", 0.0)),
+            },
+            "patched": {
+                "per_instance_latency": _latency_map(results),
+                "accuracy": float(patched_accuracy["accuracy"]),
+            },
+        }
+    finally:
+        if handle is not None:
+            stop_server(app_name)
+        if patched_src is not None:
+            shutil.rmtree(patched_src.parent, ignore_errors=True)
+
+
+def run_public_test(
+    *,
+    src: str,
+    config: dict[str, Any] | None,
+    baseline_cache_path: str,
+) -> dict[str, Any]:
+    """Agent-facing public test: serve the working tree and report feedback."""
+    settings = EvalSettings.from_config(config)
+    role = "agent"
+    accuracy_mode = settings.accuracy_mode_for_role(role)
+    prefer_docker = docker_available()
+    app_name = f"vllm-serv-opt-public-{_short_id()}"
+
+    handle: ServerHandle | None = None
+    try:
+        handle = _serve(src, settings, app_name=app_name, label=app_name)
+        measured = _measure_server(
+            handle,
+            settings,
+            role=role,
+            accuracy_mode=accuracy_mode,
+            prefer_docker=prefer_docker,
+            greedy_n=settings.correctness_smoke_prompts,
+            run_id=f"public-{_short_id()}",
+        )
+    except ServingError as exc:
+        return {"ok": False, "error": str(exc)}
+    finally:
+        if handle is not None:
+            stop_server(app_name)
+
+    patched_latency = measured["per_instance_latency"]
+    result: dict[str, Any] = {
+        "ok": True,
+        "n_instances": measured["n_instances"],
+        "accuracy": measured["accuracy"],
+        "accuracy_mode": measured["accuracy_mode"],
+        "mean_latency_seconds": (
+            sum(patched_latency.values()) / len(patched_latency) if patched_latency else 0.0
+        ),
+        "per_instance_latency": patched_latency,
+    }
+
+    baseline = _load_baseline_cache(baseline_cache_path, role)
+    if baseline and baseline.get("per_instance_latency"):
+        provisional = provisional_score(
+            {str(k): float(v) for k, v in baseline["per_instance_latency"].items()},
+            {str(k): float(v) for k, v in patched_latency.items()},
+            float(baseline.get("accuracy", 0.0)),
+            float(measured["accuracy"]),
+            settings.accuracy_tolerance,
+        )
+        result["baseline_accuracy"] = float(baseline.get("accuracy", 0.0))
+        result["provisional"] = provisional
+    else:
+        result["note"] = "no baseline cache present; reporting raw latency/accuracy only"
+    return result
diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/modal_app.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/modal_app.py
new file mode 100644
index 00000000..c5f8a6b9
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/modal_app.py
@@ -0,0 +1,134 @@
+"""Modal app that serves a (patched or vanilla) vLLM build on one L40S GPU.
+
+This module is deployed with ``modal deploy serving_eval/modal_app.py``. It is
+parametrized entirely through environment variables so the same module serves
+both the baseline (clean) and patched source trees, under distinct app names:
+
+    VLLM_SERVING_SRC                absolute path to the vLLM source tree to build from
+    VLLM_SERVING_MODEL              Hugging Face model id to serve
+    VLLM_SERVING_GPU                Modal GPU string (default "L40S")
+    VLLM_SERVING_APP                Modal app name (must be unique per concurrent server)
+    VLLM_SERVING_LABEL              deterministic web label for a predictable URL
+    VLLM_SERVING_SCALEDOWN          idle seconds before the GPU container is released
+    VLLM_SERVING_STARTUP            seconds Modal waits for the server port to open
+    VLLM_SERVING_MAXLEN             vLLM --max-model-len
+    VLLM_SERVING_HF_SECRET          name of the Modal Secret holding HF_TOKEN
+    VLLM_SERVING_VERSION            pinned vLLM version for setuptools_scm (default 0.11.0)
+    VLLM_SERVING_PRECOMPILED_WHEEL  ABI-matched precompiled wheel URL to overlay
+    VLLM_SERVING_TRANSFORMERS       pinned transformers requirement spec
+
+The build uses VLLM_USE_PRECOMPILED=1 so only vLLM's Python layer is rebuilt
+from source; the prebuilt CUDA kernels are reused. This keeps per-submission
+image builds to minutes and enforces the task's Python-only patch policy.
+"""
+
+from __future__ import annotations
+
+import os
+
+import modal
+
+VLLM_SERVING_SRC = os.environ.get("VLLM_SERVING_SRC", "/opt/vllm-clean")
+VLLM_SERVING_MODEL = os.environ.get("VLLM_SERVING_MODEL", "meta-llama/Llama-3.1-8B-Instruct")
+VLLM_SERVING_GPU = os.environ.get("VLLM_SERVING_GPU", "L40S")
+VLLM_SERVING_APP = os.environ.get("VLLM_SERVING_APP", "vllm-serving-opt")
+VLLM_SERVING_LABEL = os.environ.get("VLLM_SERVING_LABEL", VLLM_SERVING_APP)
+VLLM_SERVING_SCALEDOWN = int(os.environ.get("VLLM_SERVING_SCALEDOWN", "900"))
+VLLM_SERVING_STARTUP = int(os.environ.get("VLLM_SERVING_STARTUP", "1200"))
+VLLM_SERVING_MAXLEN = int(os.environ.get("VLLM_SERVING_MAXLEN", "16384"))
+VLLM_SERVING_HF_SECRET = os.environ.get("VLLM_SERVING_HF_SECRET", "huggingface-secret")
+# Pinned vLLM version. The source tree is copied into the build image, where its
+# git metadata is not reliably readable by setuptools_scm (and a patched tree is
+# "dirty" anyway), so the version is provided explicitly to make the editable
+# install deterministic and independent of git state.
+VLLM_SERVING_VERSION = os.environ.get("VLLM_SERVING_VERSION", "0.11.0")
+# ABI-matched precompiled wheel for the pinned version. With VLLM_USE_PRECOMPILED,
+# vLLM's build picks the wheel by deriving a base commit from git; for a shallow/
+# detached source tree that derivation fails and it falls back to a *nightly*
+# wheel, whose compiled extensions are ABI-incompatible with the pinned source
+# and abort at engine init (std::bad_alloc). Pin the matching release wheel so the
+# overlaid .so files match the source.
+VLLM_SERVING_PRECOMPILED_WHEEL = os.environ.get(
+    "VLLM_SERVING_PRECOMPILED_WHEEL",
+    "https://files.pythonhosted.org/packages/47/33/"
+    "d19e0763c34392ec956534536fa837c060495bfff31ed83452135ea7608d/"
+    "vllm-0.11.0-cp38-abi3-manylinux1_x86_64.whl",
+)
+# vLLM 0.11.0 only lower-bounds transformers (>=4.55.2); pin the CI-tested version
+# so the resolver does not pull transformers 5.x (incompatible tokenizer API).
+VLLM_SERVING_TRANSFORMERS = os.environ.get("VLLM_SERVING_TRANSFORMERS", "transformers==4.55.2")
+VLLM_PORT = 8000
+REMOTE_SRC = "/src/vllm"
+
+# Persisted caches so weights are downloaded once and reused across cold starts.
+hf_cache_vol = modal.Volume.from_name("vllm-serving-opt-hf-cache", create_if_missing=True)
+vllm_cache_vol = modal.Volume.from_name("vllm-serving-opt-vllm-cache", create_if_missing=True)
+
+serving_image = (
+    modal.Image.from_registry("nvidia/cuda:12.9.0-devel-ubuntu22.04", add_python="3.12")
+    .entrypoint([])
+    .apt_install("git", "build-essential")
+    .pip_install("uv")
+    # Bake the source tree into the image at build time. copy=True is required
+    # because the next build step (editable install) runs against these files.
+    .add_local_dir(VLLM_SERVING_SRC, REMOTE_SRC, copy=True)
+    .run_commands(
+        # SETUPTOOLS_SCM_PRETEND_VERSION* bypasses git-based version detection,
+        # which fails for the copied (and possibly patched/dirty) source tree.
+        # VLLM_PRECOMPILED_WHEEL_LOCATION pins the ABI-matched release wheel (the
+        # default nightly fallback aborts at engine init). transformers is pinned
+        # to the CI-tested version; hf_transfer backs HF_HUB_ENABLE_HF_TRANSFER.
+        f"cd {REMOTE_SRC} && "
+        f"SETUPTOOLS_SCM_PRETEND_VERSION_FOR_VLLM={VLLM_SERVING_VERSION} "
+        f"SETUPTOOLS_SCM_PRETEND_VERSION={VLLM_SERVING_VERSION} "
+        f"VLLM_PRECOMPILED_WHEEL_LOCATION={VLLM_SERVING_PRECOMPILED_WHEEL} "
+        f"VLLM_USE_PRECOMPILED=1 uv pip install --system -e . "
+        f"'{VLLM_SERVING_TRANSFORMERS}' hf_transfer",
+    )
+    .env(
+        {
+            "HF_HUB_ENABLE_HF_TRANSFER": "1",
+            "DO_NOT_TRACK": "1",
+        }
+    )
+)
+
+app = modal.App(VLLM_SERVING_APP)
+
+
+def _hf_secrets() -> list[modal.Secret]:
+    try:
+        return [modal.Secret.from_name(VLLM_SERVING_HF_SECRET)]
+    except Exception:
+        return []
+
+
+@app.function(
+    image=serving_image,
+    gpu=VLLM_SERVING_GPU,
+    scaledown_window=VLLM_SERVING_SCALEDOWN,
+    timeout=24 * 60 * 60,
+    secrets=_hf_secrets(),
+    volumes={
+        "/root/.cache/huggingface": hf_cache_vol,
+        "/root/.cache/vllm": vllm_cache_vol,
+    },
+)
+@modal.concurrent(max_inputs=64)
+@modal.web_server(port=VLLM_PORT, startup_timeout=VLLM_SERVING_STARTUP, label=VLLM_SERVING_LABEL)
+def serve() -> None:
+    import subprocess
+
+    cmd = [
+        "vllm",
+        "serve",
+        VLLM_SERVING_MODEL,
+        "--host",
+        "0.0.0.0",
+        "--port",
+        str(VLLM_PORT),
+        "--max-model-len",
+        str(VLLM_SERVING_MAXLEN),
+        "--disable-log-requests",
+    ]
+    subprocess.Popen(" ".join(cmd), shell=True)
diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/sandbox.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/sandbox.py
new file mode 100644
index 00000000..5e31f58c
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/sandbox.py
@@ -0,0 +1,141 @@
+"""Per-instance shell sandbox for the agentic workload.
+
+Each SWE-bench instance runs the agent's shell commands in an isolated sandbox
+rooted at the repository working directory. Two backends are supported:
+
+* ``docker``  – the SWE-bench per-instance testbed image
+  (``swebench/sweb.eval.x86_64.<instance>``) with the repo checked out at
+  ``/testbed``. This is the faithful backend used by the judge when a Docker
+  daemon is reachable; it is also what makes a real resolve-rate possible.
+* ``local``   – an empty temporary git repository. Used for fast public-test
+  feedback (and CI) where Docker-in-Docker is unavailable. Commands run directly
+  on the host with the temp dir as cwd, so this backend provides no isolation.
+
+Both backends expose the same ``run(cmd) -> (exit_code, output)`` and
+``read_patch()`` interface so the agent loop is backend-agnostic.
+"""
+
+from __future__ import annotations
+
+import shutil
+import subprocess
+import tempfile
+import uuid
+from pathlib import Path
+
+
+def docker_available() -> bool:
+    if shutil.which("docker") is None:
+        return False
+    try:
+        subprocess.run(
+            ["docker", "info"],
+            check=True,
+            capture_output=True,
+            timeout=20,
+        )
+        return True
+    except Exception:
+        return False
+
+
+def swebench_image(instance_id: str) -> str:
+    key = instance_id.lower().replace("__", "_1776_")
+    return f"docker.io/swebench/sweb.eval.x86_64.{key}:latest"
+
+
+class Sandbox:
+    workdir = "/testbed"
+
+    def run(self, command: str, *, timeout: int) -> tuple[int, str]:
+        raise NotImplementedError
+
+    def read_patch(self) -> str:
+        raise NotImplementedError
+
+    def close(self) -> None:
+        pass
+
+
+class DockerSandbox(Sandbox):
+    def __init__(self, instance_id: str, *, command_timeout: int = 60) -> None:
+        self.instance_id = instance_id
+        self.command_timeout = command_timeout
+        self.container = f"vllm-serving-opt-{uuid.uuid4().hex[:12]}"
+        image = swebench_image(instance_id)
+        subprocess.run(
+            [
+                "docker",
+                "run",
+                "-d",
+                "--name",
+                self.container,
+                "--network",
+                "none",
+                "-w",
+                self.workdir,
+                image,
+                "sleep",
+                "infinity",
+            ],
+            check=True,
+            capture_output=True,
+            text=True,
+            timeout=600,
+        )
+
+    def run(self, command: str, *, timeout: int) -> tuple[int, str]:
+        try:
+            proc = subprocess.run(
+                ["docker", "exec", "-w", self.workdir, self.container, "bash", "-lc", command],
+                capture_output=True,
+                text=True,
+                timeout=timeout,
+            )
+        except subprocess.TimeoutExpired:
+            return 124, "command timed out"
+        return proc.returncode, (proc.stdout or "") + (proc.stderr or "")
+
+    def read_patch(self) -> str:
+        code, out = self.run("git add -A && git diff --cached", timeout=self.command_timeout)
+        return out if code == 0 else ""
+
+    def close(self) -> None:
+        subprocess.run(["docker", "rm", "-f", self.container], check=False, capture_output=True)
+
+
+class LocalSandbox(Sandbox):
+    def __init__(self, instance_id: str) -> None:
+        self.instance_id = instance_id
+        self._dir = tempfile.mkdtemp(prefix="vllm-serving-opt-sandbox-")
+        self.workdir = self._dir
+        subprocess.run(["git", "init", "-q"], cwd=self._dir, check=False, capture_output=True)
+
+    def run(self, command: str, *, timeout: int) -> tuple[int, str]:
+        try:
+            proc = subprocess.run(
+                ["bash", "-lc", command],
+                cwd=self._dir,
+                capture_output=True,
+                text=True,
+                timeout=timeout,
+            )
+        except subprocess.TimeoutExpired:
+            return 124, "command timed out"
+        return proc.returncode, (proc.stdout or "") + (proc.stderr or "")
+
+    def read_patch(self) -> str:
+        code, out = self.run("git add -A && git diff --cached", timeout=60)
+        return out if code == 0 else ""
+
+    def close(self) -> None:
+        shutil.rmtree(self._dir, ignore_errors=True)
+
+
+def make_sandbox(instance_id: str, *, prefer_docker: bool, command_timeout: int = 60) -> Sandbox:
+    if prefer_docker and docker_available():
+        try:
+            return DockerSandbox(instance_id, command_timeout=command_timeout)
+        except Exception:
+            pass
+    return LocalSandbox(instance_id)
diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/scoring.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/scoring.py
new file mode 100644
index 00000000..586fdcbd
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/scoring.py
@@ -0,0 +1,60 @@
+"""Scoring math shared with the agent-facing public test.
+
+The judge's authoritative scorer lives in the task evaluator; these helpers
+mirror it so the public test can show a provisional score consistent with how
+the judge will grade. Keep the two in sync.
+"""
+
+from __future__ import annotations
+
+import math
+
+
+def geometric_mean(values: list[float]) -> float:
+    if not values:
+        return 0.0
+    return math.exp(sum(math.log(max(value, 1e-9)) for value in values) / len(values))
+
+
+def paired_speedups(baseline: dict[str, float], patched: dict[str, float]) -> list[float]:
+    speedups: list[float] = []
+    for instance_id, patched_value in patched.items():
+        base_value = baseline.get(instance_id)
+        if base_value is None or patched_value <= 0 or base_value <= 0:
+            continue
+        speedups.append(max(base_value / patched_value, 0.01))
+    return speedups
+
+
+def score_from_speedup(speedup: float) -> float:
+    if speedup <= 0:
+        return 0.0
+    return max(0.0, min(100.0, 100.0 * math.log(speedup, 2)))
+
+
+def accuracy_multiplier(baseline_accuracy: float, patched_accuracy: float, tolerance: float) -> float:
+    base = max(baseline_accuracy, 1e-9)
+    rel_drop = max(0.0, (baseline_accuracy - patched_accuracy) / base)
+    if rel_drop <= tolerance:
+        return 1.0
+    return max(0.0, min(1.0, tolerance / rel_drop))
+
+
+def provisional_score(
+    baseline_latency: dict[str, float],
+    patched_latency: dict[str, float],
+    baseline_accuracy: float,
+    patched_accuracy: float,
+    tolerance: float,
+) -> dict[str, float]:
+    speedups = paired_speedups(baseline_latency, patched_latency)
+    gm = geometric_mean(speedups) if speedups else 0.0
+    latency_score = score_from_speedup(gm)
+    acc_mult = accuracy_multiplier(baseline_accuracy, patched_accuracy, tolerance)
+    return {
+        "latency_geomean_speedup": gm,
+        "latency_score": latency_score,
+        "accuracy_multiplier": acc_mult,
+        "score": max(0.0, min(100.0, latency_score * acc_mult)),
+        "instances_scored": float(len(speedups)),
+    }
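The scoring rule in `scoring.py` can be checked by hand with a small worked example. The sketch below re-derives the same arithmetic in a standalone function; the helper name `provisional`, the instance ids, and the latency values are hypothetical, and the sketch omits the 0.01 floor on per-instance speedups for brevity.

```python
import math


def provisional(baseline, patched, base_acc, new_acc, tolerance=0.05):
    """Re-derive the provisional score from per-instance latencies (sketch)."""
    # Pair instances present in both runs; speedup = baseline / patched.
    speedups = [
        baseline[k] / v
        for k, v in patched.items()
        if k in baseline and v > 0 and baseline[k] > 0
    ]
    if not speedups:
        return 0.0
    # Geometric mean of the paired speedups.
    gm = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    # 100 * log2(speedup), clamped to [0, 100]: a 2x speedup saturates the score.
    latency_score = max(0.0, min(100.0, 100.0 * math.log2(gm)))
    # Relative accuracy drop; no penalty inside the tolerance band.
    rel_drop = max(0.0, (base_acc - new_acc) / max(base_acc, 1e-9))
    multiplier = 1.0 if rel_drop <= tolerance else tolerance / rel_drop
    return latency_score * multiplier


baseline = {"a": 120.0, "b": 300.0}  # hypothetical instance ids and latencies (s)
patched = {"a": 80.0, "b": 200.0}    # 1.5x faster on both instances
within = provisional(baseline, patched, 0.50, 0.48)  # 4% relative drop
beyond = provisional(baseline, patched, 0.50, 0.40)  # 20% relative drop
```

With these inputs the geometric-mean speedup is 1.5, so `within` is `100 * log2(1.5)`, roughly 58.5. In the second case the 20% relative drop is four times the 5% tolerance, so the multiplier is 0.25 and `beyond` is a quarter of `within`.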
diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/serving.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/serving.py
new file mode 100644
index 00000000..93d51f01
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/serving.py
@@ -0,0 +1,200 @@
+"""Deploy, health-check, and tear down a Modal-hosted vLLM server.
+
+The judge and the public test both build a Modal image from a vLLM source tree
+and serve it on an L40S. This module wraps that lifecycle:
+
+    deploy_server(...)  -> ServerHandle(base_url, app_name, label)
+    wait_healthy(...)
+    stop_server(...)
+
+Deployment shells out to the ``modal`` CLI in a fresh process whose environment
+selects the source tree / model / app name (see modal_app.py). The public URL
+is then resolved through the Modal SDK.
+"""
+
+from __future__ import annotations
+
+import os
+import subprocess
+import time
+import urllib.error
+import urllib.request
+from dataclasses import dataclass
+from pathlib import Path
+
+MODAL_APP_MODULE = str(Path(__file__).with_name("modal_app.py"))
+
+# Substrings that mark a transient Modal control-plane / build failure (image
+# build evicted, app stopped mid-deploy, gateway timeout) rather than a real
+# build error in the patched source. These are safe to retry.
+_TRANSIENT_MODAL_MARKERS = (
+    "external shut-down",
+    "terminated due to external",
+    "please try again",
+    "app_state_stopped",
+    "conflicterror",
+    "eat_timeout",
+    "deadline exceeded",
+    "connection reset",
+    "502 bad gateway",
+    "503 service",
+    "temporarily unavailable",
+    "timed out",
+)
+
+
+def _is_transient_modal_error(text: str) -> bool:
+    lowered = (text or "").lower()
+    return any(marker in lowered for marker in _TRANSIENT_MODAL_MARKERS)
+
+
+@dataclass
+class ServerHandle:
+    base_url: str  # OpenAI base, e.g. https://...modal.run/v1
+    app_name: str
+    label: str
+
+
+class ServingError(RuntimeError):
+    pass
+
+
+def _server_env(
+    *,
+    src_path: str,
+    model: str,
+    gpu: str,
+    app_name: str,
+    label: str,
+    scaledown_seconds: int,
+    startup_timeout_seconds: int,
+) -> dict[str, str]:
+    env = dict(os.environ)
+    env.update(
+        {
+            "VLLM_SERVING_SRC": src_path,
+            "VLLM_SERVING_MODEL": model,
+            "VLLM_SERVING_GPU": gpu,
+            "VLLM_SERVING_APP": app_name,
+            "VLLM_SERVING_LABEL": label,
+            "VLLM_SERVING_SCALEDOWN": str(scaledown_seconds),
+            "VLLM_SERVING_STARTUP": str(startup_timeout_seconds),
+        }
+    )
+    return env
+
+
+def _resolve_web_url(app_name: str) -> str:
+    import modal
+
+    fn = modal.Function.from_name(app_name, "serve")
+    url = fn.get_web_url()
+    if not url:
+        raise ServingError("deployed Modal function does not expose a web URL")
+    return url.rstrip("/")
+
+
+def deploy_server(
+    *,
+    src_path: str,
+    model: str,
+    gpu: str,
+    app_name: str,
+    label: str,
+    scaledown_seconds: int,
+    startup_timeout_seconds: int,
+    build_timeout_seconds: int,
+    deploy_retries: int = 3,
+) -> ServerHandle:
+    env = _server_env(
+        src_path=src_path,
+        model=model,
+        gpu=gpu,
+        app_name=app_name,
+        label=label,
+        scaledown_seconds=scaledown_seconds,
+        startup_timeout_seconds=startup_timeout_seconds,
+    )
+    # Modal's control plane / image builder occasionally evicts a build under load
+    # (concurrent deploys, transient gateway errors). Those failures are unrelated
+    # to the patched source, so retry them with a short backoff; a genuine build
+    # error in the patch is non-transient and fails fast.
+    attempts = max(1, deploy_retries)
+    last_error = "modal deploy failed"
+    for attempt in range(1, attempts + 1):
+        try:
+            subprocess.run(
+                ["modal", "deploy", MODAL_APP_MODULE],
+                env=env,
+                check=True,
+                capture_output=True,
+                text=True,
+                timeout=build_timeout_seconds,
+            )
+            base = _resolve_web_url(app_name)
+            return ServerHandle(base_url=f"{base}/v1", app_name=app_name, label=label)
+        except subprocess.TimeoutExpired:
+            last_error = "modal deploy timed out"
+            transient = True
+        except subprocess.CalledProcessError as exc:
+            # Surface only the last 600 characters of output (truncated, not sanitized).
+            tail = (exc.stderr or exc.stdout or "")[-600:]
+            last_error = f"modal deploy failed: {tail}"
+            transient = _is_transient_modal_error(tail)
+        except ServingError as exc:
+            # _resolve_web_url failed (app not fully registered yet); treat as transient.
+            last_error = str(exc)
+            transient = True
+
+        if not transient or attempt == attempts:
+            raise ServingError(last_error)
+
+        # Clear any half-created/stopped app state, then back off before retrying.
+        try:
+            subprocess.run(
+                ["modal", "app", "stop", app_name],
+                check=False,
+                capture_output=True,
+                text=True,
+                timeout=120,
+            )
+        except Exception:
+            pass
+        time.sleep(min(45, 10 * attempt))
+
+    raise ServingError(last_error)
+
+
+def wait_healthy(handle: ServerHandle, *, model: str, timeout_seconds: int) -> None:
+    """Block until the server answers /v1/models, or raise on timeout."""
+    deadline = time.monotonic() + timeout_seconds
+    models_url = f"{handle.base_url}/models"
+    last_error: Exception | None = None
+    while time.monotonic() < deadline:
+        try:
+            req = urllib.request.Request(models_url, headers={"Authorization": "Bearer EMPTY"})
+            with urllib.request.urlopen(req, timeout=10) as response:
+                if response.status == 200:
+                    return
+        except urllib.error.HTTPError as exc:
+            if exc.code in (401, 403):
+                return  # server is up; auth shape differs
+            last_error = exc
+        except Exception as exc:  # noqa: BLE001
+            last_error = exc
+        time.sleep(5)
+    raise ServingError(f"server did not become healthy within {timeout_seconds}s: {last_error}")
+
+
+def stop_server(app_name: str) -> None:
+    try:
+        subprocess.run(
+            ["modal", "app", "stop", app_name],
+            check=False,
+            capture_output=True,
+            text=True,
+            timeout=120,
+        )
+    except Exception:
+        # Best-effort teardown; idle containers also scale to zero on their own.
+        pass
diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/settings.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/settings.py
new file mode 100644
index 00000000..9cd45496
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/settings.py
@@ -0,0 +1,117 @@
+"""Configuration for the vLLM serving evaluation harness.
+
+A single :class:`EvalSettings` is built from the task ``evaluation`` config block
+(passed in from the evaluator) with environment-variable fallbacks. The same
+settings drive the judge-side measurement and the agent-side public test.
+"""
+
+from __future__ import annotations
+
+import os
+from dataclasses import dataclass, field
+from typing import Any
+
+
+def _as_int(value: Any, default: int) -> int:
+    try:
+        return int(value)
+    except Exception:
+        return default
+
+
+def _as_float(value: Any, default: float) -> float:
+    try:
+        return float(value)
+    except Exception:
+        return default
+
+
+@dataclass
+class EvalSettings:
+    model: str = "meta-llama/Llama-3.1-8B-Instruct"
+    gpu: str = "L40S"
+    dataset: str = "princeton-nlp/SWE-bench_Verified"
+    dataset_split: str = "test"
+    public_slice: str = "0:5"
+    eval_slice: str = "0:30"
+    arrival_mode: str = "jps"
+    jps: float = 0.5
+    workers: int = 8
+    step_limit: int = 50
+    temperature: float = 0.0
+    max_completion_tokens: int = 2048
+    accuracy_tolerance: float = 0.05
+    agent_accuracy_mode: str = "patch_validity"
+    final_accuracy_mode: str = "resolve_rate"
+    # Docker Hub namespace for prebuilt SWE-bench eval images (real resolve_rate).
+    swebench_namespace: str = "swebench"
+    correctness_smoke_prompts: int = 8
+    modal_scaledown_seconds: int = 900
+    modal_startup_timeout_seconds: int = 1200
+    modal_deploy_retries: int = 3
+    server_health_timeout_seconds: int = 1800
+    build_timeout_seconds: int = 5400
+    instance_timeout_seconds: int = 1200
+    extra: dict[str, Any] = field(default_factory=dict)
+
+    @classmethod
+    def from_config(cls, config: dict[str, Any] | None) -> "EvalSettings":
+        config = dict(config or {})
+        return cls(
+            model=str(config.get("model", cls.model)),
+            gpu=str(config.get("gpu", cls.gpu)),
+            dataset=str(config.get("dataset", cls.dataset)),
+            dataset_split=str(config.get("dataset_split", cls.dataset_split)),
+            public_slice=str(config.get("public_slice", cls.public_slice)),
+            eval_slice=str(config.get("eval_slice", cls.eval_slice)),
+            arrival_mode=str(config.get("arrival_mode", cls.arrival_mode)),
+            jps=_as_float(config.get("jps"), cls.jps),
+            workers=_as_int(config.get("workers"), cls.workers),
+            step_limit=_as_int(config.get("step_limit"), cls.step_limit),
+            temperature=_as_float(config.get("temperature"), cls.temperature),
+            max_completion_tokens=_as_int(config.get("max_completion_tokens"), cls.max_completion_tokens),
+            accuracy_tolerance=_as_float(config.get("accuracy_tolerance"), cls.accuracy_tolerance),
+            agent_accuracy_mode=str(config.get("agent_accuracy_mode", cls.agent_accuracy_mode)),
+            final_accuracy_mode=str(config.get("final_accuracy_mode", cls.final_accuracy_mode)),
+            swebench_namespace=str(config.get("swebench_namespace", cls.swebench_namespace)),
+            correctness_smoke_prompts=_as_int(
+                config.get("correctness_smoke_prompts"), cls.correctness_smoke_prompts
+            ),
+            modal_scaledown_seconds=_as_int(config.get("modal_scaledown_seconds"), cls.modal_scaledown_seconds),
+            modal_deploy_retries=_as_int(config.get("modal_deploy_retries"), cls.modal_deploy_retries),
+            modal_startup_timeout_seconds=_as_int(
+                config.get("modal_startup_timeout_seconds"), cls.modal_startup_timeout_seconds
+            ),
+            server_health_timeout_seconds=_as_int(
+                config.get("server_health_timeout_seconds"), cls.server_health_timeout_seconds
+            ),
+            build_timeout_seconds=_as_int(config.get("build_timeout_seconds"), cls.build_timeout_seconds),
+            instance_timeout_seconds=_as_int(config.get("instance_timeout_seconds"), cls.instance_timeout_seconds),
+            extra=config,
+        )
+
+    def slice_for_role(self, role: str) -> str:
+        return self.eval_slice if role == "final" else self.public_slice
+
+    def accuracy_mode_for_role(self, role: str) -> str:
+        return self.final_accuracy_mode if role == "final" else self.agent_accuracy_mode
+
+
+def parse_slice(spec: str, length: int) -> range:
+    """Parse a ``start:stop`` spec into a clamped index range; malformed specs select all."""
+    spec = (spec or "").strip()
+    if not spec:
+        return range(length)
+    parts = spec.split(":")
+    try:
+        start = int(parts[0]) if parts[0] else 0
+        stop = int(parts[1]) if len(parts) > 1 and parts[1] else length
+    except ValueError:
+        return range(length)
+    start = max(0, min(start, length))
+    stop = max(start, min(stop, length))
+    return range(start, stop)
+
+
+def modal_available() -> bool:
+    return bool(os.environ.get("MODAL_TOKEN_ID") and os.environ.get("MODAL_TOKEN_SECRET"))
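The clamping behavior of `parse_slice` is easiest to see on a few edge cases. The sketch below reproduces the function from the diff above so that it runs standalone; the example specs and lengths are illustrative choices, not values taken from the task config.

```python
def parse_slice(spec: str, length: int) -> range:
    """Standalone copy of parse_slice from settings.py, for illustration."""
    spec = (spec or "").strip()
    if not spec:
        return range(length)
    parts = spec.split(":")
    try:
        start = int(parts[0]) if parts[0] else 0
        stop = int(parts[1]) if len(parts) > 1 and parts[1] else length
    except ValueError:
        return range(length)
    start = max(0, min(start, length))
    stop = max(start, min(stop, length))
    return range(start, stop)


public = parse_slice("0:5", 30)      # first five instances
clamped = parse_slice("0:30", 12)    # stop is clamped to the dataset length
inverted = parse_slice("10:2", 30)   # stop < start yields an empty range
negative = parse_slice("-3:", 30)    # negative start is clamped to 0
malformed = parse_slice("abc", 30)   # unparseable spec falls back to everything
```

Two behaviors are worth noting. Negative indices are clamped to zero rather than counted from the end as in Python slicing, and a malformed spec silently selects the whole dataset instead of raising.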