diff --git a/2.0/README.md b/2.0/README.md
index b414435d..ea2deeed 100644
--- a/2.0/README.md
+++ b/2.0/README.md
@@ -47,6 +47,21 @@ applies the submitted patch to a clean skeleton, runs a hidden arena against
 multiple baseline bot families, and scores by mean baseline win rate with a
 small faster-win tiebreak. The online generals.io service is not used.
 
+## vLLM LLM-Serving Optimization
+
+This systems problem asks agents to patch a clean upstream vLLM checkout to
+reduce the end-to-end latency of an LLM serving system on a multi-turn agentic
+workload, while keeping accuracy near a baseline. Its problem ID is
+`vllm_llm_serving_optimization`. The served model is
+`meta-llama/Llama-3.1-8B-Instruct` on a single Modal L40S, and the workload is a
+mini-swe-agent SWE-bench run. The agent submits a Python-only patch and can run
+an async public test (a subset of the final eval set) that returns real latency
+and accuracy feedback. Scoring is the geometric-mean latency speedup versus a
+vanilla-vLLM baseline, gated by an accuracy guardrail: accuracy within 5% of the
+baseline does not affect the score, and beyond that the score decays
+inverse-proportionally with the accuracy drop. Like duckdb-e2e, the agent and
+judge run in separate Docker environments.
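Reviewer note: the scoring rule summarized in the paragraph above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authoritative scorer (that is `evaluator.py`). The function name `serving_score` is hypothetical, the `log2` mapping and the 0.01 floor are taken from the formulas in `DESIGN.md`, and the handling of a non-positive baseline accuracy is an assumption of this sketch. Inputs are assumed to be non-empty lists of positive latencies.

```python
import math


def serving_score(baseline_latencies, patched_latencies,
                  baseline_accuracy, patched_accuracy, tolerance=0.05):
    """Hypothetical helper: geomean latency speedup gated by accuracy."""
    # Per-instance speedup, floored at 0.01 as described in DESIGN.md.
    speedups = [max(b / p, 0.01)
                for b, p in zip(baseline_latencies, patched_latencies)]
    geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    # 1.0x maps to 0 points and 2.0x to 100 points; regressions clip to 0.
    latency_score = min(max(100.0 * math.log2(geomean), 0.0), 100.0)
    # Assumption of this sketch: a non-positive baseline accuracy means "no drop".
    if baseline_accuracy > 0:
        rel_drop = max(0.0, (baseline_accuracy - patched_accuracy) / baseline_accuracy)
    else:
        rel_drop = 0.0
    # Within tolerance there is no penalty; beyond it the multiplier decays as 1/drop.
    if rel_drop <= tolerance:
        acc_mult = 1.0
    else:
        acc_mult = min(max(tolerance / rel_drop, 0.0), 1.0)
    return min(max(latency_score * acc_mult, 0.0), 100.0)
```

For instance, a uniform 2.0x speedup with accuracy falling from 0.50 to 0.40 (a 20% relative drop) gives a multiplier of 0.25 and a score of about 25, while the same speedup with unchanged accuracy scores about 100.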
+
 ## BBOPlace ISPD2005
 
 This VLSI placement problem asks agents to generate macro placement candidates
diff --git a/2.0/problems/vllm_llm_serving_optimization/.dockerignore b/2.0/problems/vllm_llm_serving_optimization/.dockerignore
new file mode 100644
index 00000000..eedfcafa
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/.dockerignore
@@ -0,0 +1,8 @@
+**/__pycache__
+**/*.pyc
+harbor/app/.public_test
+docs
+docker/README.md
+*.md
+reference.patch
+harbor/app/solution.patch
diff --git a/2.0/problems/vllm_llm_serving_optimization/DESIGN.md b/2.0/problems/vllm_llm_serving_optimization/DESIGN.md
new file mode 100644
index 00000000..1711269c
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/DESIGN.md
@@ -0,0 +1,250 @@
+# vLLM LLM-Serving Optimization — Design & Operations
+
+A Frontier-CS **2.0** systems task. The agent patches a **clean upstream vLLM
+v0.11.0** checkout (Python-only) to reduce the **end-to-end latency** of an LLM
+serving system on a multi-turn agentic workload, while keeping task-solving
+**accuracy** close to a vanilla-vLLM baseline. The served model is
+`meta-llama/Llama-3.1-8B-Instruct` on a single **NVIDIA L40S** provisioned
+on-demand through [Modal](https://modal.com/docs).
+
+> **Validated end-to-end (2026-06-11):** a full Harbor trial with the `codex`
+> agent (`gpt-5.5`) produced a real **1.79× latency geomean speedup** over the
+> baseline at full eval scale (30 SWE-bench instances), accuracy preserved →
+> **score 83.89 / 100**.
+
+---
+
+## 1. Current Setting
+
+All knobs live in `config.yaml` (`evaluation` block) and are baked into the
+judge/agent images as `task_config.json`.
+
+| Parameter | Value | Notes |
+|---|---|---|
+| Served model | `meta-llama/Llama-3.1-8B-Instruct` | gated; HF token required |
+| Serving GPU | **1× NVIDIA L40S** (via Modal) | one GPU per environment |
+| Workload | mini-swe-agent on `princeton-nlp/SWE-bench_Verified` (split `test`) | multi-turn, shared-prefix conversations |
+| Arrival | Poisson, `jps = 0.5` jobs/s | concurrent in-flight conversations |
+| `public_slice` (agent role) | `0:5` | iterative self-test subset |
+| `eval_slice` (final role) | `0:30` | full verification; superset of public |
+| Decoding | `temperature = 0`, `max_completion_tokens = 2048` | greedy, deterministic |
+| `step_limit` | 50 | per-instance agent steps |
+| Accuracy (agent role) | `patch_validity` | cheap proxy for iterative feedback |
+| Accuracy (final role) | `resolve_rate` | **real** SWE-bench resolved fraction — judge mounts the host Docker socket (DooD) and runs the swebench harness against prebuilt testbed images; falls back to `patch_validity` only if no Docker daemon is reachable |
+| `accuracy_tolerance` | `0.05` | ≤5% relative drop ⇒ no penalty |
+| `correctness_smoke_prompts` | 8 | greedy outputs must match baseline token-for-token |
+| Build timeout / per-instance timeout | 5400 s / 1200 s | |
+| Submission | file `/app/solution.patch` (git diff vs `/app/vllm`), `max_queue_size = 2` | async |
+| Container budget | 8 vCPU, 32 GiB RAM, 64 GiB storage | agent **and** judge; GPU is remote on Modal |
+
+**Two roles, two scales.** *Agent role* (iterative `submit.sh` / `public_test`)
+uses `public_slice` + `patch_validity`; *final role* (the Harbor verifier) uses
+`eval_slice` + `resolve_rate`. The public subset is a strict subset of the final
+set, so the self-test is a fast, faithful proxy.
+
+---
+
+## 2. Scoring
+
+The judge serves **baseline (vanilla vLLM)** and the **patched build** on the
+same L40S, under the same workload and the same arrival schedule, and measures
+per-instance end-to-end latency (arrival of an instance's first request →
+completion of its last response), client-side.
+
+**Hard gates → score 0** (checked before any timing):
+1. **Patch policy** (see §3) — disallowed file, non-Python, secret access, or
+   benchmark hard-coding.
+2. **Build** — the patched source must build on Modal (`VLLM_USE_PRECOMPILED`).
+3. **Server health** — `/v1/models` must come up.
+4. **Correctness** — the patched server's greedy outputs must match the baseline
+   **token-for-token** at `temperature 0` on a small smoke set. An optimization
+   must not change what the model generates.
+
+**Latency score** (primary objective — geometric mean of per-instance speedups):
+```
+per_instance_speedup[i] = baseline_latency[i] / patched_latency[i]   # floored at 0.01
+latency_speedup = geomean(per_instance_speedup)
+latency_score   = clip(100 * log2(latency_speedup), 0, 100)
+```
+`1.0×` → 0 points, `2.0×` → 100 points, regressions → 0. Geomean rewards broad
+speedups over a single large outlier.
+
+**Accuracy guardrail** (multiplier):
+```
+rel_drop = max(0, (baseline_accuracy - patched_accuracy) / baseline_accuracy)
+acc_mult = 1.0                         if rel_drop <= 0.05   # within 5% → no penalty
+acc_mult = clip(0.05 / rel_drop, 0, 1) otherwise             # inverse-proportional decay
+```
+
+**Final score**:
+```
+score  = clip(latency_score * acc_mult, 0, 100)
+reward = score / 100   # Harbor reward.txt
+```
+A fast build that degrades task quality loses most of its score; a build within
+5% of baseline accuracy is scored purely on its latency improvement.
+
+Authoritative scorer: `evaluator.py` (`full_evaluation`); `serving_eval/scoring.py`
+mirrors it for the agent-side public test's provisional score. When the serving
+stack is unconfigured (no Modal/clean source, e.g. local CI), the evaluator
+returns a `1.0` smoke score so the empty reference patch passes.
+
+---
+
+## 3. Which vLLM files the model may change (Patch Policy)
+
+The patch is validated **before** building. Build uses `VLLM_USE_PRECOMPILED=1`,
+so **only Python source is allowed** (`.py`, `.pyi`); no CUDA/C++, build-system,
+packaging, or dependency changes. New Python files inside allowed areas are OK.
+
+**Strongly allowed** (core scheduling / batching / KV-cache):
+```
+vllm/v1/core/**
+vllm/v1/core/sched/**
+vllm/v1/core/kv_cache_utils.py
+vllm/config/scheduler.py
+vllm/config/cache.py
+```
+
+**Conditionally allowed** (narrow wiring around the engine / request path):
+```
+vllm/v1/worker/** vllm/v1/engine/** vllm/v1/executor/**
+vllm/v1/request.py vllm/v1/outputs.py vllm/v1/serial_utils.py
+vllm/entrypoints/openai/protocol.py
+vllm/entrypoints/openai/serving_engine.py
+vllm/entrypoints/openai/serving_chat.py
+vllm/entrypoints/openai/serving_completion.py
+vllm/sampling_params.py
+```
+
+**Denied** (rejected outright):
+```
+csrc/** cmake/** CMakeLists.txt setup.py setup.cfg pyproject.toml
+requirements/** requirements*.txt
+tests/** benchmarks/** docs/** examples/** tools/** .github/** docker/** Dockerfile*
+vllm/model_executor/models/** vllm/model_executor/model_loader/**
+vllm/transformers_utils/** vllm/lora/** vllm/distributed/**
+vllm/entrypoints/llm.py vllm/entrypoints/api_server.py vllm/entrypoints/cli/**
+vllm/version.py vllm/_version.py
+```
+
+**Also rejected:** reading/writing judge/Modal/HF/Frontier/Harbor environment
+variables (`MODAL_TOKEN*`, `HF_TOKEN`, `FRONTIER_*`, `HARBOR_*`, `JUDGE_URL`,
+`RUN_OUTPUT_DIR`, scheduler-timestamp leakage), and hard-coding the benchmark /
+dataset / instance ids / judge paths (`swebench`, `princeton-nlp`,
+`SWE-bench_Verified`, `minisweagent`, …). The server is launched under a fixed
+config; patches that detect the benchmark, sleep, short-circuit generation, or
+otherwise special-case the evaluation are rejected.
+
+> **In practice:** the intended optimization area is *online serving efficiency*
+> — request scheduling, batching, KV-cache management, prefix/prompt-cache reuse,
+> preemption/admission control, queueing, and closely related scheduler/execution
+> wiring. The validated 1.79× run was a single-file change to
+> `vllm/v1/core/sched/scheduler.py`. (Candidate variants during the run also
+> touched `vllm/v1/core/kv_cache_utils.py`, `vllm/v1/core/kv_cache_manager.py`,
+> and `vllm/config/scheduler.py` — all within the allowlist.)
+
+---
+
+## 4. GPU resource management & scheduling (Modal)
+
+**No local GPU.** The agent and judge containers are CPU-only clients
+(8 vCPU / 32 GiB). The single L40S is provisioned **on-demand on Modal** and is
+the *only* place the model runs. This is what makes the agent/judge split cheap
+to host.
+
+### Image build (per submission)
+`serving_eval/modal_app.py` defines a Modal app parametrized entirely via env
+vars (so the same module serves baseline and patched trees):
+- Base `nvidia/cuda:12.9.0-devel-ubuntu22.04` (+ Python 3.12, `uv`).
+- `add_local_dir(, /src/vllm, copy=True)` bakes the **target source tree**
+  into the image (`copy=True` is required because the next step installs from it).
+- `VLLM_USE_PRECOMPILED=1 uv pip install --system -e .` — reuses vLLM's prebuilt
+  CUDA kernels and rebuilds only the Python layer ⇒ per-submission builds are
+  minutes, not an hour, and the **Python-only patch policy is enforced by
+  construction**.
+- Pinned for reproducibility on a shallow/patched tree:
+  `SETUPTOOLS_SCM_PRETEND_VERSION*` (version detection), a pinned
+  `VLLM_PRECOMPILED_WHEEL_LOCATION` (ABI-matched release wheel — the default
+  derivation falls back to an incompatible nightly), `transformers==4.55.2`
+  (the unpinned upper bound otherwise resolves to an incompatible 5.x), and
+  `hf_transfer`.
+
+### Serving
+```python
+@app.function(gpu="L40S", scaledown_window=900, secrets=[huggingface-secret],
+              volumes={hf_cache, vllm_cache})
+@modal.concurrent(max_inputs=64)
+@modal.web_server(port=8000, startup_timeout=...)
+def serve(): subprocess.Popen("vllm serve --host 0.0.0.0 --port 8000 ...")
+```
+- `gpu="L40S"` requests exactly one L40S; `@modal.concurrent(64)` lets one
+  warm container handle many in-flight requests (matching the Poisson workload).
+- `@modal.web_server` exposes vLLM's OpenAI endpoint at a stable
+  `https://…modal.run/v1`; Modal cold-starts the container on first request and
+  serves within `startup_timeout`.
+- **Persisted caches:** a `huggingface` Volume (weights downloaded once, reused
+  across cold starts) and a `vllm` cache Volume.
+- `scaledown_window=900` releases the idle GPU after 15 min — you pay for GPU
+  only while serving/measuring.
+
+### Lifecycle & scheduling (`serving_eval/serving.py`)
+```
+deploy_server() → `modal deploy modal_app.py` (env selects src/model/app-name)
+                → Function.from_name(app, "serve").get_web_url()
+wait_healthy()  → poll /v1/models until 200
+... run workload ...
+stop_server()   → `modal app stop `
+```
+- **One L40S per environment is honored by serializing:** baseline and patched
+  are **never served concurrently**. The baseline is measured once and cached
+  (`/opt/vllm-baseline/baseline_metrics.json`); the patched build is then served
+  on its own and its greedy outputs are compared against the cached baseline.
+- **Transient-failure retry:** Modal occasionally evicts an image build under
+  load (`Image build terminated due to external shut-down`, `APP_STATE_STOPPED`,
+  gateway timeouts). `deploy_server` retries such transient deploys with backoff
+  (`deploy_retries`, default 3), running `modal app stop` between attempts; a
+  genuine build error in the patch is non-transient and fails fast.
+- Auth inside the containers is env-var based (`MODAL_TOKEN_ID` /
+  `MODAL_TOKEN_SECRET`); gated Llama weights are pulled inside the Modal serving
+  container via the Modal Secret `huggingface-secret` (key `HF_TOKEN`).
+
+### Where Modal is used from
+Both the **agent's async public test** (`harbor/app/public_test.py` →
+`serving_eval.run_public_test`) and the **judge's measurement**
+(`evaluator.py` → `serving_eval.run_measurement`) drive Modal the same way, so
+the iterative feedback the agent sees is the same kind the judge grades on.
+
+### Real resolve-rate (Docker-out-of-Docker) — separate from the GPU
+
+Accuracy is *task-solving* quality, not a GPU concern: the **CPU-side** SWE-bench
+evaluation runs locally, not on Modal. For the final role the judge mounts the
+**host Docker socket** (`/var/run/docker.sock`) so it can run two things against
+real per-instance testbeds:
+- the **workload sandbox** (`serving_eval/sandbox.py` `DockerSandbox`) — the
+  agent's shell commands execute inside `swebench/sweb.eval.x86_64.`
+  at `/testbed` (network-isolated), instead of the `LocalSandbox` fallback;
+- the **resolve harness** (`serving_eval/accuracy.py` → `swebench.harness.
+  run_evaluation`, `namespace="swebench"`, `modal=False`) — pulls the prebuilt
+  eval image, applies the model's patch, runs the repo's `FAIL_TO_PASS` tests,
+  and reports the **resolved fraction** (`proxy_used=False`).
+
+These testbed containers run as **siblings on the host daemon**, fully separate
+from the Modal L40S that serves the model. Cost note: each eval image is
+~2–8 GB and a resolve takes ~2 min/instance, so a full `eval_slice 0:30`
+resolve pulls ~100+ GB of images. Without the socket (e.g. local CI) the judge
+auto-degrades to `patch_validity` and flags `proxy_used=True`.
+
+---
+
+## File map
+
+```
+config.yaml     resources, model, L40S, dataset, eval knobs (→ task_config.json)
+readme          public problem statement (no algorithm hints)
+evaluator.py    patch policy + scoring + orchestration (+ local smoke degrade)
+serving_eval/   settings · modal_app · serving · sandbox · agent_runner ·
+                accuracy · correctness · scoring · measure
+docker/         agent + judge Dockerfiles, build/smoke scripts
+harbor/app/     make_submission.sh, public_test client
+```
diff --git a/2.0/problems/vllm_llm_serving_optimization/config.yaml b/2.0/problems/vllm_llm_serving_optimization/config.yaml
new file mode 100644
index 00000000..88a65a10
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/config.yaml
@@ -0,0 +1,79 @@
+tag: systems
+runtime:
+  language: python
+  timeout_seconds: 21600
+  environment: "Patched vLLM (v0.11.0) source; Modal L40S GPU serving Llama-3.1-8B-Instruct; mini-swe-agent SWE-bench workload; latency-primary judge with accuracy guardrail"
+  apt_packages:
+    - bash
+    - ca-certificates
+    - curl
+    - git
+    - python3
+    - python3-pip
+  judge_apt_packages:
+    - bash
+    - ca-certificates
+    - curl
+    - git
+    - python3
+    - python3-pip
+  judge_pip_packages:
+    - modal
+    - openai
+    - datasets
+    - huggingface-hub
+  docker:
+    # Experimental local images. Build them with
+    # 2.0/problems/vllm_llm_serving_optimization/docker/build_images.sh before running a
+    # local Harbor trial. Both images need a clean upstream vLLM v0.11.0 checkout
+    # (NOT the continuum fork). The judge image additionally vendors the
+    # mini-swe-agent harness and the latency/accuracy scorer.
+    image: frontiercs/vllm-serving-optimization-agent:experimental-v0.11.0
+    judge_image: frontiercs/vllm-serving-optimization-judge:experimental-v0.11.0
+environment:
+  cpus: 8
+  memory_mb: 32768
+  storage_mb: 65536
+  build_timeout_seconds: 7200
+evaluation:
+  # Model + accelerator served on Modal (one L40S per environment).
+  model: meta-llama/Llama-3.1-8B-Instruct
+  gpu: L40S
+  # Workload: mini-swe-agent on SWE-bench Verified (split test).
+  dataset: princeton-nlp/SWE-bench_Verified
+  dataset_split: test
+  # Iterative (agent-role) public test: a strict subset of the final eval set.
+  public_slice: "0:5"
+  # Final (verifier-role) evaluation: superset of the public slice.
+  eval_slice: "0:30"
+  # Poisson arrival workload (jobs/second). Mirrors a realistic serving load.
+  arrival_mode: jps
+  jps: 0.5
+  workers: 8
+  step_limit: 50
+  temperature: 0.0
+  max_completion_tokens: 2048
+  # Latency aggregation + scoring.
+  latency_metric: mean_e2e_seconds
+  # Accuracy guardrail. Within `accuracy_tolerance` relative drop of baseline =>
+  # no penalty; beyond it the score decays inverse-proportionally.
+  accuracy_tolerance: 0.05
+  agent_accuracy_mode: patch_validity
+  final_accuracy_mode: resolve_rate
+  # Greedy-output correctness smoke (a handful of fixed prompts must match the
+  # baseline token-for-token at temperature 0 before timing is considered).
+  correctness_smoke_prompts: 8
+  # Modal serving knobs.
+  modal_scaledown_seconds: 900
+  modal_startup_timeout_seconds: 1200
+  server_health_timeout_seconds: 1800
+  # Per-phase wall-clock budgets (seconds).
+  build_timeout_seconds: 5400
+  instance_timeout_seconds: 1200
+  # Use a baseline (vanilla vLLM) cached in the judge image when available,
+  # otherwise the judge serves vanilla once and caches it for the trial.
+  baseline_cache_path: /opt/vllm-baseline/baseline_metrics.json
+submission:
+  kind: file
+  path: /app/solution.patch
+  max_queue_size: 2
diff --git a/2.0/problems/vllm_llm_serving_optimization/docker/README.md b/2.0/problems/vllm_llm_serving_optimization/docker/README.md
new file mode 100644
index 00000000..1de1534b
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/docker/README.md
@@ -0,0 +1,74 @@
+# Experimental vLLM Serving-Optimization Images
+
+This task needs two images, mirroring the duckdb-e2e split: a public **agent**
+image and a private **judge** image. Both bundle a clean upstream vLLM checkout
+and the shared `serving_eval` harness. Build them before running a local Harbor
+trial:
+
+```bash
+bash 2.0/problems/vllm_llm_serving_optimization/docker/build_images.sh
+```
+
+Defaults:
+
+```text
+VLLM_REF=v0.11.0
+AGENT_TAG=frontiercs/vllm-serving-optimization-agent:experimental-v0.11.0
+JUDGE_TAG=frontiercs/vllm-serving-optimization-judge:experimental-v0.11.0
+```
+
+The agent image contains:
+
+```text
+/app/vllm           # clean upstream vLLM (no continuum, no reference fix)
+/opt/serving_eval   # shared harness, used by the async public test
+/opt/vllm-baseline  # optional precomputed baseline cache
+```
+
+The judge image contains:
+
+```text
+/opt/vllm-clean     # clean upstream vLLM (build + baseline reference)
+/opt/serving_eval   # shared harness, used by the evaluator
+/opt/vllm-baseline  # baseline-metrics cache (filled on first measurement)
+```
+
+## Runtime requirements (important)
+
+Unlike duckdb-e2e, this task does **not** run the model inside the container.
+Both the agent public test and the judge serve `meta-llama/Llama-3.1-8B-Instruct`
+on a **Modal L40S** built from the (patched) vLLM source. The containers
+therefore need:
+
+- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` in the environment (Modal auth). Use a
+  Modal service-user token for unattended runs.
+- A Modal Secret named `huggingface-secret` containing `HF_TOKEN` with access to
+  the gated Llama-3.1 weights (`modal secret create huggingface-secret HF_TOKEN=...`).
+  The container also reads `HF_TOKEN` for the `datasets` download.
+- The judge additionally needs a reachable Docker daemon (mounted socket or
+  DinD) to run the SWE-bench per-instance testbeds for the final resolve-rate.
+  When no daemon is reachable, the harness falls back to a local sandbox and the
+  patch-validity accuracy proxy.
+
+The Modal image build uses `VLLM_USE_PRECOMPILED=1`, so only vLLM's Python layer
+is rebuilt from the submitted source (minutes, not a full CUDA compile). This is
+why the patch policy is Python-only.
+
+## Baseline cache
+
+The judge measures the vanilla (clean-tree) baseline once per role and caches it
+at `/opt/vllm-baseline/baseline_metrics.json`, keyed by role (`agent` / `final`).
+Baseline and patched builds are never served simultaneously, so a single L40S is
+sufficient per environment. To precompute and bake the baseline into the image
+(recommended for faster trials), run the harness against the clean tree offline
+and copy the resulting `baseline_metrics.json` into the image at that path.
+
+## Smoke test
+
+```bash
+bash 2.0/problems/vllm_llm_serving_optimization/docker/smoke_images.sh
+```
+
+This checks that the clean vLLM checkout, the `serving_eval` package, and the
+Modal/OpenAI/datasets (and, for the judge, swebench + docker) clients are
+importable. It does not exercise Modal or a GPU.
diff --git a/2.0/problems/vllm_llm_serving_optimization/docker/agent/Dockerfile b/2.0/problems/vllm_llm_serving_optimization/docker/agent/Dockerfile
new file mode 100644
index 00000000..d06522f5
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/docker/agent/Dockerfile
@@ -0,0 +1,49 @@
+# Agent base image for the vLLM LLM-serving optimization task.
+#
+# Provides a CLEAN upstream vLLM checkout (no continuum / no reference solution)
+# for the agent to modify, the shared serving/eval harness used by the async
+# public test, and the Modal + OpenAI + datasets clients. The Frontier-CS 2.0
+# adapter builds the final agent image ON TOP of this one and copies the harbor
+# app scripts (make_submission.sh, public_test.sh, ...) into /app.
+#
+# Build context must be the task directory (so `serving_eval` is visible):
+#   docker build -f docker/agent/Dockerfile -t .
+FROM ubuntu:24.04
+
+ARG VLLM_REF=v0.11.0
+ARG DEBIAN_FRONTEND=noninteractive
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        bash \
+        build-essential \
+        ca-certificates \
+        curl \
+        git \
+        python3 \
+        python3-pip \
+        python3-venv \
+        ripgrep && \
+    rm -rf /var/lib/apt/lists/*
+
+RUN pip3 install --break-system-packages --no-cache-dir \
+        modal \
+        openai \
+        "datasets>=2.19" \
+        huggingface-hub
+
+# Clean upstream vLLM source for the agent to modify. This is intentionally the
+# vanilla vLLM project at a pinned tag: the optimization must be re-derived by
+# the agent, not copied from any reference implementation.
+RUN git clone --branch "${VLLM_REF}" --depth 1 https://github.com/vllm-project/vllm.git /app/vllm
+
+# Shared serving/eval harness (importable as `serving_eval` from /opt) used by
+# the async public test client.
+COPY serving_eval /opt/serving_eval
+
+# Optional baseline-metrics cache. When present (precomputed on the clean tree),
+# the public test reports a provisional speedup/score; otherwise it reports raw
+# latency and accuracy only.
+RUN mkdir -p /opt/vllm-baseline
+
+WORKDIR /app
diff --git a/2.0/problems/vllm_llm_serving_optimization/docker/build_images.sh b/2.0/problems/vllm_llm_serving_optimization/docker/build_images.sh
new file mode 100755
index 00000000..a097e0e0
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/docker/build_images.sh
@@ -0,0 +1,26 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+TASK_DIR=$(cd "$SCRIPT_DIR/.." && pwd)
+
+VLLM_REF="${VLLM_REF:-v0.11.0}"
+AGENT_TAG="${AGENT_TAG:-frontiercs/vllm-serving-optimization-agent:experimental-v0.11.0}"
+JUDGE_TAG="${JUDGE_TAG:-frontiercs/vllm-serving-optimization-judge:experimental-v0.11.0}"
+
+# Build context is the task directory so the Dockerfiles can COPY serving_eval.
+docker build \
+  --build-arg "VLLM_REF=$VLLM_REF" \
+  -f "$TASK_DIR/docker/agent/Dockerfile" \
+  -t "$AGENT_TAG" \
+  "$TASK_DIR"
+
+docker build \
+  --build-arg "VLLM_REF=$VLLM_REF" \
+  -f "$TASK_DIR/docker/judge/Dockerfile" \
+  -t "$JUDGE_TAG" \
+  "$TASK_DIR"
+
+echo "Built:"
+echo " $AGENT_TAG"
+echo " $JUDGE_TAG"
diff --git a/2.0/problems/vllm_llm_serving_optimization/docker/judge/Dockerfile b/2.0/problems/vllm_llm_serving_optimization/docker/judge/Dockerfile
new file mode 100644
index 00000000..d7e702b7
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/docker/judge/Dockerfile
@@ -0,0 +1,49 @@
+# Judge base image for the vLLM LLM-serving optimization task.
+#
+# Contains the clean upstream vLLM source (the build/baseline reference), the
+# shared serving/eval harness, the SWE-bench resolve-rate harness, the Modal +
+# OpenAI + datasets clients, and the Docker CLI for per-instance SWE-bench
+# testbeds. The Frontier-CS 2.0 adapter builds the final judge image ON TOP of
+# this one (copying judge_server.py + problem_evaluator.py into /judge).
+#
+# Build context must be the task directory (so `serving_eval` is visible):
+#   docker build -f docker/judge/Dockerfile -t .
+FROM ubuntu:24.04
+
+ARG VLLM_REF=v0.11.0
+ARG DEBIAN_FRONTEND=noninteractive
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        bash \
+        build-essential \
+        ca-certificates \
+        curl \
+        git \
+        python3 \
+        python3-pip \
+        python3-venv \
+        ripgrep \
+        docker.io && \
+    rm -rf /var/lib/apt/lists/*
+
+RUN pip3 install --break-system-packages --no-cache-dir \
+        modal \
+        openai \
+        "datasets>=2.19" \
+        huggingface-hub \
+        "swebench>=3.0"
+
+# Clean upstream vLLM source: the build/serve baseline and the tree the patch is
+# applied to. Must match the agent image's pinned tag.
+RUN git clone --branch "${VLLM_REF}" --depth 1 https://github.com/vllm-project/vllm.git /opt/vllm-clean
+
+# Shared serving/eval harness (importable as `serving_eval` from /opt).
+COPY serving_eval /opt/serving_eval
+
+# Baseline-metrics cache directory. The judge measures the vanilla baseline once
+# (or reads a precomputed cache here) and never serves baseline + patched at the
+# same time, honouring the one-L40S-per-environment budget.
+RUN mkdir -p /opt/vllm-baseline
+
+WORKDIR /judge
diff --git a/2.0/problems/vllm_llm_serving_optimization/docker/smoke_images.sh b/2.0/problems/vllm_llm_serving_optimization/docker/smoke_images.sh
new file mode 100755
index 00000000..6718dbc0
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/docker/smoke_images.sh
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+AGENT_TAG="${AGENT_TAG:-frontiercs/vllm-serving-optimization-agent:experimental-v0.11.0}"
+JUDGE_TAG="${JUDGE_TAG:-frontiercs/vllm-serving-optimization-judge:experimental-v0.11.0}"
+
+echo "[agent] checking $AGENT_TAG"
+docker run --rm "$AGENT_TAG" sh -lc '
+  test -d /app/vllm/.git
+  git -C /app/vllm rev-parse HEAD >/dev/null
+  test -f /opt/serving_eval/__init__.py
+  python3 -c "import sys; sys.path.insert(0, \"/opt\"); import serving_eval; print(serving_eval.__version__)"
+  python3 -c "import modal, openai, datasets"
+'
+
+echo "[judge] checking $JUDGE_TAG"
+docker run --rm "$JUDGE_TAG" sh -lc '
+  test -d /opt/vllm-clean/.git
+  git -C /opt/vllm-clean rev-parse HEAD >/dev/null
+  test -f /opt/serving_eval/__init__.py
+  python3 -c "import sys; sys.path.insert(0, \"/opt\"); import serving_eval; print(serving_eval.__version__)"
+  python3 -c "import modal, openai, datasets, swebench"
+  command -v docker >/dev/null
+'
diff --git a/2.0/problems/vllm_llm_serving_optimization/evaluate.sh b/2.0/problems/vllm_llm_serving_optimization/evaluate.sh
new file mode 100755
index 00000000..6f518849
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/evaluate.sh
@@ -0,0 +1,16 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+
+if [[ $# -gt 0 ]]; then
+  exec python3 "$SCRIPT_DIR/evaluator.py" "$@"
+fi
+
+SOLUTION="/work/execution_env/solution_env/solution.patch"
+if [[ ! -f "$SOLUTION" ]]; then
+  echo "Error: Missing $SOLUTION" >&2
+  exit 1
+fi
+
+python3 "$SCRIPT_DIR/evaluator.py" "$SOLUTION"
diff --git a/2.0/problems/vllm_llm_serving_optimization/evaluator.py b/2.0/problems/vllm_llm_serving_optimization/evaluator.py
new file mode 100644
index 00000000..4efdd4d3
--- /dev/null
+++ b/2.0/problems/vllm_llm_serving_optimization/evaluator.py
@@ -0,0 +1,508 @@
+"""Evaluator for the experimental vLLM LLM-serving latency optimization task.
+
+The agent submits a Python-only patch against a clean upstream vLLM v0.11.0
+checkout. The judge applies the patch, builds and serves the patched vLLM on a
+Modal L40S (``meta-llama/Llama-3.1-8B-Instruct``), runs a mini-swe-agent
+SWE-bench workload, and scores latency speedup vs a vanilla-vLLM baseline gated
+by an accuracy guardrail.
+
+This file contains the self-contained static patch policy and the scoring math.
+The heavy orchestration (Modal deploy, workload run, baseline caching) lives in
+the ``serving_eval`` package that is baked into the judge image. When that
+harness or the serving credentials are not available (for example a local CI
+smoke test), the evaluator validates the patch policy and returns a smoke score
+so the empty reference patch still passes.
+"""
+
+from __future__ import annotations
+
+import fnmatch
+import hashlib
+import json
+import math
+import os
+import re
+import sys
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+MAX_PATCH_BYTES = 1_500_000
+MAX_CHANGED_FILES = 80
+TASK_CONFIG_PATH = Path("/judge/task_config.json")
+SERVING_EVAL_ROOT = "/opt"
+DEFAULT_CLEAN_SOURCE = Path("/opt/vllm-clean")
+
+
+def _load_task_config() -> dict[str, Any]:
+    try:
+        payload = json.loads(TASK_CONFIG_PATH.read_text(encoding="utf-8"))
+    except Exception:
+        return {}
+    return payload if isinstance(payload, dict) else {}
+
+
+TASK_CONFIG = _load_task_config()
+EVALUATION_CONFIG = (
+    TASK_CONFIG.get("evaluation", {}) if isinstance(TASK_CONFIG.get("evaluation"), dict) else {}
+)
+
+
+def _config_value(name: str, default: Any) -> Any:
+    return EVALUATION_CONFIG.get(name, default)
+
+
+def _config_int(name: str, default: int) -> int:
+    try:
+        return int(EVALUATION_CONFIG.get(name, default))
+    except Exception:
+        return default
+
+
+def _config_float(name: str, default: float) -> float:
+    try:
+        return float(EVALUATION_CONFIG.get(name, default))
+    except Exception:
+        return default
+
+
+def _config_str(name: str, default: str) -> str:
+    raw = EVALUATION_CONFIG.get(name, default)
+    return str(raw)
+
+
+ACCURACY_TOLERANCE = _config_float("accuracy_tolerance", 0.05)
+BASELINE_CACHE_PATH = Path(_config_str("baseline_cache_path", "/opt/vllm-baseline/baseline_metrics.json"))
+
+# --------------------------------------------------------------------------- #
+# Patch policy
+# --------------------------------------------------------------------------- #
+
+STRONGLY_ALLOWED_PATTERNS = (
+    "vllm/v1/core/**",
+    "vllm/v1/core/sched/**",
+    "vllm/v1/core/kv_cache_utils.py",
+    "vllm/config/scheduler.py",
+    "vllm/config/cache.py",
+)
+
+CONDITIONALLY_ALLOWED_PATTERNS = (
+    "vllm/v1/worker/**",
+    "vllm/v1/engine/**",
+    "vllm/v1/executor/**",
+    "vllm/v1/request.py",
+    "vllm/v1/outputs.py",
+    "vllm/v1/serial_utils.py",
+    "vllm/entrypoints/openai/protocol.py",
+    "vllm/entrypoints/openai/serving_engine.py",
+    "vllm/entrypoints/openai/serving_chat.py",
+    "vllm/entrypoints/openai/serving_completion.py",
+    "vllm/sampling_params.py",
+)
+
+DENIED_PATTERNS = (
+    "csrc/**",
+    "cmake/**",
+    "CMakeLists.txt",
+    "setup.py",
+    "setup.cfg",
+    "pyproject.toml",
+    "requirements/**",
+    "requirements*.txt",
+    "tests/**",
+    "benchmarks/**",
+    "benchmark/**",
+    "docs/**",
+    "examples/**",
+    "tools/**",
+    ".buildkite/**",
+    ".github/**",
+    "docker/**",
+    "Dockerfile*",
+    ".dockerignore",
+    "vllm/model_executor/models/**",
+    "vllm/model_executor/model_loader/**",
+    "vllm/transformers_utils/**",
+    "vllm/lora/**",
+    "vllm/distributed/**",
+    "vllm/entrypoints/llm.py",
+    "vllm/entrypoints/api_server.py",
+    "vllm/entrypoints/cli/**",
+    "vllm/version.py",
+    "vllm/_version.py",
+)
+
+# Source-line tokens that are forbidden in added code: benchmark/dataset names
+# (anti hard-coding) and secret/judge environment access (anti exfiltration and
+# anti benchmark-detection).
+HARD_CODE_TOKENS = (
+    "swebench",
+    "swe-bench",
+    "swe_bench",
+    "sweb.eval",
+    "minisweagent",
+    "mini-swe",
+    "princeton-nlp",
+    "SWE-bench_Verified",
+)
+
+SECRET_TOKENS = (
+    "MODAL_TOKEN",
+    "MODAL_TOKEN_ID",
+    "MODAL_TOKEN_SECRET",
+    "HF_TOKEN",
+    "HUGGING_FACE_HUB_TOKEN",
+    "HUGGINGFACEHUB",
+    "FRONTIER_",
+    "HARBOR_",
+    "JUDGE_URL",
+    "RUN_OUTPUT_DIR",
+    "scheduler_timestamps",
+)
+
+HARD_CODE_TOKEN_RE = re.compile(
+    "|".join(re.escape(token) for token in HARD_CODE_TOKENS),
+    re.IGNORECASE,
+)
+SECRET_TOKEN_RE = re.compile("|".join(re.escape(token) for token in SECRET_TOKENS))
+
+# Patches must stay Python-only because the Modal build uses VLLM_USE_PRECOMPILED
+# (prebuilt CUDA kernels, Python layer rebuilt). New .py/.pyi files are allowed.
+ALLOWED_SOURCE_EXTENSIONS = (".py", ".pyi")
+
+
+@dataclass(frozen=True)
+class PatchFile:
+    old_path: str
+    new_path: str
+    added_lines: tuple[str, ...]
+ removed_lines: tuple[str, ...] + + @property + def path(self) -> str: + return self.new_path if self.new_path != "/dev/null" else self.old_path + + +def _match(path: str, patterns: tuple[str, ...]) -> bool: + return any(fnmatch.fnmatch(path, pattern) for pattern in patterns) + + +def _is_allowed_source_path(path: str) -> bool: + return _match(path, STRONGLY_ALLOWED_PATTERNS) or _match(path, CONDITIONALLY_ALLOWED_PATTERNS) + + +def _invalid(message: str, metrics: dict[str, Any] | None = None): + payload = metrics or {} + payload.setdefault("valid_patch", 0) + return 0.0, 0.0, message, payload + + +def _parse_patch(text: str) -> list[PatchFile]: + files: list[PatchFile] = [] + current_old = "" + current_new = "" + added: list[str] = [] + removed: list[str] = [] + in_file = False + + for line in text.splitlines(): + if line.startswith("diff --git "): + if in_file: + files.append(PatchFile(current_old, current_new, tuple(added), tuple(removed))) + in_file = True + current_old = "" + current_new = "" + added = [] + removed = [] + continue + if not in_file: + continue + if line.startswith("--- "): + current_old = line[4:].strip() + if current_old.startswith("a/"): + current_old = current_old[2:] + continue + if line.startswith("+++ "): + current_new = line[4:].strip() + if current_new.startswith("b/"): + current_new = current_new[2:] + continue + if line.startswith("+") and not line.startswith("+++ "): + added.append(line[1:]) + continue + if line.startswith("-") and not line.startswith("--- "): + removed.append(line[1:]) + + if in_file: + files.append(PatchFile(current_old, current_new, tuple(added), tuple(removed))) + return files + + +def _validate_patch_path(path: str, metrics: dict[str, Any]) -> tuple[bool, str]: + if not path or path == "/dev/null": + return True, "" + if path.startswith("/") or ".." 
in Path(path).parts: + return False, f"unsafe patch path: {path}" + if _match(path, DENIED_PATTERNS): + return False, f"changed file is outside task boundary: {path}" + if not _is_allowed_source_path(path): + return False, f"changed file is not allowlisted: {path}" + if not path.endswith(ALLOWED_SOURCE_EXTENSIONS): + return False, f"only Python source changes are allowed (VLLM_USE_PRECOMPILED build): {path}" + return True, "" + + +def validate_patch(patch_path: Path) -> tuple[bool, str, dict[str, Any]]: + if not patch_path.exists(): + return False, "solution patch does not exist", {} + size = patch_path.stat().st_size + if size > MAX_PATCH_BYTES: + return False, f"patch is too large ({size} bytes > {MAX_PATCH_BYTES})", {} + text = patch_path.read_text(encoding="utf-8", errors="replace") + patch_hash = hashlib.sha256(text.encode("utf-8", errors="replace")).hexdigest() + files = _parse_patch(text) + metrics: dict[str, Any] = { + "patch_bytes": size, + "patch_sha256": patch_hash, + "changed_files": len(files), + } + if len(files) > MAX_CHANGED_FILES: + return False, f"too many changed files ({len(files)} > {MAX_CHANGED_FILES})", metrics + + for patch_file in files: + path = patch_file.path + if patch_file.new_path == "/dev/null": + return False, f"deleting source files is outside task boundary: {patch_file.old_path}", metrics + if patch_file.old_path != "/dev/null" and patch_file.old_path != patch_file.new_path: + ok, error = _validate_patch_path(patch_file.old_path, metrics) + if not ok: + return False, f"rename/copy source is outside task boundary: {error}", metrics + + ok, error = _validate_patch_path(path, metrics) + if not ok: + return False, error, metrics + if not path or path == "/dev/null": + return False, "could not determine changed path from patch", metrics + + added_text = "\n".join(patch_file.added_lines) + secret_match = SECRET_TOKEN_RE.search(added_text) + if secret_match: + return False, f"{path}: judge/secret environment access is forbidden 
({secret_match.group(0)})", metrics + hard_code_match = HARD_CODE_TOKEN_RE.search(added_text) + if hard_code_match: + return False, f"{path}: benchmark-specific token is forbidden ({hard_code_match.group(0)})", metrics + + metrics["valid_patch"] = 1 + return True, "patch accepted by static policy", metrics + + +def patch_is_empty(metrics: dict[str, Any]) -> bool: + return int(metrics.get("changed_files", 0)) == 0 + + +# --------------------------------------------------------------------------- # +# Scoring +# --------------------------------------------------------------------------- # + +def geometric_mean(values: list[float]) -> float: + if not values: + return 0.0 + return math.exp(sum(math.log(max(value, 1e-9)) for value in values) / len(values)) + + +def score_from_speedup(speedup: float) -> float: + if speedup <= 0: + return 0.0 + raw = 100.0 * math.log(speedup, 2) + return max(0.0, min(100.0, raw)) + + +def accuracy_multiplier(baseline_accuracy: float, patched_accuracy: float, tolerance: float) -> float: + base = max(baseline_accuracy, 1e-9) + rel_drop = max(0.0, (baseline_accuracy - patched_accuracy) / base) + if rel_drop <= tolerance: + return 1.0 + return max(0.0, min(1.0, tolerance / rel_drop)) + + +def paired_speedups( + baseline_latency: dict[str, float], + patched_latency: dict[str, float], +) -> list[float]: + speedups: list[float] = [] + for instance_id, patched_value in patched_latency.items(): + base_value = baseline_latency.get(instance_id) + if base_value is None or patched_value <= 0 or base_value <= 0: + continue + speedups.append(max(base_value / patched_value, 0.01)) + return speedups + + +# --------------------------------------------------------------------------- # +# Error sanitization (black-box safety) +# --------------------------------------------------------------------------- # + +def sanitize_error_text(text: str) -> str: + text = re.sub(r"/tmp/[A-Za-z0-9_./-]+", "", text) + text = re.sub(r"https://[A-Za-z0-9_.:/-]+", "", text) + 
for token in SECRET_TOKENS: + text = text.replace(token, "") + text = re.sub(r"hf_[A-Za-z0-9]+", "", text) + text = re.sub(r"ak-[A-Za-z0-9]+", "", text) + return text[-800:] + + +# --------------------------------------------------------------------------- # +# Serving harness integration +# --------------------------------------------------------------------------- # + +def _serving_harness_available() -> bool: + if not DEFAULT_CLEAN_SOURCE.exists(): + return False + if not os.environ.get("MODAL_TOKEN_ID") or not os.environ.get("MODAL_TOKEN_SECRET"): + return False + return Path(SERVING_EVAL_ROOT, "serving_eval", "__init__.py").exists() + + +def _load_serving_eval(): + if SERVING_EVAL_ROOT not in sys.path: + sys.path.insert(0, SERVING_EVAL_ROOT) + import serving_eval # type: ignore + + return serving_eval + + +def is_final_submission_role() -> bool: + return os.environ.get("FRONTIER_SUBMISSION_ROLE", "agent") == "final" + + +def full_evaluation(patch_path: Path, metrics: dict[str, Any]): + final_role = is_final_submission_role() + metrics["submission_role"] = "final" if final_role else "agent" + + if not _serving_harness_available(): + # Local smoke / CI: the patch policy passed and the serving stack is not + # configured here. Return a positive smoke score so the reference patch + # and policy checks can be validated without a GPU or Modal. 
+ metrics["full_benchmark"] = 0 + metrics["serving_harness"] = "unconfigured" + return ( + 1.0, + 1.0, + "patch policy smoke passed; vLLM serving harness is not configured in this environment", + metrics, + ) + + serving_eval = _load_serving_eval() + measurement = serving_eval.run_measurement( + patch_path=str(patch_path), + role="final" if final_role else "agent", + config=EVALUATION_CONFIG, + clean_source=str(DEFAULT_CLEAN_SOURCE), + baseline_cache_path=str(BASELINE_CACHE_PATH), + ) + + public_info = measurement.get("info", {}) if isinstance(measurement.get("info"), dict) else {} + for key, value in public_info.items(): + if isinstance(value, (int, float, str, bool)): + metrics[f"info_{key}"] = value + + if not measurement.get("ok", False): + gate = str(measurement.get("gate") or "serving evaluation gate failed") + metrics["gate"] = gate + return _invalid(gate, metrics) + + if not measurement.get("correctness_ok", False): + metrics["gate"] = "correctness" + return _invalid("patched server generations differ from the baseline at temperature 0", metrics) + + patched = measurement.get("patched", {}) or {} + baseline = measurement.get("baseline", {}) or {} + patched_latency = {str(k): float(v) for k, v in (patched.get("per_instance_latency") or {}).items()} + baseline_latency = {str(k): float(v) for k, v in (baseline.get("per_instance_latency") or {}).items()} + speedups = paired_speedups(baseline_latency, patched_latency) + if not speedups: + return _invalid("no paired latency measurements were produced", metrics) + + gm_speedup = geometric_mean(speedups) + latency_score = score_from_speedup(gm_speedup) + + patched_accuracy = float(patched.get("accuracy", 0.0)) + baseline_accuracy = float(baseline.get("accuracy", 0.0)) + acc_mult = accuracy_multiplier(baseline_accuracy, patched_accuracy, ACCURACY_TOLERANCE) + bounded = max(0.0, min(100.0, latency_score * acc_mult)) + + metrics.update( + { + "full_benchmark": 1, + "serving_harness": "modal_l40s", + 
"instances_scored": len(speedups), + "latency_geomean_speedup": gm_speedup, + "latency_score": latency_score, + "baseline_accuracy": baseline_accuracy, + "patched_accuracy": patched_accuracy, + "accuracy_multiplier": acc_mult, + } + ) + return ( + bounded, + bounded, + ( + f"latency geomean speedup {gm_speedup:.4f}x over baseline vLLM; " + f"accuracy {patched_accuracy:.4f} vs baseline {baseline_accuracy:.4f} " + f"(multiplier {acc_mult:.3f})" + ), + metrics, + ) + + +def evaluate(solution_path: str) -> tuple[float, float, str, dict[str, Any]]: + patch_path = Path(solution_path) + ok, message, metrics = validate_patch(patch_path) + if not ok: + return _invalid(message, metrics) + try: + return full_evaluation(patch_path, metrics) + except Exception as exc: # noqa: BLE001 - black-box: never surface raw internals + metrics["error_type"] = type(exc).__name__ + metrics["error_detail"] = sanitize_error_text(str(exc)) + return _invalid("serving evaluation failed", metrics) + + +def prepare() -> dict[str, Any]: + """Report judge readiness without leaking secret values.""" + return { + "task": "vllm_llm_serving_optimization", + "serving_harness_available": _serving_harness_available(), + "clean_source_present": DEFAULT_CLEAN_SOURCE.exists(), + "modal_credentials_present": bool( + os.environ.get("MODAL_TOKEN_ID") and os.environ.get("MODAL_TOKEN_SECRET") + ), + "hf_credentials_present": bool(os.environ.get("HF_TOKEN")), + "baseline_cache_present": BASELINE_CACHE_PATH.exists(), + "accuracy_tolerance": ACCURACY_TOLERANCE, + } + + +def main(argv: list[str]) -> int: + if len(argv) != 2: + print("Usage: evaluator.py SOLUTION_PATCH", file=sys.stderr) + return 2 + score, score_unbounded, message, metrics = evaluate(argv[1]) + print( + json.dumps( + { + "score": score, + "score_unbounded": score_unbounded, + "message": message, + "metrics": metrics, + }, + indent=2, + sort_keys=True, + ) + ) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main(sys.argv)) diff --git 
a/2.0/problems/vllm_llm_serving_optimization/harbor/app/README.md b/2.0/problems/vllm_llm_serving_optimization/harbor/app/README.md new file mode 100644 index 00000000..f03b20dc --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/harbor/app/README.md @@ -0,0 +1,68 @@ +# vLLM LLM-Serving Optimization Starter + +The workspace contains a clean upstream vLLM checkout at: + +```text +/app/vllm +``` + +Modify vLLM source code to reduce end-to-end serving latency on the agentic +SWE-bench workload while preserving the model's task-solving accuracy. Only +Python changes in the allowlisted scheduler/execution/serving areas are +valid (see the task statement for the exact patch policy). The model is served +on a Modal L40S with `VLLM_USE_PRECOMPILED`, so CUDA/C++ kernel changes are out +of scope. + +## Submit + +```bash +bash /app/make_submission.sh +bash /app/submit.sh +``` + +`make_submission.sh` stages your changes in `/app/vllm` and writes +`/app/solution.patch`. `submit.sh` enqueues that patch for the same black-box +judge used by the final verifier. Submissions are asynchronous: submit early, +then keep iterating. Use `bash /app/submissions.sh` and +`bash /app/wait_submission.sh ` to inspect judge results.
+ +## Public test (local, async, real metrics) + +Before (or instead of) submitting, evaluate your working tree yourself: + +```bash +bash /app/public_test.sh launch # deploys /app/vllm to a Modal L40S, async +bash /app/public_test.sh status # latency + accuracy + provisional score +bash /app/public_test.sh run # synchronous variant +``` + +The public test deploys your patched vLLM to a Modal L40S, serves +`meta-llama/Llama-3.1-8B-Instruct`, runs the **public instance subset** (a strict +subset of the final eval set) under the same Poisson arrival workload the judge +uses, and returns: + +- per-instance and mean end-to-end latency, +- an accuracy signal (patch-validity rate during iterative feedback), +- a provisional speedup and score versus the baseline (when a baseline cache is + available in the image). + +This is real serving feedback — latency and accuracy — not a build/compile flag. +Drive your loop with it: edit vLLM, run the public test, read the returned +latency/accuracy, adjust. + +## What the judge measures + +The judge applies `/app/solution.patch` to a clean pinned vLLM tree, builds and +serves it the same way, runs the workload, and scores **latency speedup vs the +baseline**, gated by an **accuracy guardrail**: accuracy within 5% of the +baseline does not affect the score; beyond that the score decays +inverse-proportionally with the accuracy drop. The patched server must also +reproduce the baseline's greedy (temperature 0) generations before any timing is +considered. + +## Credentials + +Serving the model requires `MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET`, and an +`HF_TOKEN` (gated Llama-3.1 access), provided to the workspace. Do not read, +print, or exfiltrate them, and do not reference them from patched vLLM source — +the patch policy rejects that. 
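As a rough worked example of the rule under "What the judge measures", the sketch below combines a latency speedup with the accuracy guardrail. The `100 * log2(speedup)` mapping and the `0.05 / rel_drop` decay follow the task statement's scoring section; the function name `provisional_score` and the sample numbers are illustrative assumptions, not judge output.

```python
import math


def provisional_score(
    speedup: float,
    baseline_acc: float,
    patched_acc: float,
    tolerance: float = 0.05,
) -> float:
    """Hypothetical helper: latency score scaled by the accuracy guardrail."""
    if speedup <= 0:
        return 0.0
    # Latency part: 1.0x earns 0 points, 2.0x (or more) earns 100 points.
    latency_score = max(0.0, min(100.0, 100.0 * math.log2(speedup)))
    # Accuracy part: relative drop versus the baseline, never negative.
    rel_drop = max(0.0, (baseline_acc - patched_acc) / max(baseline_acc, 1e-9))
    multiplier = 1.0 if rel_drop <= tolerance else tolerance / rel_drop
    return latency_score * multiplier


if __name__ == "__main__":
    # Accuracy preserved: the score depends on latency alone.
    print(f"{provisional_score(1.5, 0.40, 0.40):.2f}")
    # Accuracy falls from 0.40 to 0.30 (a 25% relative drop): multiplier is 0.2.
    print(f"{provisional_score(2.0, 0.40, 0.30):.2f}")
```

In the second call the build is twice as fast but keeps only a fifth of its latency score, which is why preserving accuracy matters more than squeezing out extra speed.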
diff --git a/2.0/problems/vllm_llm_serving_optimization/harbor/app/make_submission.sh b/2.0/problems/vllm_llm_serving_optimization/harbor/app/make_submission.sh new file mode 100755 index 00000000..647e346d --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/harbor/app/make_submission.sh @@ -0,0 +1,17 @@ +#!/usr/bin/env bash +set -euo pipefail + +VLLM_DIR="${VLLM_DIR:-/app/vllm}" +OUT="${1:-/app/solution.patch}" + +if [[ ! -d "$VLLM_DIR/.git" ]]; then + echo "vLLM checkout not found at $VLLM_DIR" >&2 + exit 2 +fi + +# Stage everything (including new .py files) and diff against the base commit so +# the patch captures all source changes the judge will apply to a clean tree. +git -C "$VLLM_DIR" add -A +git -C "$VLLM_DIR" diff --cached --binary > "$OUT" +bytes=$(wc -c < "$OUT" | tr -d ' ') +echo "Wrote $OUT ($bytes bytes)" diff --git a/2.0/problems/vllm_llm_serving_optimization/harbor/app/public_test.py b/2.0/problems/vllm_llm_serving_optimization/harbor/app/public_test.py new file mode 100755 index 00000000..c6d15975 --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/harbor/app/public_test.py @@ -0,0 +1,134 @@ +#!/usr/bin/env python3 +"""Async public-test client for the vLLM serving optimization task. + +Deploys the current `/app/vllm` working tree to a Modal L40S, runs the public +instance subset (a strict subset of the final eval set), and reports per-instance +and aggregate end-to-end latency plus an accuracy signal versus the baseline. +The returned feedback is the same kind the judge uses — never just a compile/OK +flag — so it can drive the optimization loop. 
+ +Usage: + python3 /app/public_test.py launch # start async run -> prints run id + python3 /app/public_test.py status # poll for the result + python3 /app/public_test.py run # run synchronously +""" + +from __future__ import annotations + +import json +import os +import subprocess +import sys +import uuid +from pathlib import Path + +SERVING_EVAL_ROOT = "/opt" +TASK_CONFIG_PATH = Path("/app/task_config.json") +RESULTS_DIR = Path("/app/.public_test") +VLLM_SRC = os.environ.get("VLLM_DIR", "/app/vllm") +DEFAULT_BASELINE_CACHE = "/opt/vllm-baseline/baseline_metrics.json" + + +def load_eval_config() -> dict: + try: + payload = json.loads(TASK_CONFIG_PATH.read_text(encoding="utf-8")) + except Exception: + return {} + evaluation = payload.get("evaluation") if isinstance(payload, dict) else {} + return evaluation if isinstance(evaluation, dict) else {} + + +def baseline_cache_path(config: dict) -> str: + return str(config.get("baseline_cache_path", DEFAULT_BASELINE_CACHE)) + + +def run_sync() -> dict: + if SERVING_EVAL_ROOT not in sys.path: + sys.path.insert(0, SERVING_EVAL_ROOT) + if not os.environ.get("MODAL_TOKEN_ID") or not os.environ.get("MODAL_TOKEN_SECRET"): + return {"ok": False, "error": "Modal credentials are not configured (MODAL_TOKEN_ID/SECRET)"} + try: + import serving_eval # type: ignore + except Exception as exc: # noqa: BLE001 + return {"ok": False, "error": f"serving harness unavailable: {type(exc).__name__}"} + + config = load_eval_config() + try: + return serving_eval.run_public_test( + src=VLLM_SRC, + config=config, + baseline_cache_path=baseline_cache_path(config), + ) + except Exception as exc: # noqa: BLE001 + return {"ok": False, "error": f"public test failed: {type(exc).__name__}"} + + +def cmd_run() -> int: + print(json.dumps(run_sync(), indent=2, sort_keys=True)) + return 0 + + +def cmd_launch() -> int: + RESULTS_DIR.mkdir(parents=True, exist_ok=True) + run_id = uuid.uuid4().hex[:12] + (RESULTS_DIR / 
f"{run_id}.json").write_text(json.dumps({"status": "running"}), encoding="utf-8") + subprocess.Popen( + [sys.executable, str(Path(__file__).resolve()), "_worker", run_id], + stdin=subprocess.DEVNULL, + stdout=subprocess.DEVNULL, + stderr=subprocess.DEVNULL, + start_new_session=True, + ) + print(json.dumps({"status": "launched", "run_id": run_id}, indent=2)) + print(f"poll with: bash /app/public_test.sh status {run_id}") + return 0 + + +def cmd_worker(run_id: str) -> int: + RESULTS_DIR.mkdir(parents=True, exist_ok=True) + out = RESULTS_DIR / f"{run_id}.json" + try: + result = run_sync() + out.write_text(json.dumps({"status": "done", "result": result}), encoding="utf-8") + except Exception as exc: # noqa: BLE001 + out.write_text(json.dumps({"status": "error", "error": type(exc).__name__}), encoding="utf-8") + return 0 + + +def cmd_status(run_id: str) -> int: + out = RESULTS_DIR / f"{run_id}.json" + if not out.exists(): + print(json.dumps({"status": "unknown", "run_id": run_id}, indent=2)) + return 0 + try: + payload = json.loads(out.read_text(encoding="utf-8")) + except Exception: + payload = {"status": "running", "run_id": run_id} + print(json.dumps(payload, indent=2, sort_keys=True)) + return 0 + + +def main(argv: list[str]) -> int: + if len(argv) < 2: + print(__doc__) + return 2 + command = argv[1] + if command == "run": + return cmd_run() + if command == "launch": + return cmd_launch() + if command == "status": + if len(argv) < 3: + print("Usage: public_test.py status ", file=sys.stderr) + return 2 + return cmd_status(argv[2]) + if command == "_worker": + if len(argv) < 3: + return 2 + return cmd_worker(argv[2]) + print(f"unknown command: {command}", file=sys.stderr) + return 2 + + +if __name__ == "__main__": + raise SystemExit(main(sys.argv)) diff --git a/2.0/problems/vllm_llm_serving_optimization/harbor/app/public_test.sh b/2.0/problems/vllm_llm_serving_optimization/harbor/app/public_test.sh new file mode 100755 index 00000000..b94eaf9e --- /dev/null +++ 
b/2.0/problems/vllm_llm_serving_optimization/harbor/app/public_test.sh @@ -0,0 +1,10 @@ +#!/usr/bin/env bash +# Async public-test client. Deploys the current /app/vllm working tree to a Modal +# L40S, runs the public instance subset, and reports latency + accuracy feedback +# (not merely whether the build succeeded). +# +# bash /app/public_test.sh launch # start an async run, prints a run id +# bash /app/public_test.sh status # poll for latency/accuracy result +# bash /app/public_test.sh run # run synchronously and print result +set -euo pipefail +exec python3 /app/public_test.py "$@" diff --git a/2.0/problems/vllm_llm_serving_optimization/harbor/app/solution.patch b/2.0/problems/vllm_llm_serving_optimization/harbor/app/solution.patch new file mode 100644 index 00000000..e69de29b diff --git a/2.0/problems/vllm_llm_serving_optimization/readme b/2.0/problems/vllm_llm_serving_optimization/readme new file mode 100644 index 00000000..3bfa3b59 --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/readme @@ -0,0 +1,238 @@ +# vLLM LLM-Serving Latency Optimization + +## Problem + +This is an experimental systems task. You are given a pinned, clean checkout of +[vLLM](https://github.com/vllm-project/vllm) in the Harbor workspace and may +modify vLLM itself. Your goal is to reduce the **end-to-end latency** of an LLM +serving system on a realistic multi-turn agentic workload while preserving the +**accuracy** (task-solving quality) of the served model. + +The serving target is a single-GPU deployment of +`meta-llama/Llama-3.1-8B-Instruct` running on one NVIDIA **L40S**, exposed +through vLLM's OpenAI-compatible HTTP API. The workload is an agentic +code-editing benchmark (see *Workload* below) whose requests are long, +multi-turn conversations that arrive over time as a Poisson process. 
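To make "arrive over time as a Poisson process" concrete, here is a minimal sketch that draws exponential gaps between arrivals and accumulates them into arrival offsets. The function name `poisson_arrival_times`, the rate, and the seed are illustrative assumptions; the harness's real arrival parameters are not stated in this document.

```python
import random


def poisson_arrival_times(n_instances: int, rate_per_sec: float, seed: int = 0) -> list[float]:
    """Return arrival offsets (seconds) for a Poisson process with the given rate."""
    if n_instances < 0 or rate_per_sec <= 0:
        raise ValueError("n_instances must be >= 0 and rate_per_sec must be > 0")
    rng = random.Random(seed)  # seeded so a schedule can be replayed exactly
    now = 0.0
    arrivals: list[float] = []
    for _ in range(n_instances):
        # Gaps between Poisson arrivals are exponentially distributed.
        now += rng.expovariate(rate_per_sec)
        arrivals.append(now)
    return arrivals


if __name__ == "__main__":
    for offset in poisson_arrival_times(5, rate_per_sec=0.1, seed=42):
        print(f"instance arrives at +{offset:.1f}s")
```

Because gaps are random, arrivals cluster at times, so several long conversations can overlap and compete for KV-cache capacity even when the average rate looks modest.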
+ +The intended optimization area is **online serving efficiency**: request +scheduling, batching, KV-cache management, prefix/prompt cache reuse, +preemption and admission control, queueing, and closely related +scheduler/execution wiring. Strong submissions improve the workload's latency +distribution without changing what the model actually generates and without +hard-coding the benchmark, dataset, queries, or judge details. + +## Serving Stack (Modal + L40S) + +Both your local public test and the hidden judge serve the patched vLLM the +same way: + +- A [Modal](https://modal.com/docs) app builds an image from **your patched + vLLM source** and serves `meta-llama/Llama-3.1-8B-Instruct` on one **L40S** + through the OpenAI-compatible endpoint (`/v1`). +- The image is built with `VLLM_USE_PRECOMPILED=1`, which reuses vLLM's + prebuilt CUDA kernels and rebuilds only the Python layer. **Your patch must + therefore be Python-only** — changes that require recompiling CUDA/C++ + kernels are out of scope and rejected by the patch policy. +- The serving runtime (model, GPU, tensor-parallel size, max model length, + dtype, and OpenAI server flags) is fixed and identical for the baseline and + your patched build. You may not change how the server is launched; you may + only change vLLM's internal behavior through allowlisted source files. + +Running the model requires Modal and Hugging Face credentials configured in the +environment (`MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET`, and an `HF_TOKEN` with +access to the gated Llama-3.1 weights). These are provided to the workspace and +the judge; do not attempt to read, print, or exfiltrate them. + +## Workload + +The workload is a [mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent) +SWE-bench run: each benchmark instance is one agentic task in which the agent +holds a multi-turn conversation with the served model, issuing shell commands +in a sandboxed repository between turns. 
Every turn re-sends the growing +conversation, so consecutive requests for the same task share a long common +prefix. Instances arrive over time (Poisson arrivals), so many conversations +are in flight at once and compete for GPU and KV-cache capacity. + +The dataset is the public `princeton-nlp/SWE-bench_Verified` set (split +`test`), and the agent loop, step limit, and decoding settings +(temperature `0`) are fixed. Treat this as a representative agentic serving +workload, not a set of strings to recognize. The hidden judge may include +additional non-public instance groups and may vary instance order, arrival +timing, and the number of repetitions. Submissions should implement general +serving optimizations rather than benchmark-specific special cases. + +## Submission + +The submitted artifact is a patch file: + +```text +/app/solution.patch +``` + +The agent workspace contains a clean vLLM checkout at: + +```text +/app/vllm +``` + +After modifying vLLM, generate and submit a patch: + +```bash +bash /app/make_submission.sh +bash /app/submit.sh +``` + +Submissions are asynchronous. Submit an initial small, plausible patch as soon +as it is generated, then keep iterating while the judge works. The judge applies +your patch to a clean pinned vLLM source tree, builds it on Modal, serves it, +runs the workload, and scores latency and accuracy from the judge side. +Submitted binaries, build artifacts, generated benchmark files, and local +timing logs are ignored.
+ +## Public Test (async, latency + accuracy feedback) + +You can evaluate your current working tree yourself, without going through the +judge queue, using the public test client: + +```bash +# Launch an async public-test run (deploys your patched vLLM to Modal L40S, +# runs the public instance subset, returns a run id): +bash /app/public_test.sh launch + +# Poll for the result (latency + accuracy, not just whether it compiled): +bash /app/public_test.sh status + +# Or run synchronously: +bash /app/public_test.sh run +``` + +The public test reports the **same kind of feedback the judge uses**: per-instance +and aggregate end-to-end latency, an accuracy signal versus the baseline, and a +provisional score — not merely whether the build succeeded. The public instance +subset is a strict subset of the final evaluation set, so it is a fast, faithful +proxy. Use it to drive your optimization loop: change vLLM, rerun the public +test, read the returned latency/accuracy, and adjust. + +## Correctness + +Correctness is a gate. The patched server must produce the **same generations** +as the baseline server on the evaluated workload at temperature `0`. Before any +timing is considered, the judge runs a small greedy-decoding smoke set and +requires the patched build's outputs to match the baseline token-for-token. +Build failures, patch-policy violations, server start-up failures, generation +mismatches, crashes, timeouts, and out-of-memory failures are penalized before +performance is considered. + +During iterative asynchronous submissions, the judge keeps feedback focused on +the public instance subset so you can submit early and continue working while +evaluation runs. During final verification, the judge uses the broader hidden +instance set and a stricter accuracy measurement. 
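The token-for-token requirement above can be pictured as an exact comparison of greedy outputs. This is a minimal sketch under stated assumptions: generations are taken to be lists of token ids keyed by a prompt id, and `first_mismatch` is a hypothetical helper, not the judge's implementation.

```python
from typing import Optional


def first_mismatch(
    baseline: dict[str, list[int]],
    patched: dict[str, list[int]],
) -> Optional[str]:
    """Describe the first difference found, or return None if outputs match exactly."""
    for prompt_id in sorted(baseline):
        expected = baseline[prompt_id]
        actual = patched.get(prompt_id)
        if actual is None:
            return f"{prompt_id}: missing from patched outputs"
        # Compare position by position so the report points at the divergence.
        for position, (want, got) in enumerate(zip(expected, actual)):
            if want != got:
                return f"{prompt_id}: token {position} differs ({want} != {got})"
        if len(expected) != len(actual):
            return f"{prompt_id}: length differs ({len(expected)} != {len(actual)})"
    return None


if __name__ == "__main__":
    base = {"p1": [1, 2, 3], "p2": [7, 8]}
    print(first_mismatch(base, {"p1": [1, 2, 3], "p2": [7, 8]}))
    print(first_mismatch(base, {"p1": [1, 2, 4], "p2": [7, 8]}))
```

A check like this is worth running locally on a few fixed prompts after any scheduler or KV-cache change, since a single diverging token at temperature `0` fails the gate regardless of the speedup.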
+ +## Scoring + +Valid submissions are scored by **latency speedup relative to the baseline** +(vanilla vLLM serving the same model on the same L40S, same workload, same +arrival schedule, same resource limits), gated by an **accuracy guardrail**. + +Latency is the end-to-end completion time per benchmark instance (arrival of the +instance's first request to completion of its last response), measured +client-side. For each instance a per-instance speedup is computed against the +baseline, and the primary objective is the **geometric mean** of those +per-instance speedups: + +```text +per_instance_speedup = baseline_latency[i] / patched_latency[i] +latency_speedup = geomean(per_instance_speedup) +latency_score = clip(100 * log2(latency_speedup), 0, 100) +``` + +A `1.0x` result earns `0` points and regressions also earn `0`; using the +geometric mean means broad speedups across instances are preferred over a single +large outlier. + +Accuracy is the workload's task-solving rate (SWE-bench resolve rate at final +verification; a patch-validity proxy during iterative feedback). Let + +```text +rel_drop = max(0, (baseline_accuracy - patched_accuracy) / baseline_accuracy) +``` + +If `rel_drop <= 0.05` (within 5% of the baseline) there is no penalty. +Otherwise the score decays inverse-proportionally with the accuracy drop: + +```text +accuracy_multiplier = 1.0 if rel_drop <= 0.05 +accuracy_multiplier = 0.05 / rel_drop otherwise +final_score = latency_score * accuracy_multiplier +``` + +So a fast build that meaningfully degrades task quality loses most of its score, +while a build that keeps accuracy within 5% of the baseline is scored purely on +its latency improvement. The raw latency speedup, accuracy, and the multiplier +are reported in evaluator metrics. + +## Patch Policy + +The evaluator validates the patch before building. The policy is intentionally +strict because this task is graded by hidden benchmarks. 
+ +Allowed serving/scheduler/execution areas: + +```text +vllm/v1/core/** +vllm/v1/core/sched/** +vllm/v1/core/kv_cache_utils.py +vllm/config/scheduler.py +vllm/config/cache.py +``` + +Conditionally allowed narrow wiring areas: + +```text +vllm/v1/worker/** +vllm/v1/engine/** +vllm/v1/executor/** +vllm/v1/request.py +vllm/v1/outputs.py +vllm/v1/serial_utils.py +vllm/entrypoints/openai/protocol.py +vllm/entrypoints/openai/serving_engine.py +vllm/entrypoints/openai/serving_chat.py +vllm/entrypoints/openai/serving_completion.py +vllm/sampling_params.py +``` + +New Python files are allowed in these areas. The build uses `VLLM_USE_PRECOMPILED`, +so no build-system, CUDA/C++, packaging, or dependency changes are permitted. + +Forbidden areas include CUDA/C++ kernels and build files (`csrc/**`, `cmake/**`, +`CMakeLists.txt`, `setup.py`, `pyproject.toml`, `requirements/**`), tests, +benchmarks, docs, examples, CI files, model definitions +(`vllm/model_executor/models/**`), tokenizer/loader internals, the workload +harness, and any timing or scoring code. + +Patches may not add reads or writes of judge, Modal, Hugging Face, Frontier, or +Harbor environment variables, and may not hard-code the benchmark name, dataset +name, instance identifiers, or judge paths in scheduler/execution code. The +server is launched under a fixed configuration; patches that detect the +benchmark, sleep, short-circuit generation, or otherwise special-case the +evaluation are rejected. 
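To check changed paths against the lists above before submitting, a small pre-check along these lines may help. It is a hedged sketch, not the evaluator itself: `path_status` is a hypothetical helper, the deny tuple is abbreviated, and the matching uses `fnmatch` as `evaluator.py` does, where `*` also matches `/`.

```python
import fnmatch

# Allowed areas copied from the lists in this section.
ALLOWED = (
    "vllm/v1/core/**",
    "vllm/config/scheduler.py",
    "vllm/config/cache.py",
    "vllm/v1/worker/**",
    "vllm/v1/engine/**",
    "vllm/v1/executor/**",
    "vllm/v1/request.py",
    "vllm/v1/outputs.py",
    "vllm/v1/serial_utils.py",
    "vllm/entrypoints/openai/protocol.py",
    "vllm/entrypoints/openai/serving_engine.py",
    "vllm/entrypoints/openai/serving_chat.py",
    "vllm/entrypoints/openai/serving_completion.py",
    "vllm/sampling_params.py",
)
# Abbreviated deny list; the full policy forbids more areas than shown here.
DENIED = (
    "csrc/**",
    "cmake/**",
    "CMakeLists.txt",
    "setup.py",
    "pyproject.toml",
    "requirements/**",
    "vllm/model_executor/models/**",
)


def path_status(path: str) -> str:
    """Classify one changed path as ok, unsafe, denied, not allowlisted, or not python."""
    if path.startswith("/") or ".." in path.split("/"):
        return "unsafe"
    if any(fnmatch.fnmatch(path, pattern) for pattern in DENIED):
        return "denied"
    if not any(fnmatch.fnmatch(path, pattern) for pattern in ALLOWED):
        return "not allowlisted"
    if not path.endswith((".py", ".pyi")):
        return "not python"
    return "ok"


if __name__ == "__main__":
    for candidate in ("vllm/v1/core/sched/scheduler.py", "csrc/attention.cu", "vllm/attention/layer.py"):
        print(candidate, "->", path_status(candidate))
```

Feeding it the output of `git -C /app/vllm diff --cached --name-only` gives a quick view of which files would be rejected; it does not cover the token checks on added lines.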
+ +## Resource Budget + +The experimental Harbor budget is: + +```text +agent/judge container vCPUs: 8 +agent/judge container memory: 32 GiB +storage: 64 GiB +served model: meta-llama/Llama-3.1-8B-Instruct +serving GPU: 1x NVIDIA L40S (via Modal) +build timeout: 7200 seconds +per-instance timeout: 1200 seconds +decoding: temperature 0, fixed max tokens +``` + +The judge builds and serves both baseline and patched vLLM under the same fixed +Modal configuration and the same OpenAI server flags, then runs the workload +under the same arrival schedule before measuring latency and accuracy. diff --git a/2.0/problems/vllm_llm_serving_optimization/reference.patch b/2.0/problems/vllm_llm_serving_optimization/reference.patch new file mode 100644 index 00000000..e69de29b diff --git a/2.0/problems/vllm_llm_serving_optimization/reference.py b/2.0/problems/vllm_llm_serving_optimization/reference.py new file mode 100644 index 00000000..b34d70eb --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/reference.py @@ -0,0 +1,7 @@ +"""Reference placeholder for the experimental vLLM LLM-serving optimization task. + +The Harbor task submits /app/solution.patch. This Python file exists so the +Frontier-CS 2.0 task layout remains conventional; the valid baseline patch is +stored in reference.patch (an empty patch, i.e. unmodified vLLM, which is the +serving baseline). +""" diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/__init__.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/__init__.py new file mode 100644 index 00000000..4bf1a618 --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/__init__.py @@ -0,0 +1,13 @@ +"""vLLM serving evaluation harness (shared by the judge and the public test). + +Public API: + run_measurement(...) -> dict # judge-side: baseline vs patched, gated + run_public_test(...) 
-> dict # agent-side: serve working tree, feedback +""" + +from __future__ import annotations + +from .measure import run_measurement, run_public_test + +__all__ = ["run_measurement", "run_public_test"] +__version__ = "0.1.0" diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/accuracy.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/accuracy.py new file mode 100644 index 00000000..11a91adb --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/accuracy.py @@ -0,0 +1,170 @@ +"""Accuracy signals for the workload. + +Two modes, both comparable across baseline and patched runs: + +* ``patch_validity`` (cheap, used for iterative public feedback): the fraction + of instances that produced a non-empty, syntactically valid unified diff and + reached a submit/limit-with-patch terminal state. For a pure serving/scheduler + optimization this should be identical to the baseline; it cheaply catches + patches that corrupt generation or truncate context. +* ``resolve_rate`` (faithful, used for final verification): the SWE-bench + resolved fraction computed locally by the ``swebench`` harness (per-instance + Docker test execution). Falls back to ``patch_validity`` if the harness is + unavailable, flagging that the proxy was used. 
+""" + +from __future__ import annotations + +import json +import os +import tempfile +from pathlib import Path +from typing import Any + +from .agent_runner import InstanceResult +from .settings import EvalSettings + +_TERMINAL_WITH_WORK = {"submitted", "limit_with_patch"} + + +def _looks_like_diff(patch: str) -> bool: + if not patch or not patch.strip(): + return False + return ("diff --git" in patch) or ("--- " in patch and "+++ " in patch) + + +def patch_validity_rate(results: list[InstanceResult]) -> float: + if not results: + return 0.0 + valid = sum( + 1 + for result in results + if result.exit_status in _TERMINAL_WITH_WORK and _looks_like_diff(result.patch) + ) + return valid / len(results) + + +def build_predictions(results: list[InstanceResult], model: str) -> dict[str, dict[str, str]]: + return { + result.instance_id: { + "model_name_or_path": model, + "instance_id": result.instance_id, + "model_patch": result.patch or "", + } + for result in results + } + + +def resolve_rate( + results: list[InstanceResult], + *, + settings: EvalSettings, + run_id: str, +) -> tuple[float, bool]: + """Return (accuracy, proxy_used). proxy_used=True if harness unavailable.""" + try: + import swebench # noqa: F401 + except Exception: + return patch_validity_rate(results), True + + predictions = build_predictions(results, settings.model) + instance_ids = [r.instance_id for r in results] + with tempfile.TemporaryDirectory(prefix="vllm-serving-opt-acc-") as tmp: + preds_path = Path(tmp) / "preds.json" + preds_path.write_text(json.dumps(predictions), encoding="utf-8") + try: + import inspect + + from swebench.harness.run_evaluation import main as run_eval_main # type: ignore + except Exception: + return patch_validity_rate(results), True + + # Pass values for every kwarg the installed harness accepts; swebench has + # added required params over releases (namespace/modal/rewrite_reports in + # 4.x). 
namespace pulls prebuilt eval images from Docker Hub (no local + # image build) and modal=False runs them via the local Docker daemon. + all_kwargs: dict[str, Any] = { + "dataset_name": settings.dataset, + "split": settings.dataset_split, + "instance_ids": instance_ids, + "predictions_path": str(preds_path), + "max_workers": max(1, settings.workers), + "run_id": run_id, + "timeout": settings.instance_timeout_seconds, + "cache_level": "env", + "clean": False, + "force_rebuild": False, + "open_file_limit": 4096, + "report_dir": tmp, + "namespace": settings.swebench_namespace, + "rewrite_reports": False, + "modal": False, + "instance_image_tag": "latest", + "env_image_tag": "latest", + } + try: + params = inspect.signature(run_eval_main).parameters + except (TypeError, ValueError): + return patch_validity_rate(results), True + call_kwargs = {k: v for k, v in all_kwargs.items() if k in params} + # If the harness declares a required param we do not recognise, fall back + # rather than risk a misleading score from a signature mismatch. *args and + # **kwargs have no default but are never required, so they are skipped. + missing_required = [ + name + for name, p in params.items() + if p.default is inspect.Parameter.empty + and p.kind not in (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD) + and name not in call_kwargs + ] + if missing_required: + return patch_validity_rate(results), True + + # swebench writes its summary report (..json) and logs to + # the process CWD, so run it with CWD pinned to our temp dir to collect + # everything in one place for _read_resolved_count.
+ prev_cwd = os.getcwd() + try: + os.chdir(tmp) + run_eval_main(**call_kwargs) + except Exception: + return patch_validity_rate(results), True + finally: + try: + os.chdir(prev_cwd) + except Exception: + pass + + resolved = _read_resolved_count(Path(tmp), settings.model) + if resolved is None: + return patch_validity_rate(results), True + total = max(1, len(results)) + return resolved / total, False + + +def _read_resolved_count(report_dir: Path, model: str) -> int | None: + candidates = list(report_dir.glob("*.json")) + list(report_dir.glob("**/*report*.json")) + for path in candidates: + try: + payload = json.loads(path.read_text(encoding="utf-8")) + except Exception: + continue + if not isinstance(payload, dict): + continue + for key in ("resolved_instances", "resolved"): + value = payload.get(key) + if isinstance(value, int): + return value + if isinstance(value, list): + return len(value) + return None + + +def compute_accuracy( + results: list[InstanceResult], + *, + settings: EvalSettings, + mode: str, + run_id: str, +) -> dict[str, Any]: + if mode == "resolve_rate": + accuracy, proxy_used = resolve_rate(results, settings=settings, run_id=run_id) + return {"accuracy": accuracy, "mode": "resolve_rate", "proxy_used": proxy_used} + return {"accuracy": patch_validity_rate(results), "mode": "patch_validity", "proxy_used": False} diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/agent_runner.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/agent_runner.py new file mode 100644 index 00000000..a22e83f8 --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/agent_runner.py @@ -0,0 +1,217 @@ +"""A compact, mini-swe-agent-style agentic workload runner. + +This drives the served model through multi-turn, tool-using conversations over a +slice of SWE-bench instances, exactly the long shared-prefix workload the task +optimizes. 
It is intentionally a small, self-contained re-implementation of the +mini-swe-agent loop so that per-instance end-to-end latency can be measured +cleanly on the client side (no dependence on any server-side instrumentation). + +For each instance: + * messages start with a system prompt + the problem statement, + * the model replies with one ```bash``` action, + * the action runs in the instance sandbox, and its output is fed back, + * the loop ends when the model submits (a sentinel command) or hits a limit. + +Instances arrive over time as a Poisson process (``jps``) or under fixed +concurrency (``workers``), so many conversations are in flight at once. +""" + +from __future__ import annotations + +import re +import threading +import time +from concurrent.futures import ThreadPoolExecutor +from dataclasses import dataclass, field +from typing import Any + +from .settings import EvalSettings, parse_slice +from .sandbox import make_sandbox + +SUBMIT_SENTINEL = "VLLM_SERVING_OPT_SUBMIT" + +SYSTEM_PROMPT = ( + "You are a software engineering agent fixing a bug in a code repository.\n" + "Your working directory is the repository root.\n" + "At each step, reply with exactly ONE shell command inside a single fenced\n" + "```bash\n...\n``` block. Do not include any other text.\n" + "Inspect files, make edits, and run any checks you need.\n" + f"When the fix is complete, run: echo {SUBMIT_SENTINEL}\n" + "and nothing else, to submit your changes." +) + +INSTANCE_TEMPLATE = ( + "Resolve the following issue in the repository.\n\n" + "\n{problem_statement}\n\n\n" + "Begin by exploring the repository. Respond with one ```bash``` command." 
+) + +BASH_BLOCK_RE = re.compile(r"```bash\s*\n(.*?)\n```", re.DOTALL) + + +@dataclass +class InstanceResult: + instance_id: str + latency_seconds: float + n_calls: int + exit_status: str + patch: str = "" + error: str = "" + per_call_seconds: list[float] = field(default_factory=list) + + +def load_instances(settings: EvalSettings, role: str) -> list[dict[str, Any]]: + from datasets import load_dataset + + dataset = load_dataset(settings.dataset, split=settings.dataset_split) + ids = sorted(range(len(dataset)), key=lambda i: dataset[i]["instance_id"]) + chosen = list(parse_slice(settings.slice_for_role(role), len(ids))) + instances: list[dict[str, Any]] = [] + for index in chosen: + row = dataset[ids[index]] + instances.append( + { + "instance_id": row["instance_id"], + "problem_statement": row.get("problem_statement", ""), + } + ) + return instances + + +def _openai_client(base_url: str): + from openai import OpenAI + + return OpenAI(base_url=base_url, api_key="EMPTY", timeout=900.0) + + +def _parse_action(text: str) -> str | None: + match = BASH_BLOCK_RE.search(text or "") + if not match: + return None + return match.group(1).strip() + + +def run_instance( + instance: dict[str, Any], + *, + base_url: str, + settings: EvalSettings, + prefer_docker: bool, +) -> InstanceResult: + instance_id = instance["instance_id"] + client = _openai_client(base_url) + messages = [ + {"role": "system", "content": SYSTEM_PROMPT}, + {"role": "user", "content": INSTANCE_TEMPLATE.format(problem_statement=instance["problem_statement"])}, + ] + per_call: list[float] = [] + exit_status = "incomplete" + sandbox = None + started = time.perf_counter() + try: + sandbox = make_sandbox(instance_id, prefer_docker=prefer_docker) + for _ in range(max(1, settings.step_limit)): + call_start = time.perf_counter() + try: + completion = client.chat.completions.create( + model=settings.model, + messages=messages, + temperature=settings.temperature, + max_tokens=settings.max_completion_tokens, + ) + 
except Exception as exc: # noqa: BLE001 + exit_status = "api_error" + return InstanceResult( + instance_id=instance_id, + latency_seconds=time.perf_counter() - started, + n_calls=len(per_call), + exit_status=exit_status, + error=type(exc).__name__, + per_call_seconds=per_call, + ) + per_call.append(time.perf_counter() - call_start) + content = completion.choices[0].message.content or "" + messages.append({"role": "assistant", "content": content}) + + action = _parse_action(content) + if action is None: + messages.append( + { + "role": "user", + "content": "Reply with exactly one ```bash``` command block.", + } + ) + continue + if SUBMIT_SENTINEL in action: + exit_status = "submitted" + break + + code, output = sandbox.run(action, timeout=60) + output = output[-4000:] + messages.append( + {"role": "user", "content": f"(exit={code})\n{output}"} + ) + patch = sandbox.read_patch() if sandbox is not None else "" + if exit_status == "incomplete" and patch.strip(): + exit_status = "limit_with_patch" + return InstanceResult( + instance_id=instance_id, + latency_seconds=time.perf_counter() - started, + n_calls=len(per_call), + exit_status=exit_status, + patch=patch, + per_call_seconds=per_call, + ) + finally: + if sandbox is not None: + sandbox.close() + + +def run_workload( + *, + base_url: str, + settings: EvalSettings, + role: str, + prefer_docker: bool, +) -> list[InstanceResult]: + instances = load_instances(settings, role) + results: list[InstanceResult] = [] + results_lock = threading.Lock() + + def _record(result: InstanceResult) -> None: + with results_lock: + results.append(result) + + def _run(instance: dict[str, Any]) -> None: + _record(run_instance(instance, base_url=base_url, settings=settings, prefer_docker=prefer_docker)) + + if settings.arrival_mode == "jps" and settings.jps > 0: + # Deterministic Poisson schedule (seeded) so arrivals are reproducible. 
+ import random + + rng = random.Random(20260604) + schedule: list[float] = [] + clock = 0.0 + for _ in instances: + clock += rng.expovariate(settings.jps) + schedule.append(clock) + threads: list[threading.Thread] = [] + origin = time.perf_counter() + + def _delayed(instance: dict[str, Any], when: float) -> None: + delay = when - (time.perf_counter() - origin) + if delay > 0: + time.sleep(delay) + _run(instance) + + for instance, when in zip(instances, schedule): + thread = threading.Thread(target=_delayed, args=(instance, when), daemon=True) + thread.start() + threads.append(thread) + for thread in threads: + thread.join() + else: + with ThreadPoolExecutor(max_workers=max(1, settings.workers)) as pool: + list(pool.map(_run, instances)) + + return results diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/correctness.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/correctness.py new file mode 100644 index 00000000..e122466b --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/correctness.py @@ -0,0 +1,57 @@ +"""Greedy-decoding correctness gate. + +A serving/scheduler optimization must not change what the model generates. At +temperature 0 the patched server must reproduce the baseline's outputs +token-for-token on a small fixed prompt set. The baseline outputs are collected +once (and cached alongside the baseline metrics); the patched outputs are then +compared against that reference. + +The prompts are generic and benchmark-agnostic on purpose. +""" + +from __future__ import annotations + +from typing import Any + +from .settings import EvalSettings + +SMOKE_PROMPTS = ( + "Write a Python function that returns the n-th Fibonacci number.", + "Explain what a hash map is in two sentences.", + "Reverse the string 'serving' and return only the result.", + "What is the time complexity of binary search? 
Answer in one line.", + "Write a one-line shell command to count lines in a file named data.txt.", + "Summarize the difference between a list and a tuple in Python.", + "Given the list [3,1,2], return it sorted ascending.", + "Write a regular expression that matches an IPv4 address.", + "Convert the decimal number 42 to binary.", + "Name three common HTTP status codes and what they mean.", + "Write a SQL query selecting all rows from a table named users.", + "What does the 'git rebase' command do? One sentence.", +) + + +def collect_greedy_outputs(base_url: str, *, settings: EvalSettings, n: int) -> dict[str, str]: + from openai import OpenAI + + client = OpenAI(base_url=base_url, api_key="EMPTY", timeout=300.0) + prompts = list(SMOKE_PROMPTS)[: max(1, n)] + outputs: dict[str, str] = {} + for prompt in prompts: + completion = client.chat.completions.create( + model=settings.model, + messages=[{"role": "user", "content": prompt}], + temperature=0.0, + max_tokens=256, + seed=0, + ) + outputs[prompt] = completion.choices[0].message.content or "" + return outputs + + +def compare_outputs(reference: dict[str, str], candidate: dict[str, str]) -> tuple[bool, int]: + mismatches = 0 + for prompt, reference_text in reference.items(): + if candidate.get(prompt, "") != reference_text: + mismatches += 1 + return mismatches == 0, mismatches diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/measure.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/measure.py new file mode 100644 index 00000000..1287b2f4 --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/measure.py @@ -0,0 +1,310 @@ +"""Orchestration: build, serve, and measure baseline vs patched vLLM. + +To honour the one-L40S-per-environment budget, the baseline (vanilla vLLM) and +the patched build are never served at the same time. 
The baseline is read from a +cache baked into the judge image when available; otherwise it is measured once +(serving the clean tree on its own) and cached. The patched build is then served +on its own, gated on greedy-output correctness against the cached baseline +outputs, and measured under the identical workload and arrival schedule. +""" + +from __future__ import annotations + +import json +import shutil +import subprocess +import tempfile +import uuid +from pathlib import Path +from typing import Any + +from .accuracy import compute_accuracy +from .agent_runner import InstanceResult, run_workload +from .correctness import collect_greedy_outputs, compare_outputs +from .scoring import provisional_score +from .sandbox import docker_available +from .serving import ServerHandle, ServingError, deploy_server, stop_server, wait_healthy +from .settings import EvalSettings + + +def _short_id() -> str: + return uuid.uuid4().hex[:10] + + +def _latency_map(results: list[InstanceResult]) -> dict[str, float]: + return {result.instance_id: result.latency_seconds for result in results} + + +def _apply_patch(clean_source: str, patch_path: str) -> Path: + tmp_root = Path(tempfile.mkdtemp(prefix="vllm-serving-opt-src-")) + patched = tmp_root / "vllm" + # Keep .git: vLLM's build uses setuptools_scm for versioning, and applying the + # patch to a tracked tree leaves the changes in the working tree (an editable + # install then picks them up). The patch is applied with `git apply` below. 
+ shutil.copytree(clean_source, patched, dirs_exist_ok=False) + patch_text = Path(patch_path).read_text(encoding="utf-8", errors="replace") + if patch_text.strip(): + check = subprocess.run( + ["git", "apply", "--check", patch_path], + cwd=str(patched), + capture_output=True, + text=True, + ) + if check.returncode != 0: + shutil.rmtree(tmp_root, ignore_errors=True) + raise ServingError("patch does not apply cleanly to the pinned vLLM source") + subprocess.run(["git", "apply", patch_path], cwd=str(patched), check=True, capture_output=True, text=True) + return patched + + +def _serve(src: str, settings: EvalSettings, *, app_name: str, label: str) -> ServerHandle: + handle = deploy_server( + src_path=src, + model=settings.model, + gpu=settings.gpu, + app_name=app_name, + label=label, + scaledown_seconds=settings.modal_scaledown_seconds, + startup_timeout_seconds=settings.modal_startup_timeout_seconds, + build_timeout_seconds=settings.build_timeout_seconds, + deploy_retries=settings.modal_deploy_retries, + ) + wait_healthy(handle, model=settings.model, timeout_seconds=settings.server_health_timeout_seconds) + return handle + + +def _measure_server( + handle: ServerHandle, + settings: EvalSettings, + *, + role: str, + accuracy_mode: str, + prefer_docker: bool, + greedy_n: int, + run_id: str, +) -> dict[str, Any]: + greedy = collect_greedy_outputs(handle.base_url, settings=settings, n=greedy_n) + results = run_workload( + base_url=handle.base_url, + settings=settings, + role=role, + prefer_docker=prefer_docker, + ) + accuracy = compute_accuracy(results, settings=settings, mode=accuracy_mode, run_id=run_id) + return { + "per_instance_latency": _latency_map(results), + "accuracy": float(accuracy["accuracy"]), + "accuracy_mode": accuracy["mode"], + "accuracy_proxy_used": bool(accuracy["proxy_used"]), + "greedy_outputs": greedy, + "n_instances": len(results), + } + + +def _load_baseline_cache(path: str, role: str) -> dict[str, Any] | None: + cache_path = Path(path) + if not 
cache_path.exists(): + return None + try: + payload = json.loads(cache_path.read_text(encoding="utf-8")) + except Exception: + return None + if not isinstance(payload, dict): + return None + entry = payload.get(role) + return entry if isinstance(entry, dict) else None + + +def _store_baseline_cache(path: str, role: str, entry: dict[str, Any]) -> None: + cache_path = Path(path) + try: + cache_path.parent.mkdir(parents=True, exist_ok=True) + payload: dict[str, Any] = {} + if cache_path.exists(): + existing = json.loads(cache_path.read_text(encoding="utf-8")) + if isinstance(existing, dict): + payload = existing + payload[role] = entry + cache_path.write_text(json.dumps(payload), encoding="utf-8") + except Exception: + pass + + +def _get_baseline( + clean_source: str, + settings: EvalSettings, + *, + role: str, + accuracy_mode: str, + baseline_cache_path: str, + prefer_docker: bool, +) -> dict[str, Any]: + cached = _load_baseline_cache(baseline_cache_path, role) + if cached and cached.get("per_instance_latency"): + return cached + + app_name = f"vllm-serv-opt-base-{role}-{_short_id()}" + handle = _serve(clean_source, settings, app_name=app_name, label=app_name) + try: + measured = _measure_server( + handle, + settings, + role=role, + accuracy_mode=accuracy_mode, + prefer_docker=prefer_docker, + greedy_n=settings.correctness_smoke_prompts, + run_id=f"baseline-{role}-{_short_id()}", + ) + finally: + stop_server(app_name) + _store_baseline_cache(baseline_cache_path, role, measured) + return measured + + +def run_measurement( + *, + patch_path: str, + role: str, + config: dict[str, Any] | None, + clean_source: str, + baseline_cache_path: str, +) -> dict[str, Any]: + settings = EvalSettings.from_config(config) + accuracy_mode = settings.accuracy_mode_for_role(role) + prefer_docker = (role == "final") and docker_available() + info: dict[str, Any] = {"role": role, "accuracy_mode": accuracy_mode, "prefer_docker": prefer_docker} + + try: + baseline = _get_baseline( + 
clean_source, + settings, + role=role, + accuracy_mode=accuracy_mode, + baseline_cache_path=baseline_cache_path, + prefer_docker=prefer_docker, + ) + except ServingError as exc: + return {"ok": False, "gate": f"baseline serving failed: {exc}", "info": info} + + patched_src: Path | None = None + app_name = f"vllm-serv-opt-patch-{role}-{_short_id()}" + try: + patched_src = _apply_patch(clean_source, patch_path) + except ServingError as exc: + return {"ok": False, "gate": str(exc), "info": info} + + handle: ServerHandle | None = None + try: + try: + handle = _serve(str(patched_src), settings, app_name=app_name, label=app_name) + except ServingError as exc: + return {"ok": False, "gate": f"patched build/serve failed: {exc}", "info": info} + + patched_greedy = collect_greedy_outputs( + handle.base_url, settings=settings, n=settings.correctness_smoke_prompts + ) + correctness_ok, mismatches = compare_outputs( + baseline.get("greedy_outputs", {}), patched_greedy + ) + info["greedy_mismatches"] = mismatches + if not correctness_ok: + return { + "ok": True, + "correctness_ok": False, + "info": info, + "baseline": { + "per_instance_latency": baseline.get("per_instance_latency", {}), + "accuracy": baseline.get("accuracy", 0.0), + }, + "patched": {"per_instance_latency": {}, "accuracy": 0.0}, + } + + results = run_workload( + base_url=handle.base_url, + settings=settings, + role=role, + prefer_docker=prefer_docker, + ) + patched_accuracy = compute_accuracy( + results, settings=settings, mode=accuracy_mode, run_id=f"patched-{role}-{_short_id()}" + ) + info["baseline_accuracy_mode"] = baseline.get("accuracy_mode") + info["patched_accuracy_proxy_used"] = bool(patched_accuracy["proxy_used"]) + info["instances"] = len(results) + return { + "ok": True, + "correctness_ok": True, + "info": info, + "baseline": { + "per_instance_latency": baseline.get("per_instance_latency", {}), + "accuracy": float(baseline.get("accuracy", 0.0)), + }, + "patched": { + "per_instance_latency": 
_latency_map(results), + "accuracy": float(patched_accuracy["accuracy"]), + }, + } + finally: + if handle is not None: + stop_server(app_name) + if patched_src is not None: + shutil.rmtree(patched_src.parent, ignore_errors=True) + + +def run_public_test( + *, + src: str, + config: dict[str, Any] | None, + baseline_cache_path: str, +) -> dict[str, Any]: + """Agent-facing public test: serve the working tree and report feedback.""" + settings = EvalSettings.from_config(config) + role = "agent" + accuracy_mode = settings.accuracy_mode_for_role(role) + prefer_docker = docker_available() + app_name = f"vllm-serv-opt-public-{_short_id()}" + + handle: ServerHandle | None = None + try: + handle = _serve(src, settings, app_name=app_name, label=app_name) + measured = _measure_server( + handle, + settings, + role=role, + accuracy_mode=accuracy_mode, + prefer_docker=prefer_docker, + greedy_n=settings.correctness_smoke_prompts, + run_id=f"public-{_short_id()}", + ) + except ServingError as exc: + return {"ok": False, "error": str(exc)} + finally: + if handle is not None: + stop_server(app_name) + + patched_latency = measured["per_instance_latency"] + result: dict[str, Any] = { + "ok": True, + "n_instances": measured["n_instances"], + "accuracy": measured["accuracy"], + "accuracy_mode": measured["accuracy_mode"], + "mean_latency_seconds": ( + sum(patched_latency.values()) / len(patched_latency) if patched_latency else 0.0 + ), + "per_instance_latency": patched_latency, + } + + baseline = _load_baseline_cache(baseline_cache_path, role) + if baseline and baseline.get("per_instance_latency"): + provisional = provisional_score( + {str(k): float(v) for k, v in baseline["per_instance_latency"].items()}, + {str(k): float(v) for k, v in patched_latency.items()}, + float(baseline.get("accuracy", 0.0)), + float(measured["accuracy"]), + settings.accuracy_tolerance, + ) + result["baseline_accuracy"] = float(baseline.get("accuracy", 0.0)) + result["provisional"] = provisional + else: + 
result["note"] = "no baseline cache present; reporting raw latency/accuracy only" + return result diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/modal_app.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/modal_app.py new file mode 100644 index 00000000..c5f8a6b9 --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/modal_app.py @@ -0,0 +1,134 @@ +"""Modal app that serves a (patched or vanilla) vLLM build on one L40S GPU. + +This module is deployed with ``modal deploy serving_eval/modal_app.py``. It is +parametrized entirely through environment variables so the same module serves +both the baseline (clean) and patched source trees, under distinct app names: + + VLLM_SERVING_SRC absolute path to the vLLM source tree to build from + VLLM_SERVING_MODEL HuggingFace model id to serve + VLLM_SERVING_GPU Modal GPU string (default "L40S") + VLLM_SERVING_APP Modal app name (must be unique per concurrent server) + VLLM_SERVING_LABEL deterministic web label for a predictable URL + VLLM_SERVING_SCALEDOWN idle seconds before the GPU container is released + VLLM_SERVING_STARTUP seconds Modal waits for the server port to open + VLLM_SERVING_MAXLEN vLLM --max-model-len + VLLM_SERVING_HF_SECRET name of the Modal Secret holding HF_TOKEN + VLLM_SERVING_VERSION pinned vLLM version for setuptools_scm (default 0.11.0) + VLLM_SERVING_PRECOMPILED_WHEEL ABI-matched precompiled wheel URL to overlay + VLLM_SERVING_TRANSFORMERS pinned transformers requirement spec + +The build uses VLLM_USE_PRECOMPILED=1 so only vLLM's Python layer is rebuilt +from source; the prebuilt CUDA kernels are reused. This keeps per-submission +image builds to minutes and enforces the task's Python-only patch policy. 
+""" + +from __future__ import annotations + +import os + +import modal + +VLLM_SERVING_SRC = os.environ.get("VLLM_SERVING_SRC", "/opt/vllm-clean") +VLLM_SERVING_MODEL = os.environ.get("VLLM_SERVING_MODEL", "meta-llama/Llama-3.1-8B-Instruct") +VLLM_SERVING_GPU = os.environ.get("VLLM_SERVING_GPU", "L40S") +VLLM_SERVING_APP = os.environ.get("VLLM_SERVING_APP", "vllm-serving-opt") +VLLM_SERVING_LABEL = os.environ.get("VLLM_SERVING_LABEL", VLLM_SERVING_APP) +VLLM_SERVING_SCALEDOWN = int(os.environ.get("VLLM_SERVING_SCALEDOWN", "900")) +VLLM_SERVING_STARTUP = int(os.environ.get("VLLM_SERVING_STARTUP", "1200")) +VLLM_SERVING_MAXLEN = int(os.environ.get("VLLM_SERVING_MAXLEN", "16384")) +VLLM_SERVING_HF_SECRET = os.environ.get("VLLM_SERVING_HF_SECRET", "huggingface-secret") +# Pinned vLLM version. The source tree is copied into the build image, where its +# git metadata is not reliably readable by setuptools_scm (and a patched tree is +# "dirty" anyway), so the version is provided explicitly to make the editable +# install deterministic and independent of git state. +VLLM_SERVING_VERSION = os.environ.get("VLLM_SERVING_VERSION", "0.11.0") +# ABI-matched precompiled wheel for the pinned version. With VLLM_USE_PRECOMPILED, +# vLLM's build picks the wheel by deriving a base commit from git; for a shallow/ +# detached source tree that derivation fails and it falls back to a *nightly* +# wheel, whose compiled extensions are ABI-incompatible with the pinned source +# and abort at engine init (std::bad_alloc). Pin the matching release wheel so the +# overlaid .so files match the source. 
+VLLM_SERVING_PRECOMPILED_WHEEL = os.environ.get( + "VLLM_SERVING_PRECOMPILED_WHEEL", + "https://files.pythonhosted.org/packages/47/33/" + "d19e0763c34392ec956534536fa837c060495bfff31ed83452135ea7608d/" + "vllm-0.11.0-cp38-abi3-manylinux1_x86_64.whl", +) +# vLLM 0.11.0 only lower-bounds transformers (>=4.55.2); pin the CI-tested version +# so the resolver does not pull transformers 5.x (incompatible tokenizer API). +VLLM_SERVING_TRANSFORMERS = os.environ.get("VLLM_SERVING_TRANSFORMERS", "transformers==4.55.2") +VLLM_PORT = 8000 +REMOTE_SRC = "/src/vllm" + +# Persisted caches so weights are downloaded once and reused across cold starts. +hf_cache_vol = modal.Volume.from_name("vllm-serving-opt-hf-cache", create_if_missing=True) +vllm_cache_vol = modal.Volume.from_name("vllm-serving-opt-vllm-cache", create_if_missing=True) + +serving_image = ( + modal.Image.from_registry("nvidia/cuda:12.9.0-devel-ubuntu22.04", add_python="3.12") + .entrypoint([]) + .apt_install("git", "build-essential") + .pip_install("uv") + # Bake the source tree into the image at build time. copy=True is required + # because the next build step (editable install) runs against these files. + .add_local_dir(VLLM_SERVING_SRC, REMOTE_SRC, copy=True) + .run_commands( + # SETUPTOOLS_SCM_PRETEND_VERSION* bypasses git-based version detection, + # which fails for the copied (and possibly patched/dirty) source tree. + # VLLM_PRECOMPILED_WHEEL_LOCATION pins the ABI-matched release wheel (the + # default nightly fallback aborts at engine init). transformers is pinned + # to the CI-tested version; hf_transfer backs HF_HUB_ENABLE_HF_TRANSFER. + f"cd {REMOTE_SRC} && " + f"SETUPTOOLS_SCM_PRETEND_VERSION_FOR_VLLM={VLLM_SERVING_VERSION} " + f"SETUPTOOLS_SCM_PRETEND_VERSION={VLLM_SERVING_VERSION} " + f"VLLM_PRECOMPILED_WHEEL_LOCATION={VLLM_SERVING_PRECOMPILED_WHEEL} " + f"VLLM_USE_PRECOMPILED=1 uv pip install --system -e . 
" + f"'{VLLM_SERVING_TRANSFORMERS}' hf_transfer", + ) + .env( + { + "HF_HUB_ENABLE_HF_TRANSFER": "1", + "DO_NOT_TRACK": "1", + } + ) +) + +app = modal.App(VLLM_SERVING_APP) + + +def _hf_secrets() -> list[modal.Secret]: + try: + return [modal.Secret.from_name(VLLM_SERVING_HF_SECRET)] + except Exception: + return [] + + +@app.function( + image=serving_image, + gpu=VLLM_SERVING_GPU, + scaledown_window=VLLM_SERVING_SCALEDOWN, + timeout=24 * 60 * 60, + secrets=_hf_secrets(), + volumes={ + "/root/.cache/huggingface": hf_cache_vol, + "/root/.cache/vllm": vllm_cache_vol, + }, +) +@modal.concurrent(max_inputs=64) +@modal.web_server(port=VLLM_PORT, startup_timeout=VLLM_SERVING_STARTUP, label=VLLM_SERVING_LABEL) +def serve() -> None: + import subprocess + + cmd = [ + "vllm", + "serve", + VLLM_SERVING_MODEL, + "--host", + "0.0.0.0", + "--port", + str(VLLM_PORT), + "--max-model-len", + str(VLLM_SERVING_MAXLEN), + "--disable-log-requests", + ] + subprocess.Popen(" ".join(cmd), shell=True) diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/sandbox.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/sandbox.py new file mode 100644 index 00000000..5e31f58c --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/sandbox.py @@ -0,0 +1,141 @@ +"""Per-instance shell sandbox for the agentic workload. + +Each SWE-bench instance runs the agent's shell commands in an isolated sandbox +rooted at the repository working directory. Two backends are supported: + +* ``docker`` – the SWE-bench per-instance testbed image + (``swebench/sweb.eval.x86_64.``) with the repo checked out at + ``/testbed``. This is the faithful backend used by the judge when a Docker + daemon is reachable; it is also what makes a real resolve-rate possible. +* ``local`` – a lightweight temporary directory. Used for fast public-test + feedback (and CI) where Docker-in-Docker is unavailable. Commands run on the + host filesystem inside the temp dir. 
+ +Both backends expose the same ``run(cmd) -> (exit_code, output)`` and +``read_patch()`` interface so the agent loop is backend-agnostic. +""" + +from __future__ import annotations + +import shutil +import subprocess +import tempfile +import uuid +from pathlib import Path + + +def docker_available() -> bool: + if shutil.which("docker") is None: + return False + try: + subprocess.run( + ["docker", "info"], + check=True, + capture_output=True, + timeout=20, + ) + return True + except Exception: + return False + + +def swebench_image(instance_id: str) -> str: + key = instance_id.lower().replace("__", "_1776_") + return f"docker.io/swebench/sweb.eval.x86_64.{key}:latest" + + +class Sandbox: + workdir = "/testbed" + + def run(self, command: str, *, timeout: int) -> tuple[int, str]: + raise NotImplementedError + + def read_patch(self) -> str: + raise NotImplementedError + + def close(self) -> None: + pass + + +class DockerSandbox(Sandbox): + def __init__(self, instance_id: str, *, command_timeout: int = 60) -> None: + self.instance_id = instance_id + self.command_timeout = command_timeout + self.container = f"vllm-serving-opt-{uuid.uuid4().hex[:12]}" + image = swebench_image(instance_id) + subprocess.run( + [ + "docker", + "run", + "-d", + "--name", + self.container, + "--network", + "none", + "-w", + self.workdir, + image, + "sleep", + "infinity", + ], + check=True, + capture_output=True, + text=True, + timeout=600, + ) + + def run(self, command: str, *, timeout: int) -> tuple[int, str]: + try: + proc = subprocess.run( + ["docker", "exec", "-w", self.workdir, self.container, "bash", "-lc", command], + capture_output=True, + text=True, + timeout=timeout, + ) + except subprocess.TimeoutExpired: + return 124, "command timed out" + return proc.returncode, (proc.stdout or "") + (proc.stderr or "") + + def read_patch(self) -> str: + code, out = self.run("git add -A && git diff --cached", timeout=self.command_timeout) + return out if code == 0 else "" + + def close(self) -> 
None: + subprocess.run(["docker", "rm", "-f", self.container], check=False, capture_output=True) + + +class LocalSandbox(Sandbox): + def __init__(self, instance_id: str) -> None: + self.instance_id = instance_id + self._dir = tempfile.mkdtemp(prefix="vllm-serving-opt-sandbox-") + self.workdir = self._dir + subprocess.run(["git", "init", "-q"], cwd=self._dir, check=False, capture_output=True) + + def run(self, command: str, *, timeout: int) -> tuple[int, str]: + try: + proc = subprocess.run( + ["bash", "-lc", command], + cwd=self._dir, + capture_output=True, + text=True, + timeout=timeout, + ) + except subprocess.TimeoutExpired: + return 124, "command timed out" + return proc.returncode, (proc.stdout or "") + (proc.stderr or "") + + def read_patch(self) -> str: + code, out = self.run("git add -A && git diff --cached", timeout=60) + return out if code == 0 else "" + + def close(self) -> None: + shutil.rmtree(self._dir, ignore_errors=True) + + +def make_sandbox(instance_id: str, *, prefer_docker: bool, command_timeout: int = 60) -> Sandbox: + if prefer_docker and docker_available(): + try: + return DockerSandbox(instance_id, command_timeout=command_timeout) + except Exception: + pass + return LocalSandbox(instance_id) diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/scoring.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/scoring.py new file mode 100644 index 00000000..586fdcbd --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/scoring.py @@ -0,0 +1,60 @@ +"""Scoring math shared with the agent-facing public test. + +The judge's authoritative scorer lives in the task evaluator; these helpers +mirror it so the public test can show a provisional score consistent with how +the judge will grade. Keep the two in sync. 
+""" + +from __future__ import annotations + +import math + + +def geometric_mean(values: list[float]) -> float: + if not values: + return 0.0 + return math.exp(sum(math.log(max(value, 1e-9)) for value in values) / len(values)) + + +def paired_speedups(baseline: dict[str, float], patched: dict[str, float]) -> list[float]: + speedups: list[float] = [] + for instance_id, patched_value in patched.items(): + base_value = baseline.get(instance_id) + if base_value is None or patched_value <= 0 or base_value <= 0: + continue + speedups.append(max(base_value / patched_value, 0.01)) + return speedups + + +def score_from_speedup(speedup: float) -> float: + if speedup <= 0: + return 0.0 + return max(0.0, min(100.0, 100.0 * math.log(speedup, 2))) + + +def accuracy_multiplier(baseline_accuracy: float, patched_accuracy: float, tolerance: float) -> float: + base = max(baseline_accuracy, 1e-9) + rel_drop = max(0.0, (baseline_accuracy - patched_accuracy) / base) + if rel_drop <= tolerance: + return 1.0 + return max(0.0, min(1.0, tolerance / rel_drop)) + + +def provisional_score( + baseline_latency: dict[str, float], + patched_latency: dict[str, float], + baseline_accuracy: float, + patched_accuracy: float, + tolerance: float, +) -> dict[str, float]: + speedups = paired_speedups(baseline_latency, patched_latency) + gm = geometric_mean(speedups) if speedups else 0.0 + latency_score = score_from_speedup(gm) + acc_mult = accuracy_multiplier(baseline_accuracy, patched_accuracy, tolerance) + return { + "latency_geomean_speedup": gm, + "latency_score": latency_score, + "accuracy_multiplier": acc_mult, + "score": max(0.0, min(100.0, latency_score * acc_mult)), + "instances_scored": float(len(speedups)), + } diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/serving.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/serving.py new file mode 100644 index 00000000..93d51f01 --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/serving.py @@ 
-0,0 +1,200 @@ +"""Deploy, health-check, and tear down a Modal-hosted vLLM server. + +The judge and the public test both build a Modal image from a vLLM source tree +and serve it on an L40S. This module wraps that lifecycle: + + deploy_server(...) -> ServerHandle(base_url, app_name) + wait_healthy(...) + stop_server(...) + +Deployment shells out to the ``modal`` CLI in a fresh process whose environment +selects the source tree / model / app name (see modal_app.py). The public URL +is then resolved through the Modal SDK. +""" + +from __future__ import annotations + +import os +import subprocess +import time +import urllib.error +import urllib.request +from dataclasses import dataclass +from pathlib import Path + +MODAL_APP_MODULE = str(Path(__file__).with_name("modal_app.py")) + +# Substrings that mark a transient Modal control-plane / build failure (image +# build evicted, app stopped mid-deploy, gateway timeout) rather than a real +# build error in the patched source. These are safe to retry. +_TRANSIENT_MODAL_MARKERS = ( + "external shut-down", + "terminated due to external", + "please try again", + "app_state_stopped", + "conflicterror", + "eat_timeout", + "deadline exceeded", + "connection reset", + "502 bad gateway", + "503 service", + "temporarily unavailable", + "timed out", +) + + +def _is_transient_modal_error(text: str) -> bool: + lowered = (text or "").lower() + return any(marker in lowered for marker in _TRANSIENT_MODAL_MARKERS) + + +@dataclass +class ServerHandle: + base_url: str # OpenAI base, e.g. 
https://...modal.run/v1 + app_name: str + label: str + + +class ServingError(RuntimeError): + pass + + +def _server_env( + *, + src_path: str, + model: str, + gpu: str, + app_name: str, + label: str, + scaledown_seconds: int, + startup_timeout_seconds: int, +) -> dict[str, str]: + env = dict(os.environ) + env.update( + { + "VLLM_SERVING_SRC": src_path, + "VLLM_SERVING_MODEL": model, + "VLLM_SERVING_GPU": gpu, + "VLLM_SERVING_APP": app_name, + "VLLM_SERVING_LABEL": label, + "VLLM_SERVING_SCALEDOWN": str(scaledown_seconds), + "VLLM_SERVING_STARTUP": str(startup_timeout_seconds), + } + ) + return env + + +def _resolve_web_url(app_name: str) -> str: + import modal + + fn = modal.Function.from_name(app_name, "serve") + url = fn.get_web_url() + if not url: + raise ServingError("deployed Modal function does not expose a web URL") + return url.rstrip("/") + + +def deploy_server( + *, + src_path: str, + model: str, + gpu: str, + app_name: str, + label: str, + scaledown_seconds: int, + startup_timeout_seconds: int, + build_timeout_seconds: int, + deploy_retries: int = 3, +) -> ServerHandle: + env = _server_env( + src_path=src_path, + model=model, + gpu=gpu, + app_name=app_name, + label=label, + scaledown_seconds=scaledown_seconds, + startup_timeout_seconds=startup_timeout_seconds, + ) + # Modal's control plane / image builder occasionally evicts a build under load + # (concurrent deploys, transient gateway errors). Those failures are unrelated + # to the patched source, so retry them with a short backoff; a genuine build + # error in the patch is non-transient and fails fast. 
+ attempts = max(1, deploy_retries) + last_error = "modal deploy failed" + for attempt in range(1, attempts + 1): + try: + subprocess.run( + ["modal", "deploy", MODAL_APP_MODULE], + env=env, + check=True, + capture_output=True, + text=True, + timeout=build_timeout_seconds, + ) + base = _resolve_web_url(app_name) + return ServerHandle(base_url=f"{base}/v1", app_name=app_name, label=label) + except subprocess.TimeoutExpired: + last_error = "modal deploy timed out" + transient = True + except subprocess.CalledProcessError as exc: + # Surface only a short, sanitized tail; build logs may contain paths. + tail = (exc.stderr or exc.stdout or "")[-600:] + last_error = f"modal deploy failed: {tail}" + transient = _is_transient_modal_error(tail) + except ServingError as exc: + # _resolve_web_url failed (app not fully registered yet) — treat as transient. + last_error = str(exc) + transient = True + + if not transient or attempt == attempts: + raise ServingError(last_error) + + # Clear any half-created/stopped app state, then back off before retrying. 
+ try: + subprocess.run( + ["modal", "app", "stop", app_name], + check=False, + capture_output=True, + text=True, + timeout=120, + ) + except Exception: + pass + time.sleep(min(45, 10 * attempt)) + + raise ServingError(last_error) + + +def wait_healthy(handle: ServerHandle, *, model: str, timeout_seconds: int) -> None: + """Block until the server answers /v1/models, or raise on timeout.""" + deadline = time.time() + timeout_seconds + models_url = f"{handle.base_url}/models" + last_error: Exception | None = None + while time.time() < deadline: + try: + req = urllib.request.Request(models_url, headers={"Authorization": "Bearer EMPTY"}) + with urllib.request.urlopen(req, timeout=10) as response: + if response.status == 200: + return + except urllib.error.HTTPError as exc: + if exc.code in (401, 403): + return # server is up; auth shape differs + last_error = exc + except Exception as exc: # noqa: BLE001 + last_error = exc + time.sleep(5) + raise ServingError(f"server did not become healthy within {timeout_seconds}s: {last_error}") + + +def stop_server(app_name: str) -> None: + try: + subprocess.run( + ["modal", "app", "stop", app_name], + check=False, + capture_output=True, + text=True, + timeout=120, + ) + except Exception: + # Best-effort teardown; idle containers also scale to zero on their own. + pass diff --git a/2.0/problems/vllm_llm_serving_optimization/serving_eval/settings.py b/2.0/problems/vllm_llm_serving_optimization/serving_eval/settings.py new file mode 100644 index 00000000..9cd45496 --- /dev/null +++ b/2.0/problems/vllm_llm_serving_optimization/serving_eval/settings.py @@ -0,0 +1,117 @@ +"""Configuration for the vLLM serving evaluation harness. + +A single :class:`EvalSettings` is built from the task ``evaluation`` config block +(passed in from the evaluator) with environment-variable fallbacks. The same +settings drive the judge-side measurement and the agent-side public test. 
+""" + +from __future__ import annotations + +import os +from dataclasses import dataclass, field +from typing import Any + + +def _as_int(value: Any, default: int) -> int: + try: + return int(value) + except Exception: + return default + + +def _as_float(value: Any, default: float) -> float: + try: + return float(value) + except Exception: + return default + + +@dataclass +class EvalSettings: + model: str = "meta-llama/Llama-3.1-8B-Instruct" + gpu: str = "L40S" + dataset: str = "princeton-nlp/SWE-bench_Verified" + dataset_split: str = "test" + public_slice: str = "0:5" + eval_slice: str = "0:30" + arrival_mode: str = "jps" + jps: float = 0.5 + workers: int = 8 + step_limit: int = 50 + temperature: float = 0.0 + max_completion_tokens: int = 2048 + accuracy_tolerance: float = 0.05 + agent_accuracy_mode: str = "patch_validity" + final_accuracy_mode: str = "resolve_rate" + # Docker Hub namespace for prebuilt SWE-bench eval images (real resolve_rate). + swebench_namespace: str = "swebench" + correctness_smoke_prompts: int = 8 + modal_scaledown_seconds: int = 900 + modal_startup_timeout_seconds: int = 1200 + modal_deploy_retries: int = 3 + server_health_timeout_seconds: int = 1800 + build_timeout_seconds: int = 5400 + instance_timeout_seconds: int = 1200 + extra: dict[str, Any] = field(default_factory=dict) + + @classmethod + def from_config(cls, config: dict[str, Any] | None) -> "EvalSettings": + config = dict(config or {}) + return cls( + model=str(config.get("model", cls.model)), + gpu=str(config.get("gpu", cls.gpu)), + dataset=str(config.get("dataset", cls.dataset)), + dataset_split=str(config.get("dataset_split", cls.dataset_split)), + public_slice=str(config.get("public_slice", cls.public_slice)), + eval_slice=str(config.get("eval_slice", cls.eval_slice)), + arrival_mode=str(config.get("arrival_mode", cls.arrival_mode)), + jps=_as_float(config.get("jps"), cls.jps), + workers=_as_int(config.get("workers"), cls.workers), + 
step_limit=_as_int(config.get("step_limit"), cls.step_limit), + temperature=_as_float(config.get("temperature"), cls.temperature), + max_completion_tokens=_as_int(config.get("max_completion_tokens"), cls.max_completion_tokens), + accuracy_tolerance=_as_float(config.get("accuracy_tolerance"), cls.accuracy_tolerance), + agent_accuracy_mode=str(config.get("agent_accuracy_mode", cls.agent_accuracy_mode)), + final_accuracy_mode=str(config.get("final_accuracy_mode", cls.final_accuracy_mode)), + swebench_namespace=str(config.get("swebench_namespace", cls.swebench_namespace)), + correctness_smoke_prompts=_as_int( + config.get("correctness_smoke_prompts"), cls.correctness_smoke_prompts + ), + modal_scaledown_seconds=_as_int(config.get("modal_scaledown_seconds"), cls.modal_scaledown_seconds), + modal_deploy_retries=_as_int(config.get("modal_deploy_retries"), cls.modal_deploy_retries), + modal_startup_timeout_seconds=_as_int( + config.get("modal_startup_timeout_seconds"), cls.modal_startup_timeout_seconds + ), + server_health_timeout_seconds=_as_int( + config.get("server_health_timeout_seconds"), cls.server_health_timeout_seconds + ), + build_timeout_seconds=_as_int(config.get("build_timeout_seconds"), cls.build_timeout_seconds), + instance_timeout_seconds=_as_int(config.get("instance_timeout_seconds"), cls.instance_timeout_seconds), + extra=config, + ) + + def slice_for_role(self, role: str) -> str: + return self.eval_slice if role == "final" else self.public_slice + + def accuracy_mode_for_role(self, role: str) -> str: + return self.final_accuracy_mode if role == "final" else self.agent_accuracy_mode + + +def parse_slice(spec: str, length: int) -> range: + """Parse a ``start:stop`` slice spec into a concrete index range.""" + spec = (spec or "").strip() + if not spec: + return range(length) + parts = spec.split(":") + try: + start = int(parts[0]) if parts[0] else 0 + stop = int(parts[1]) if len(parts) > 1 and parts[1] else length + except ValueError: + return range(length) 
+ start = max(0, min(start, length)) + stop = max(start, min(stop, length)) + return range(start, stop) + + +def modal_available() -> bool: + return bool(os.environ.get("MODAL_TOKEN_ID") and os.environ.get("MODAL_TOKEN_SECRET"))
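The clamping behaviour of `parse_slice` in `settings.py` is easy to misread, so here is a minimal sketch that exercises it, assuming the default `public_slice="0:5"` and `eval_slice="0:30"`. The function body is copied from the diff so the example runs standalone. `select_instances` and the `inst-N` ids are hypothetical names introduced only for illustration; they are not part of this change.

```python
def parse_slice(spec: str, length: int) -> range:
    """Copy of the helper from settings.py, reproduced so this sketch is standalone."""
    spec = (spec or "").strip()
    if not spec:
        return range(length)
    parts = spec.split(":")
    try:
        start = int(parts[0]) if parts[0] else 0
        stop = int(parts[1]) if len(parts) > 1 and parts[1] else length
    except ValueError:
        # Unparseable specs fall back to the whole dataset.
        return range(length)
    start = max(0, min(start, length))
    stop = max(start, min(stop, length))
    return range(start, stop)


def select_instances(instance_ids: list[str], spec: str) -> list[str]:
    # Hypothetical helper (not in the diff): apply a slice spec to a list of ids.
    return [instance_ids[i] for i in parse_slice(spec, len(instance_ids))]


# Hypothetical instance ids standing in for the SWE-bench dataset rows.
ids = [f"inst-{i}" for i in range(40)]
public = select_instances(ids, "0:5")   # default public_slice
final = select_instances(ids, "0:30")   # default eval_slice

# Edge cases: each spec is clamped to the dataset length rather than raising.
short = parse_slice("0:30", 10)      # stop clamped to the dataset length
inverted = parse_slice("7:2", 10)    # stop < start yields an empty range
negative = parse_slice("-3:", 10)    # negative start is clamped to 0
garbage = parse_slice("a:b", 4)      # unparseable spec selects everything
```

With the defaults, the public slice is a prefix of the final slice, which matches the README's description of the public test as a subset of the final eval set. Two behaviours may surprise callers: negative indices are clamped to 0 rather than counted from the end as in Python slicing, and a malformed spec silently selects the whole dataset instead of failing.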