Docker-based test harness for evaluating coding agents in isolated containers.
Each test case is a YAML config. A single run_test.sh orchestrates: parse config, build image, run container with mitmproxy + agent, user interacts, then extract artifacts (diff, plan, flow file).
- Docker
- Python 3.10+ with pip
- Bash 4+
Install host-side dependencies:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# Edit .env and set your Anthropic API key

pip install -r requirements.txt
cp .env.example .env && vim .env # set ANTHROPIC_API_KEY
./run_test.sh tasks/example-test.yaml --agent cc --settings settings/vix/

For single-agent configs (using the agent: key), the --agent flag is optional:

./run_test.sh tasks/single-agent-test.yaml

Test configs live in tasks/ as YAML files.
test_id: my-test                       # Required. Unique identifier.
description: "What to do"              # Optional.
repository:
  enabled: true                        # Default: true
  url: https://github.com/...          # Repo to clone
  commit: ""                           # Optional. Specific commit to check out.
proxy:
  enabled: true                        # Default: true
  listen_host: "127.0.0.1"             # Default. Internal proxy bind address.
  listen_port: 58000                   # Default. Proxy port inside container.
  web_port: 8099                       # Default. mitmweb UI port (published to host).
  target: "https://api.anthropic.com"  # Default.
  flow_filename: "cc.flow"             # Default.
agent:
  type: claude-code                    # Default.
  version: latest                      # Default.
environment:
  extra_env:                           # Optional. Additional env vars.
    MY_VAR: "value"
    FROM_HOST: "${HOST_VAR}"           # Interpolated from host environment.

Multi-agent configs define several agents under the agents: key:

test_id: my-test
description: "What to do"
repository:
  url: https://github.com/...
agents:
  claude-code:
    type: claude-code
    version: latest
  vix:
    type: vix
    version: latest

Use --agent <name> to select which agent to run:

./run_test.sh --agent vix tasks/my-test.yaml

The agent type maps to a Dockerfile: docker/Dockerfile.<type>. You cannot specify both agent: and agents: in the same config.
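The config rules above (agent: versus agents:, the optional --agent flag, the type-to-Dockerfile mapping, and ${VAR} interpolation) can be sketched in Python. This is only an illustration, not the harness's actual code. The function names are hypothetical. Leaving unknown ${NAME} references untouched, requiring --agent for multi-agent configs, and using the agent type as the name in single-agent configs are all assumptions.

```python
import re
from typing import Mapping, Optional

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate_env(value: str, host_env: Mapping[str, str]) -> str:
    """Replace ${NAME} with host_env[NAME]; unknown names stay as-is (assumption)."""
    return _VAR.sub(lambda m: host_env.get(m.group(1), m.group(0)), value)


def select_agent(config: dict, requested: Optional[str] = None) -> tuple[str, dict]:
    """Return (agent_name, agent_settings) for a single- or multi-agent config."""
    if "agent" in config and "agents" in config:
        raise ValueError("config may define 'agent' or 'agents', not both")
    if "agents" in config:
        # Assumption: a multi-agent config cannot run without --agent.
        if requested is None:
            raise ValueError("--agent is required for multi-agent configs")
        if requested not in config["agents"]:
            raise ValueError(f"unknown agent: {requested}")
        return requested, config["agents"][requested]
    # Single-agent config: --agent is optional and the documented defaults apply.
    settings = {"type": "claude-code", "version": "latest", **(config.get("agent") or {})}
    # Assumption: with no --agent given, the agent type doubles as its name.
    return requested or settings["type"], settings


def dockerfile_for(agent_type: str) -> str:
    """Map an agent type to its Dockerfile path."""
    return f"docker/Dockerfile.{agent_type}"
```

For example, select_agent(config, "vix") on the multi-agent config above returns the vix settings, and dockerfile_for("vix") gives docker/Dockerfile.vix.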
After a test run, artifacts are saved to tasks/<task_dir>/results/<agent_name>/<timestamp>/:
| File | Description |
|---|---|
| changes.diff | Combined staged + unstaged + untracked changes |
| plan.md | Agent's plan file (from .claude/plans/) |
| plans/ | Raw plan files as written by the agent |
| cc.flow | mitmproxy flow file (API traffic capture) |
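To check a results directory against the table above, a small helper along these lines may be handy. It is a sketch, not part of the harness: artifact_report is a hypothetical name, and the only file names it knows are the ones listed in the table, with cc.flow being the default proxy.flow_filename.

```python
from pathlib import Path


def artifact_report(results_dir: Path, flow_filename: str = "cc.flow") -> dict[str, bool]:
    """Report which of the documented artifacts exist in a results directory."""
    names = ("changes.diff", "plan.md", "plans", flow_filename)
    return {name: (results_dir / name).exists() for name in names}
```

Call it with the path of one run, for example artifact_report(Path("tasks/<task_dir>/results/<agent_name>/<timestamp>")) with the placeholders filled in.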
When you exit the interactive shell (Ctrl-D or exit), the following steps run automatically:
- Diff generation (inside container): All changes the agent made to the cloned repo (staged, unstaged, and untracked files) are captured into /output/changes.diff. The repo itself is ephemeral and discarded with the container.
- Container stops: The --rm flag removes the container. Only the /output volume (mapped to tasks/<task_dir>/results/<agent_name>/<timestamp>/) survives.
- Artifact summary (on host): extract_artifacts.sh runs and prints a summary of what was captured: diff stats, plan files, and flow file size.
Plans and the flow file are written directly to /output during the session (via symlinks and mitmproxy's --save-stream-file), so they are available immediately.
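The diff stats in the artifact summary can be approximated with a few lines of Python. This sketch assumes changes.diff is a git-style unified diff (each file introduced by a "diff --git" line), which the source does not state. The real extract_artifacts.sh may compute its stats differently, and summarize_diff is a hypothetical name.

```python
def summarize_diff(diff_text: str) -> dict[str, int]:
    """Count changed files, inserted lines, and deleted lines in a git-style diff."""
    files = insertions = deletions = 0
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            files += 1
            in_hunk = False  # the ---/+++ header lines that follow are not changes
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and line.startswith("+"):
            insertions += 1
        elif in_hunk and line.startswith("-"):
            deletions += 1
    return {"files": files, "insertions": insertions, "deletions": deletions}
```

Tracking whether the parser is inside a hunk keeps the "--- a/..." and "+++ b/..." header lines from being counted as deletions and insertions.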
run_test.sh
├── scripts/parse_config.py # YAML → .env key=value file
├── docker/Dockerfile.<type> # Agent-specific Dockerfiles
│ ├── Dockerfile.claude-code # node:20-bookworm + mitmproxy + claude-code
│ └── Dockerfile.vix # golang:1.22-bookworm + mitmproxy + vix
├── docker/entrypoint.sh # Clone repo, start proxy, launch shell, generate diff on exit
├── scripts/wait_for_proxy.sh # Poll mitmweb until ready
└── scripts/extract_artifacts.sh # Post-run: summarize and copy plan artifacts
Each agent type has its own Dockerfile at docker/Dockerfile.<type> with all runtime dependencies and the agent itself baked in.
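The "YAML → .env key=value" step performed by scripts/parse_config.py can be pictured as flattening a nested mapping. The sketch below works on an already-parsed dict so that it needs no YAML library. The key naming scheme (upper-case, sections joined with underscores) is an assumption and may not match what parse_config.py actually emits. to_env_lines is a hypothetical name.

```python
def to_env_lines(config: dict, prefix: str = "") -> list[str]:
    """Flatten a nested config dict into KEY=value lines (naming is an assumption)."""
    lines: list[str] = []
    for key, value in config.items():
        name = f"{prefix}{key}".upper().replace("-", "_")
        if isinstance(value, dict):
            # Nested section: recurse with the section name as the prefix.
            lines.extend(to_env_lines(value, prefix=f"{name}_"))
        elif isinstance(value, bool):
            lines.append(f"{name}={'true' if value else 'false'}")
        else:
            lines.append(f"{name}={value}")
    return lines
```

With this scheme, proxy.listen_port: 58000 would become PROXY_LISTEN_PORT=58000.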
- Isolation: Each test runs in a fresh Docker container. The cloned repo lives only inside the container.
- Proxy: mitmproxy runs as a reverse proxy to Anthropic's API, capturing all traffic.
- Artifacts only: Only the diff, plans, and flow file are persisted to the host — no repo checkout on disk.
- Security: ANTHROPIC_API_KEY is passed via -e, never written to disk in env files.
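The design points above translate roughly into two command lines: one for mitmweb inside the container, one for docker run on the host. The sketch below only builds the argument lists and does not reflect the exact commands in run_test.sh or entrypoint.sh. The function names and the image name are hypothetical. Passing -e with the variable name alone and running with -it are assumptions.

```python
def mitmweb_args(proxy: dict, output_dir: str = "/output") -> list[str]:
    """Build a mitmweb reverse-proxy command from the proxy: config section."""
    return [
        "mitmweb",
        "--mode", f"reverse:{proxy.get('target', 'https://api.anthropic.com')}",
        "--listen-host", proxy.get("listen_host", "127.0.0.1"),
        "--listen-port", str(proxy.get("listen_port", 58000)),
        "--web-port", str(proxy.get("web_port", 8099)),
        # Stream flows straight to the mounted volume so they survive the container.
        "--save-stream-file", f"{output_dir}/{proxy.get('flow_filename', 'cc.flow')}",
    ]


def docker_run_args(image: str, results_dir: str, web_port: int = 8099) -> list[str]:
    """Build a docker run command; results_dir should be an absolute host path."""
    return [
        "docker", "run", "--rm", "-it",
        # Name only, no value: Docker copies the variable from the host
        # environment, so the key is never written into an env file.
        "-e", "ANTHROPIC_API_KEY",
        "-v", f"{results_dir}:/output",
        "-p", f"{web_port}:{web_port}",
        image,
    ]
```

The lists can be handed to subprocess.run as they are, which avoids shell quoting issues.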
While a test is running with proxy.enabled: true (the default), mitmweb's web interface is accessible from the host at:
http://localhost:8099
The UI lets you inspect all API traffic between the agent and Anthropic in real time — requests, responses, headers, and bodies. The port is configurable via proxy.web_port in the test config.
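Before opening the UI from a script, it can help to wait until mitmweb answers, much as scripts/wait_for_proxy.sh polls it. The following is a Python sketch of that polling idea, not a port of the shell script. The function names, timeout, and interval are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ready(url: str, timeout: float = 1.0) -> bool:
    """Return True if the URL answers at all, even with an HTTP error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded, so it is up
    except (urllib.error.URLError, OSError):
        return False


def wait_until(probe: Callable[[], bool], timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Call probe repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the default port, the call would be wait_until(lambda: http_ready("http://localhost:8099")).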
For type: vix agents, vixd starts automatically in the background when the container launches. Its working directory is /workspace (the cloned repo).
Logs are written to /output/vixd.log and persisted to the host alongside other artifacts:
# Inside container
tail -f /output/vixd.log
# From host
tail -f tasks/<task_dir>/results/<agent_name>/<timestamp>/vixd.log

The daemon PID and log path are shown in the container banner at startup.