This directory contains local Terminal-Bench experiments and Harbor adapter code for Open Agent SDK.
Run terminal-bench@2.0 reliably with Harbor, then inspect reproducible artifacts under jobs/.
- benchmark/terminalbench/open_agent_sdk_harbor/: Harbor agent adapter and install scripts.
- benchmark/terminalbench/test-tasks/: local hello-world style task.
- benchmark/terminalbench/jobs/: historical local benchmark outputs.
- docs/workflows/terminal-bench-harbor-runbook.md: extended troubleshooting notes.
From repo root:
pip install harbor
ln -sf "$(pwd)/benchmark/terminalbench/open_agent_sdk_harbor/agent.py" \
"$(python -c 'import harbor; print(harbor.__path__[0])')/agents/installed/open_agent_sdk.py"
set -a
source .env
set +a
Required env for MiniMax Anthropic-compatible mode:
- ANTHROPIC_API_KEY
- ANTHROPIC_BASE_URL
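A missing variable otherwise only shows up after Harbor has started a trial, so it can help to check the two required variables up front. The following is a minimal sketch; `require_env` is a hypothetical helper name, not something shipped in this repo.

```shell
# Hypothetical helper (not part of the repo): report every named variable
# that is unset or empty, and return non-zero if any of them is missing.
require_env() {
  oas_missing=0
  for oas_name in "$@"; do
    # Indirect lookup via eval; the names are supplied by the caller.
    eval "oas_value=\${$oas_name:-}"
    if [ -z "$oas_value" ]; then
      echo "missing required env var: $oas_name" >&2
      oas_missing=1
    fi
  done
  return "$oas_missing"
}
```

Calling `require_env ANTHROPIC_API_KEY ANTHROPIC_BASE_URL` after sourcing .env and before `harbor run` stops early with a readable message instead of a failed trial.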
env -u https_proxy -u http_proxy -u all_proxy \
-u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY \
harbor run -d terminal-bench@2.0 \
--env docker \
--agent-import-path "harbor.agents.installed.open_agent_sdk:OpenAgentSDKAgent" \
--model MiniMax-M2.5 \
--ae ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
--ae ANTHROPIC_BASE_URL="$ANTHROPIC_BASE_URL" \
--ae OAS_HARBOR_SAVE_TRAJECTORY=1 \
--task-name "fix-git" \
--n-concurrent 1 \
--timeout-multiplier 3.0 \
--override-memory-mb 4096
Notes:
- --override-memory-mb 4096 is for debugging stability, not leaderboard submission.
- Keep proxy vars unset for Harbor run if host proxy is 127.0.0.1.
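To see whether the `env -u ...` prefix is actually needed on a given host, one option is to list the proxy variables that currently point at a loopback address. This is a sketch only: `loopback_proxy_vars` is a hypothetical name, and matching `localhost` in addition to 127.0.0.1 is an assumption beyond what the note above states.

```shell
# Hypothetical helper: print the name of each proxy variable whose value
# points at a loopback address (127.0.0.1, or localhost as an assumption).
loopback_proxy_vars() {
  for oas_name in https_proxy http_proxy all_proxy HTTPS_PROXY HTTP_PROXY ALL_PROXY; do
    eval "oas_value=\${$oas_name:-}"
    case "$oas_value" in
      *127.0.0.1*|*localhost*) echo "$oas_name" ;;
    esac
  done
}
```

An empty result means no loopback proxy is set in the current shell; any printed name is a variable that the `env -u` prefix removes for the Harbor run.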
env -u https_proxy -u http_proxy -u all_proxy \
-u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY \
harbor run -d terminal-bench@2.0 \
--env docker \
--agent-import-path "harbor.agents.installed.open_agent_sdk:OpenAgentSDKAgent" \
--model MiniMax-M2.5 \
--ae ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
--ae ANTHROPIC_BASE_URL="$ANTHROPIC_BASE_URL" \
--n-concurrent 4
Use the provided runner to execute tasks sequentially while automatically cleaning old terminal-bench images between batches.
Scripts:
- benchmark/terminalbench/scripts/run-terminalbench-overnight.sh
- benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh
- benchmark/terminalbench/scripts/pack-local-tarballs.sh
To pre-install the local SDK/CLI build into cached task images and avoid repeated registry installs on every trial:
./benchmark/terminalbench/prewarm-images.sh \
--tasks-file benchmark/terminalbench/task-lists/smoke-5.txt \
--pack-local-tarballs
Notes:
- This builds fresh local tarballs, serves them temporarily on host.docker.internal, and bakes bun, oas, uv, and pytest into each task image.
- Override --tarball-host or --tarball-port if your Docker runtime cannot reach host.docker.internal:8765.
- After pre-warm, normal Harbor runs reuse the pre-installed oas fast path and do not need OAS_LOCAL_TARBALL_URL.
The overnight runner always sources the main workspace's .env (located via the git common dir), so it works correctly even when executed from a git worktree.
Example (sleep-safe smoke run):
chmod +x benchmark/terminalbench/scripts/*.sh
./benchmark/terminalbench/scripts/run-terminalbench-overnight.sh \
--tasks-file benchmark/terminalbench/task-lists/smoke-5.txt \
--batch-size 0 \
--task-repeats 1 \
--agent-timeout-multiplier 1.0
run-terminalbench-overnight.sh now defaults to --batch-size 0, so terminal-bench images are kept unless you explicitly opt into periodic cleanup.
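The effect of the batch size can be pictured with a dry-run sketch. This is not the runner's implementation: `plan_batches` is a hypothetical function, and it assumes the tasks file holds one task name per line. It only prints where a cleanup step would fall.

```shell
# Hypothetical dry run: print the order of task runs and cleanup steps
# for a given tasks file and batch size. A batch size of 0 never cleans up.
plan_batches() {
  oas_tasks_file=$1
  oas_batch_size=$2
  oas_count=0
  while IFS= read -r oas_task || [ -n "$oas_task" ]; do
    [ -n "$oas_task" ] || continue   # skip blank lines
    echo "run $oas_task"
    oas_count=$((oas_count + 1))
    if [ "$oas_batch_size" -gt 0 ] && [ $((oas_count % oas_batch_size)) -eq 0 ]; then
      echo "cleanup"
    fi
  done < "$oas_tasks_file"
}
```

With a batch size of 2, a `cleanup` line appears after every second task; with 0, the output contains only `run` lines, which matches the "images are kept" default described above.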
Image cleanup only (manual):
# Preview
./benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh --dry-run --keep 2
# Apply
./benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh --keep 2
latest="$(ls -1dt jobs/* | head -n 1)"
find "$latest" -maxdepth 5 -type f | \
grep -E 'result.json|return-code.txt|stdout.txt|stderr.txt|trial.log|open-agent-transcript'
Key files:
- jobs/<run>/result.json
- jobs/<run>/<trial>/result.json
- jobs/<run>/<trial>/agent/setup/stdout.txt
- jobs/<run>/<trial>/agent/command-0/return-code.txt
- jobs/<run>/<trial>/agent/command-0/stderr.txt
- jobs/<run>/<trial>/agent/open-agent-transcript/sessions-index.json
- jobs/<run>/<trial>/agent/open-agent-transcript/*.jsonl
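For a quick triage pass over a run, the per-trial return-code.txt files listed above can be scanned for non-zero values. The sketch below assumes each file contains only the integer exit code; `nonzero_return_codes` is a hypothetical helper name.

```shell
# Hypothetical helper: for one run directory, print "<trial> <code>" for
# every trial whose agent/command-0/return-code.txt is not 0.
nonzero_return_codes() {
  oas_run_dir=$1
  for oas_rc_file in "$oas_run_dir"/*/agent/command-0/return-code.txt; do
    [ -f "$oas_rc_file" ] || continue          # glob matched nothing
    oas_code=$(tr -d '[:space:]' < "$oas_rc_file")
    if [ "$oas_code" != "0" ]; then
      oas_trial=${oas_rc_file#"$oas_run_dir"/}  # strip the run dir prefix
      oas_trial=${oas_trial%%/*}                # keep only the trial name
      echo "$oas_trial $oas_code"
    fi
  done
}
```

Passing the `$latest` directory computed above, as in `nonzero_return_codes "$latest"`, narrows down which trials deserve a look at stderr.txt.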
- return code 137: container OOM kill. Increase memory only for debugging.
- Setup fails while installing CLI: adapter install script now has npm registry fallback (npmjs then npmmirror).
- MiniMax region mismatch:
  - api.minimaxi.com and api.minimax.io are different endpoint domains.
  - A key valid on one region endpoint may fail on the other.
- Daytona + MiniMax: ECONNRESET is currently observed from sandbox egress to MiniMax endpoints in this environment.
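The 137 in the first item follows the usual shell convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL), which is what the kernel OOM killer sends. A small sketch of that arithmetic, with `describe_exit_code` as a hypothetical helper name:

```shell
# Hypothetical helper: decode an exit status using the 128 + signal
# convention that shells apply to processes terminated by a signal.
describe_exit_code() {
  oas_code=$1
  if [ "$oas_code" -gt 128 ]; then
    echo "terminated by signal $((oas_code - 128))"
  else
    echo "exited with status $oas_code"
  fi
}
```

`describe_exit_code 137` prints `terminated by signal 9`. SIGKILL alone does not prove an OOM kill; while the container still exists, `docker inspect --format '{{.State.OOMKilled}}' <container>` is one way to confirm it.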
The Daytona environment can start tasks, but MiniMax calls from the sandbox showed repeated ECONNRESET errors during this debugging cycle. Use Docker for stable MiniMax runs until the Daytona network path is fixed.
- benchmark/terminalbench/open_agent_sdk_harbor/README.md
- docs/workflows/terminal-bench-harbor-runbook.md
- docs/research/harbor-137-debugging-handoff.md