Terminal-Bench Run Guide

This directory contains local Terminal-Bench experiments and Harbor adapter code for Open Agent SDK.

Goal

Run terminal-bench@2.0 reliably with Harbor, then inspect reproducible artifacts under jobs/.

Directory Layout

  • benchmark/terminalbench/open_agent_sdk_harbor/: Harbor agent adapter and install scripts.
  • benchmark/terminalbench/test-tasks/: local hello-world style task.
  • benchmark/terminalbench/jobs/: historical local benchmark outputs.
  • docs/workflows/terminal-bench-harbor-runbook.md: extended troubleshooting notes.

Setup

From repo root:

pip install harbor

ln -sf "$(pwd)/benchmark/terminalbench/open_agent_sdk_harbor/agent.py" \
  "$(python -c 'import harbor; print(harbor.__path__[0])')/agents/installed/open_agent_sdk.py"

set -a
source .env
set +a

Required environment variables for MiniMax Anthropic-compatible mode:

  • ANTHROPIC_API_KEY
  • ANTHROPIC_BASE_URL
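
A quick pre-flight check can catch a missing variable before a long run starts. The following is a minimal sketch, not part of the repository; `require_env` is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper (not part of this repo): report which of the named
# environment variables are unset or empty. Prints each missing name to
# stderr and returns non-zero if any are missing.
require_env() {
  local missing=0 name value
  for name in "$@"; do
    # Indirect lookup of the variable named in $name. The names are
    # hard-coded by the caller, so eval is acceptable in this sketch.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required env: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

After sourcing .env, calling `require_env ANTHROPIC_API_KEY ANTHROPIC_BASE_URL || exit 1` would stop the shell script early instead of failing inside a trial.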

Recommended Smoke Test (Docker)

env -u https_proxy -u http_proxy -u all_proxy \
    -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY \
harbor run -d terminal-bench@2.0 \
  --env docker \
  --agent-import-path "harbor.agents.installed.open_agent_sdk:OpenAgentSDKAgent" \
  --model MiniMax-M2.5 \
  --ae ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  --ae ANTHROPIC_BASE_URL="$ANTHROPIC_BASE_URL" \
  --ae OAS_HARBOR_SAVE_TRAJECTORY=1 \
  --task-name "fix-git" \
  --n-concurrent 1 \
  --timeout-multiplier 3.0 \
  --override-memory-mb 4096

Notes:

  • --override-memory-mb 4096 is for debugging stability, not leaderboard submission.
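  • Keep the proxy variables unset for the Harbor run if the host proxy listens on 127.0.0.1, since that loopback address is not the host when seen from inside a container.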
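
To see whether the proxy note applies on a given machine, the six proxy variables can be inspected for loopback values. This is a sketch under the assumption that the proxy URL has the usual `scheme://host:port` shape; both function names are hypothetical and do not exist in the repository.

```shell
# Hypothetical helper: return 0 when a proxy URL points at a loopback host.
# Only the common 127.x.x.x and localhost spellings are recognized here.
is_loopback_proxy() {
  case "$1" in
    *://127.*|*://localhost*|127.*|localhost*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print the name of every proxy variable that is set
# to a loopback address, one per line.
loopback_proxy_vars() {
  local name value
  for name in https_proxy http_proxy all_proxy HTTPS_PROXY HTTP_PROXY ALL_PROXY; do
    eval "value=\${$name:-}"
    if [ -n "$value" ] && is_loopback_proxy "$value"; then
      echo "$name"
    fi
  done
}
```

If `loopback_proxy_vars` prints anything, the `env -u ...` prefix used in the commands above is the relevant workaround.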

Batch Run (Docker)

env -u https_proxy -u http_proxy -u all_proxy \
    -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY \
harbor run -d terminal-bench@2.0 \
  --env docker \
  --agent-import-path "harbor.agents.installed.open_agent_sdk:OpenAgentSDKAgent" \
  --model MiniMax-M2.5 \
  --ae ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  --ae ANTHROPIC_BASE_URL="$ANTHROPIC_BASE_URL" \
  --n-concurrent 4

Overnight Automation (Low Disk)

Use the provided runner to execute tasks sequentially while automatically cleaning old terminal-bench images between batches.

Scripts:

  • benchmark/terminalbench/scripts/run-terminalbench-overnight.sh
  • benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh
  • benchmark/terminalbench/scripts/pack-local-tarballs.sh

To pre-install the local SDK/CLI build into cached task images and avoid repeated registry installs on every trial:

./benchmark/terminalbench/prewarm-images.sh \
  --tasks-file benchmark/terminalbench/task-lists/smoke-5.txt \
  --pack-local-tarballs

Notes:

  • This builds fresh local tarballs, serves them temporarily on host.docker.internal, and bakes bun, oas, uv, and pytest into each task image.
  • Override --tarball-host or --tarball-port if your Docker runtime cannot reach host.docker.internal:8765.
  • After pre-warm, normal Harbor runs reuse the pre-installed oas fast path and do not need OAS_LOCAL_TARBALL_URL.

The overnight runner always sources the main workspace's .env (located via the git common dir), so it works correctly even when executed from a git worktree.
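
The runner's own implementation is not reproduced here. As a rough sketch of the idea, `git rev-parse --git-common-dir` prints the shared .git directory even from a linked worktree, and the main workspace root is its parent. The helper name below is hypothetical, and the sketch assumes a regular (non-bare) repository whose .git directory sits directly in the workspace root.

```shell
# Hypothetical helper: given the path printed by
# `git rev-parse --git-common-dir`, print the path of the .env file in the
# directory that contains it (assumed to be the main workspace root).
env_file_for_common_dir() {
  echo "$(dirname "$1")/.env"
}

# Usage, from inside any worktree of the repository:
#   env_file="$(env_file_for_common_dir "$(git rev-parse --git-common-dir)")"
#   set -a; . "$env_file"; set +a
```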

Example (sleep-safe smoke run):

chmod +x benchmark/terminalbench/scripts/*.sh

./benchmark/terminalbench/scripts/run-terminalbench-overnight.sh \
  --tasks-file benchmark/terminalbench/task-lists/smoke-5.txt \
  --batch-size 0 \
  --task-repeats 1 \
  --agent-timeout-multiplier 1.0

run-terminalbench-overnight.sh now defaults to --batch-size 0, so terminal-bench images are kept unless you explicitly opt into periodic cleanup.

Image cleanup only (manual):

# Preview
./benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh --dry-run --keep 2

# Apply
./benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh --keep 2

Where to Check Results

latest="$(ls -1dt jobs/* | head -n 1)"
find "$latest" -maxdepth 5 -type f | \
  grep -E 'result\.json|return-code\.txt|stdout\.txt|stderr\.txt|trial\.log|open-agent-transcript'

Key files:

  • jobs/<run>/result.json
  • jobs/<run>/<trial>/result.json
  • jobs/<run>/<trial>/agent/setup/stdout.txt
  • jobs/<run>/<trial>/agent/command-0/return-code.txt
  • jobs/<run>/<trial>/agent/command-0/stderr.txt
  • jobs/<run>/<trial>/agent/open-agent-transcript/sessions-index.json
  • jobs/<run>/<trial>/agent/open-agent-transcript/*.jsonl
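
For a run with many trials, scanning the return-code files listed above is often the fastest first pass. The sketch below assumes each return-code.txt contains only the integer exit code; `failed_trials` is a hypothetical helper name, not a script in this repository.

```shell
# Hypothetical helper: for a run directory laid out as
# <run>/<trial>/agent/command-0/return-code.txt, print one tab-separated
# line per trial whose agent command exited non-zero.
failed_trials() {
  local run_dir="$1" f code trial
  for f in "$run_dir"/*/agent/command-0/return-code.txt; do
    # Skip the literal pattern when the glob matches nothing.
    [ -f "$f" ] || continue
    code="$(tr -d '[:space:]' < "$f")"
    if [ "$code" = "0" ]; then
      continue
    fi
    # Strip the run directory prefix, then keep the first path component.
    trial="${f#"$run_dir"/}"
    trial="${trial%%/*}"
    if [ "$code" = "137" ]; then
      # 137 = 128 + 9, i.e. the process received SIGKILL.
      printf '%s\t%s\t(SIGKILL, often OOM)\n' "$trial" "$code"
    else
      printf '%s\t%s\n' "$trial" "$code"
    fi
  done
}
```

With `latest` set as in the snippet above, `failed_trials "$latest"` lists only the trials worth opening stderr.txt for.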

Known Pitfalls

  • return code 137: the process was killed with SIGKILL (128 + 9), which here means a container OOM kill. Increase memory only for debugging.
  • Setup fails while installing the CLI: the adapter install script now falls back across npm registries (npmjs first, then npmmirror).
  • MiniMax region mismatch:
    • api.minimaxi.com and api.minimax.io are different endpoint domains.
    • A key valid on one region endpoint may fail on the other.
  • Daytona + MiniMax: ECONNRESET is currently observed on sandbox egress to MiniMax endpoints in this environment.
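
Because a key may only be valid on one of the two endpoint domains, it can help to print which domain the configured base URL actually targets before starting a run. This is a minimal sketch; both function names are hypothetical, and the URL parsing handles only the plain `scheme://host[:port]/path` form.

```shell
# Hypothetical helper: print the host part of a URL.
url_host() {
  local rest="${1#*://}"   # drop the scheme
  rest="${rest%%/*}"       # drop the path
  rest="${rest%%:*}"       # drop the port
  echo "$rest"
}

# Hypothetical helper: print which of the two MiniMax endpoint domains
# named in this guide a base URL targets, or "unknown" for anything else.
minimax_endpoint_domain() {
  case "$(url_host "$1")" in
    api.minimaxi.com) echo "api.minimaxi.com" ;;
    api.minimax.io)   echo "api.minimax.io" ;;
    *)                echo "unknown" ;;
  esac
}
```

Running `minimax_endpoint_domain "$ANTHROPIC_BASE_URL"` and comparing the result with the domain the key was issued for is a cheap way to rule out the region mismatch.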

Daytona (Current Status)

The Daytona environment can start tasks, but MiniMax calls from the sandbox showed repeated ECONNRESET errors during this debugging cycle. Use Docker for stable MiniMax runs until the Daytona network path is fixed.

Related Docs

  • benchmark/terminalbench/open_agent_sdk_harbor/README.md
  • docs/workflows/terminal-bench-harbor-runbook.md
  • docs/research/harbor-137-debugging-handoff.md