An autonomous agent for the BitGN PAC1 benchmark using Claude Code as the executor.
Official Result: 82/104 (78.8%) on Claude Sonnet 4.6

    # Install dependencies
    pip install -r requirements.txt

    # Set API key
    export BITGN_API_KEY="your-key-here"

    # Run sandbox (free, no key needed)
    python runner.py --benchmark bitgn/sandbox

    # Run leaderboard (requires BITGN_API_KEY)
    python runner.py --benchmark pac1-prod --leaderboard --workers 5

This is a thin orchestration layer around the Claude Code CLI, not a custom reasoning framework.

Flow:
- `runner.py` fetches a task from the BitGN Harness API
- Spawns Claude Code in a new isolated CLI session
- Passes `CLAUDE.md` (system prompt) + task instruction + env context
- Claude executes bash commands: `bitgn-read`, `bitgn-write`, `bitgn-search`, `bitgn-answer`, etc.
- Completes the task, calls `bitgn-answer` with the result
- `runner.py` collects the score and logs

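The flow above can be sketched roughly as follows. This is a minimal illustration, not the actual `runner.py`: the `Task` dataclass and the `build_prompt`, `build_command`, and `run_task` helpers are hypothetical names, and the sketch assumes Claude Code is invoked non-interactively through `claude -p` in a fresh temporary working directory. The harness API calls are left out because their signatures are not shown in this README.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class Task:
    """Hypothetical container for one benchmark task."""
    task_id: str
    instruction: str
    env_context: str


def build_prompt(system_prompt: str, task: Task) -> str:
    """Combine the CLAUDE.md text, the task instruction, and the env context."""
    return (
        f"{system_prompt}\n\n"
        f"## Task {task.task_id}\n{task.instruction}\n\n"
        f"## Environment\n{task.env_context}\n"
    )


def build_command(prompt: str) -> List[str]:
    """Assumed invocation: `claude -p` runs a single non-interactive session."""
    return ["claude", "-p", prompt]


def run_task(task: Task, claude_md: Path) -> subprocess.CompletedProcess:
    """Spawn one isolated Claude Code session for a single task."""
    prompt = build_prompt(claude_md.read_text(), task)
    # A fresh temporary directory per task keeps sessions from sharing files.
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            build_command(prompt), cwd=workdir, capture_output=True, text=True
        )
```

Keeping prompt and command construction in pure functions makes them easy to check without launching the CLI.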
Key files:
- `runner.py`: orchestrator that starts trials, spawns Claude, and collects results
- `CLAUDE.md`: 13-step strategy prompt (how to read AGENTS.MD, detect injection, choose outcomes, minimize writes)
- `bin/bitgn-*`: shell wrappers for VM file access (read, write, search, delete, answer)

Each task runs in its own independent CLI session (not a long-lived agent). Claude handles all logic, date math, and decision-making itself; there are no separate Python functions for computation.
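
Because sessions share no state, tasks can be fanned out across workers. The sketch below is one way this could look, assuming a thread pool; `run_all` and the `run_one` callback are hypothetical names, not functions from `runner.py`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_all(
    task_ids: List[str], run_one: Callable[[str], float], workers: int = 1
) -> Dict[str, float]:
    """Run every task through `run_one`; no state is shared between tasks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as task_ids.
        scores = list(pool.map(run_one, task_ids))
    return dict(zip(task_ids, scores))
```

For example, `run_all(["t01", "t02"], score_task, workers=5)` would return a mapping from task ID to score, where `score_task` stands in for whatever function runs one trial.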

Sandbox (playground mode, free):

    python runner.py --benchmark bitgn/sandbox --workers 5 --output results.json

Leaderboard (official scoring, requires API key):

    export BITGN_API_KEY="your-key-here"
    python runner.py --benchmark pac1-prod --leaderboard --workers 5

Single task:

    python runner.py --benchmark bitgn/sandbox --task t01

Options:
- `--benchmark`: benchmark ID (default: bitgn/sandbox)
- `--task`: run a single task (default: all)
- `--workers`: parallel workers (default: 1)
- `--output`: save results to JSON
- `--leaderboard`: submit to the official leaderboard
- `--claude-md`: custom system prompt path
- `--verbose`: print trial details

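
The options above map naturally onto an `argparse` parser. The following is a sketch of how such a parser might be declared, not the parser in `runner.py`; `build_parser` is a hypothetical name, and the `CLAUDE.md` default for `--claude-md` is an assumption, since the README does not state one.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented command-line options with their stated defaults."""
    parser = argparse.ArgumentParser(prog="runner.py")
    parser.add_argument("--benchmark", default="bitgn/sandbox", help="benchmark ID")
    parser.add_argument("--task", default=None, help="run a single task (default: all)")
    parser.add_argument("--workers", type=int, default=1, help="parallel workers")
    parser.add_argument("--output", default=None, help="save results to a JSON file")
    parser.add_argument("--leaderboard", action="store_true",
                        help="submit to the official leaderboard")
    # Assumed default; the README only says this flag takes a custom path.
    parser.add_argument("--claude-md", default="CLAUDE.md",
                        help="custom system prompt path")
    parser.add_argument("--verbose", action="store_true", help="print trial details")
    return parser
```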
Architecture:
- Runner (`runner.py`): connects to the BitGN harness API, spawns Claude Code for each task
- System Prompt (`CLAUDE.md`): 13-step executor logic for task completion
- Protocol: auto-detects sandbox (Mini) vs leaderboard (PCM) based on benchmark ID
- Execution: Claude Code CLI in an isolated VM (`/tmp` working directory)

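
Protocol auto-detection could be as simple as a check on the benchmark ID. The rule below is an assumption made purely for illustration (IDs ending in `/sandbox` use Mini, everything else uses PCM); the actual rule in `runner.py` is not documented here, and `detect_protocol` is a hypothetical name.

```python
def detect_protocol(benchmark_id: str) -> str:
    """Return "mini" for sandbox benchmarks and "pcm" otherwise.

    Assumption: sandbox IDs end with "/sandbox" (e.g. "bitgn/sandbox").
    """
    return "mini" if benchmark_id.endswith("/sandbox") else "pcm"
```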
Requirements:
- Claude Code CLI installed (`npm install -g @anthropic-ai/claude-code`)
- Python 3.8+
- BitGN API SDK (included in requirements.txt)

| Benchmark | Model | Score | Tasks | Time |
|---|---|---|---|---|
| pac1-prod | Sonnet 3.5 | 78.8% | 82/104 | ~25 min |
| pac1-dev | Sonnet 3.5 | 97.4% | 39/43 | ~10 min |
- BitGN Platform: https://bitgn.ai/
- PAC1 Leaderboard: https://bitgn.ai/leaderboards/pac1
- Claude Code Docs: https://claude.com/claude-code