Automated testing framework for evaluating local LLM coding agents with Ollama and OpenCode on a Tesla P40 (24GB VRAM).
- GPU: NVIDIA Tesla P40 (24GB VRAM)
- Server: Ubuntu (hostname: banana)
- Ollama: v0.15.1
Maximize coding agent performance from local models running essentially for free, using:
- OpenCode as the agent framework
- Ollama for model serving
- Prompt optimization techniques
| Model | Size | SWE-bench | Notes |
|---|---|---|---|
| devstral-small-2 | 15GB | 65.8% | Purpose-built for agentic coding |
| glm-4.7-flash | 19GB | 59.2% | Strongest in 30B class |
| qwen3-coder:30b | 19GB | - | Best for long context |
| rnj-1 (8B) | 5.1GB | 20.8% | Best small model |
| Model | Size |
|---|---|
| glm-4.7-flash:q4_K_M | 19 GB |
| qwen3-coder:30b | 18 GB |
| gpt-oss:20b | 13 GB |
| deepseek-coder-v2:16b | 8.9 GB |
| qwen3:8b | 5.0 GB |
| llama3:8b | 4.7 GB |
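The sizes above track a simple back-of-envelope rule: parameter count × bits per weight / 8, plus overhead for embeddings, metadata, and the runtime KV cache. A minimal sketch (the ~4.5 bits/weight figure for Q4_K_M is an approximation of its mixed 4-/6-bit blocks, not an exact spec):

```go
package main

import "fmt"

// estimateGB returns the approximate weight footprint in gigabytes for a
// model with paramsB billion parameters quantized at bitsPerWeight.
// It ignores KV cache and runtime overhead, so real usage runs higher.
func estimateGB(paramsB, bitsPerWeight float64) float64 {
	return paramsB * 1e9 * bitsPerWeight / 8 / 1e9
}

func main() {
	// ~30B model at Q4_K_M: about 17 GB of weights, consistent with
	// the ~18-19 GB listed above once embeddings/metadata are added.
	fmt.Printf("30B @ Q4_K_M: %.1f GB\n", estimateGB(30, 4.5))
	// 8B at ~5 bits/weight: about 5 GB, in line with qwen3:8b.
	fmt.Printf("8B  @ ~5-bit: %.1f GB\n", estimateGB(8, 5))
}
```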
A Go CLI tool (agent-bench) for automated evaluation of coding agents.
```
local-coding-agent-testing/
├── cmd/agent-bench/     # CLI entry point
├── internal/            # Go packages
│   ├── cli/             # Cobra commands
│   ├── config/          # Configuration loading
│   ├── task/            # Task handling
│   ├── runner/          # OpenCode execution
│   ├── eval/            # Evaluation engine
│   └── report/          # Report generation
├── config/              # Model and framework settings
├── tasks/               # Task definitions with prompts
├── projects/            # Sandbox projects for testing
├── results/             # Test outputs (gitignored)
└── reports/             # Generated comparison reports
```
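A task definition under `tasks/` could look like the following sketch. The field names here are illustrative, not the actual schema, and `projects/go-sample` is a hypothetical sandbox path:

```yaml
# tasks/cg-001.yaml — hypothetical task definition, not the real schema
id: cg-001
name: "Fix failing unit test"
project: projects/go-sample   # sandbox project copied fresh per run
prompt: |
  The test suite in this project fails. Find and fix the bug.
eval:
  - type: syntax              # compile / parse check
  - type: test                # run the project's test suite
timeout: 300                  # seconds before the run is aborted
```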
- Single binary: No runtime dependencies
- Headless execution: Uses `opencode run --format json`
- Multi-model comparison: Test identical tasks across models
- Automatic evaluation: Syntax checking, test execution
- Rich reports: HTML/JSON/Markdown output
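Because `opencode run --format json` emits machine-readable output, the runner can decode results directly. A minimal sketch, assuming a hypothetical result shape (check the actual OpenCode output schema before relying on these field names):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RunResult is a hypothetical shape for opencode's JSON output;
// the real schema from `opencode run --format json` may differ.
type RunResult struct {
	Model     string `json:"model"`
	Output    string `json:"output"`
	ToolCalls int    `json:"tool_calls"`
}

// parseResult decodes one run's JSON output into a RunResult.
func parseResult(raw []byte) (RunResult, error) {
	var r RunResult
	err := json.Unmarshal(raw, &r)
	return r, err
}

func main() {
	raw := []byte(`{"model":"devstral-small-2","output":"done","tool_calls":4}`)
	r, err := parseResult(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s used %d tool calls\n", r.Model, r.ToolCalls)
}
```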
```bash
# Build the CLI
make build

# Run a single task with one model
agent-bench run -t cg-001 -m devstral-small-2

# Run the full suite with multiple models
agent-bench run -s all -m devstral-small-2 -m glm-4.7-flash

# Generate a comparison report
agent-bench report --run latest --format html

# List available tasks
agent-bench list tasks
```

- Pull recommended models: devstral-small-2, rnj-1
- Create benchmark tasks: Real-world coding scenarios
- Test with OpenCode: Each model under identical conditions
- Optimize prompts: Per-model tuning for best tool use
- Document results: Performance, latency, quality
- MoE models excel: GLM-4.7-Flash and Qwen3-Coder use MoE (30B params, ~3B activated)
- Quantization required: Q4_K_M needed for 20B+ models on 24GB
- Tool calling is critical: Native tool/function calling support essential
- Devstral Small 2: Purpose-built for exactly this use case
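The MoE point is worth quantifying: weights for all 30B parameters must sit in VRAM, but per-token compute only touches the ~3B activated parameters, so throughput approaches that of a small dense model. A quick sanity check of the 24 GB budget (all figures are rough approximations from the tables above):

```go
package main

import "fmt"

func main() {
	const vramGB = 24.0 // Tesla P40
	weightsGB := 19.0   // glm-4.7-flash at Q4_K_M, per the table above

	// Whatever remains must hold the KV cache, activations, and driver
	// overhead, which caps usable context length.
	fmt.Printf("headroom after weights: %.1f GB\n", vramGB-weightsGB)

	// Fraction of parameters touched per token for a 30B MoE
	// with ~3B activated:
	fmt.Printf("activated fraction: %.0f%%\n", 3.0/30.0*100)
}
```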