Automated testing framework for evaluating local LLM coding agents with Ollama and OpenCode on a Tesla P40 (24GB VRAM).
- GPU: NVIDIA Tesla P40 (24GB VRAM)
- Server: Ubuntu (hostname: banana)
- Ollama: v0.15.1
Maximize coding agent performance from local models running essentially for free, using:
- OpenCode as the agent framework
- Ollama for model serving
- Prompt optimization techniques
| Model | Size | SWE-bench | Notes |
|---|---|---|---|
| devstral-small-2 | 15GB | 65.8% | Purpose-built for agentic coding |
| glm-4.7-flash | 19GB | 59.2% | Strongest in 30B class |
| qwen3-coder:30b | 19GB | - | Best for long context |
| rnj-1 (8B) | 5.1GB | 20.8% | Best small model |
| Model | Size |
|---|---|
| glm-4.7-flash:q4_K_M | 19 GB |
| qwen3-coder:30b | 18 GB |
| gpt-oss:20b | 13 GB |
| deepseek-coder-v2:16b | 8.9 GB |
| qwen3:8b | 5.0 GB |
| llama3:8b | 4.7 GB |
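The sizes above track a simple back-of-envelope rule: parameter count × bits per weight / 8, plus overhead for embeddings, metadata, and the runtime KV cache. A minimal sketch (the ~4.5 bits/weight figure for Q4_K_M is an approximation of its mixed 4-/6-bit blocks, not an exact spec):

```go
package main

import "fmt"

// estimateGB returns the approximate weight footprint in gigabytes for a
// model with paramsB billion parameters quantized at bitsPerWeight.
// It ignores KV cache and runtime overhead, so real usage runs higher.
func estimateGB(paramsB, bitsPerWeight float64) float64 {
	return paramsB * 1e9 * bitsPerWeight / 8 / 1e9
}

func main() {
	// ~30B model at Q4_K_M: about 17 GB of weights, consistent with
	// the ~18-19 GB listed above once embeddings/metadata are added.
	fmt.Printf("30B @ Q4_K_M: %.1f GB\n", estimateGB(30, 4.5))
	// 8B at ~5 bits/weight: about 5 GB, in line with qwen3:8b.
	fmt.Printf("8B  @ ~5-bit: %.1f GB\n", estimateGB(8, 5))
}
```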
A Go CLI tool (agent-bench) for automated evaluation of coding agents.
```
local-coding-agent-testing/
├── cmd/agent-bench/     # CLI entry point
├── internal/            # Go packages
│   ├── cli/             # Cobra commands
│   ├── config/          # Configuration loading
│   ├── task/            # Task handling
│   ├── runner/          # OpenCode execution
│   ├── eval/            # Evaluation engine
│   └── report/          # Report generation
├── config/              # Model and framework settings
├── tasks/               # Task definitions with prompts
├── projects/            # Sandbox projects for testing
├── results/             # Test outputs (gitignored)
└── reports/             # Generated comparison reports
```
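A task definition under `tasks/` could look like the following sketch. The field names here are illustrative, not the actual schema, and `projects/go-sample` is a hypothetical sandbox path:

```yaml
# tasks/cg-001.yaml — hypothetical task definition, not the real schema
id: cg-001
name: "Fix failing unit test"
project: projects/go-sample   # sandbox project copied fresh per run
prompt: |
  The test suite in this project fails. Find and fix the bug.
eval:
  - type: syntax              # compile / parse check
  - type: test                # run the project's test suite
timeout: 300                  # seconds before the run is aborted
```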
- Single binary: No runtime dependencies
- Headless execution: Uses `opencode run --format json`
- Multi-model comparison: Test identical tasks across models
- Automatic evaluation: Syntax checking, test execution
- Rich reports: HTML/JSON/Markdown output
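Because `opencode run --format json` emits machine-readable output, the runner can decode results directly. A minimal sketch, assuming a hypothetical result shape (check the actual OpenCode output schema before relying on these field names):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RunResult is a hypothetical shape for opencode's JSON output;
// the real schema from `opencode run --format json` may differ.
type RunResult struct {
	Model     string `json:"model"`
	Output    string `json:"output"`
	ToolCalls int    `json:"tool_calls"`
}

// parseResult decodes one run's JSON output into a RunResult.
func parseResult(raw []byte) (RunResult, error) {
	var r RunResult
	err := json.Unmarshal(raw, &r)
	return r, err
}

func main() {
	raw := []byte(`{"model":"devstral-small-2","output":"done","tool_calls":4}`)
	r, err := parseResult(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s used %d tool calls\n", r.Model, r.ToolCalls)
}
```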
```bash
# Build the CLI
make build

# Run a single task with one model
agent-bench run -t cg-001 -m devstral-small-2

# Run the full suite with multiple models
agent-bench run -s all -m devstral-small-2 -m glm-4.7-flash

# Generate a comparison report
agent-bench report --run latest --format html

# List available tasks
agent-bench list tasks
```

- Pull recommended models: devstral-small-2, rnj-1
- Create benchmark tasks: Real-world coding scenarios
- Test with OpenCode: Each model under identical conditions
- Optimize prompts: Per-model tuning for best tool use
- Document results: Performance, latency, quality
- MoE models excel: GLM-4.7-Flash and Qwen3-Coder use MoE (30B params, ~3B activated)
- Quantization required: Q4_K_M needed for 20B+ models on 24GB
- Tool calling is critical: Native tool/function calling support essential
- Devstral Small 2: Purpose-built for exactly this use case
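The MoE point is worth quantifying: weights for all 30B parameters must sit in VRAM, but per-token compute only touches the ~3B activated parameters, so throughput approaches that of a small dense model. A quick sanity check of the 24 GB budget (all figures are rough approximations from the tables above):

```go
package main

import "fmt"

func main() {
	const vramGB = 24.0 // Tesla P40
	weightsGB := 19.0   // glm-4.7-flash at Q4_K_M, per the table above

	// Whatever remains must hold the KV cache, activations, and driver
	// overhead, which caps usable context length.
	fmt.Printf("headroom after weights: %.1f GB\n", vramGB-weightsGB)

	// Fraction of parameters touched per token for a 30B MoE
	// with ~3B activated:
	fmt.Printf("activated fraction: %.0f%%\n", 3.0/30.0*100)
}
```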