
Local Coding Agent Testing

Automated testing framework for evaluating local LLM coding agents with Ollama and OpenCode on a Tesla P40 (24GB VRAM).

📄 Full Implementation Plan

Hardware

  • GPU: NVIDIA Tesla P40 (24GB VRAM)
  • Server: Ubuntu (hostname: banana)
  • Ollama: v0.15.1

Goal

Maximize coding agent performance from local models running essentially for free, using:

  • OpenCode as the agent framework
  • Ollama for model serving
  • Prompt optimization techniques

Top Models for Agentic Coding (January 2026)

Model                Size     SWE-bench   Notes
devstral-small-2     15 GB    65.8%       Purpose-built for agentic coding
glm-4.7-flash        19 GB    59.2%       Strongest in 30B class
qwen3-coder:30b      19 GB    -           Best for long context
rnj-1 (8B)           5.1 GB   20.8%       Best small model

Currently Installed Models

glm-4.7-flash:q4_K_M     19 GB
qwen3-coder:30b          18 GB
gpt-oss:20b              13 GB
deepseek-coder-v2:16b    8.9 GB
qwen3:8b                 5.0 GB
llama3:8b                4.7 GB

Testing Framework

A Go CLI tool (agent-bench) for automated evaluation of coding agents.

📄 Go CLI Implementation Plan

local-coding-agent-testing/
├── cmd/agent-bench/       # CLI entry point
├── internal/              # Go packages
│   ├── cli/               # Cobra commands
│   ├── config/            # Configuration loading
│   ├── task/              # Task handling
│   ├── runner/            # OpenCode execution
│   ├── eval/              # Evaluation engine
│   └── report/            # Report generation
├── config/                # Model and framework settings
├── tasks/                 # Task definitions with prompts
├── projects/              # Sandbox projects for testing
├── results/               # Test outputs (gitignored)
└── reports/               # Generated comparison reports

Key Features

  • Single binary: No runtime dependencies
  • Headless execution: Uses opencode run --format json
  • Multi-model comparison: Test identical tasks across models
  • Automatic evaluation: Syntax checking, test execution
  • Rich reports: HTML/JSON/Markdown output

Quick Start

# Build the CLI
make build

# Run single task with one model
agent-bench run -t cg-001 -m devstral-small-2

# Run full suite with multiple models
agent-bench run -s all -m devstral-small-2 -m glm-4.7-flash

# Generate comparison report
agent-bench report --run latest --format html

# List available tasks
agent-bench list tasks

Testing Plan

  1. Pull recommended models: devstral-small-2, rnj-1
  2. Create benchmark tasks: Real-world coding scenarios
  3. Test with OpenCode: Each model under identical conditions
  4. Optimize prompts: Per-model tuning for best tool use
  5. Document results: Performance, latency, quality

Key Findings

  • MoE models excel: GLM-4.7-Flash and Qwen3-Coder use mixture-of-experts (MoE) architectures (30B total params, only ~3B activated per token)
  • Quantization required: Q4_K_M quantization is needed to fit 20B+ models in 24GB of VRAM
  • Tool calling is critical: Native tool/function calling support essential
  • Devstral Small 2: Purpose-built for exactly this use case

