BenchLoop Web

The web surface for BenchLoop, a local-first benchmark suite for LLMs that scores quality, speed, and reliability across seven fixed task suites (speed, toolcall, coding, dataextract, instructfollow, reasonmath, agent).

Pick a model on any reachable endpoint (Ollama, LM Studio, Osaurus, vLLM, oMLX, Jan, or any OpenAI-compatible server), pick the suites, hit Run, watch live progress, then compare results in the leaderboard.

Architecture

bench-loop-web/
  api/    FastAPI app (uvicorn) wrapping the bench-loop runner
  ui/     React + Vite frontend

The API delegates to bench-loop/ (sibling repo) for the actual benchmark logic. Runs are persisted to ~/.bench-loop/runs/ so they survive restarts and show up in the leaderboard from disk.
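
As an illustration of how a leaderboard can be rebuilt from disk, here is a minimal sketch that reads run records back from the runs directory. It assumes each run is stored as one JSON file directly under ~/.bench-loop/runs/; the actual on-disk layout is not documented here, and `load_runs` is a hypothetical helper, not part of the API.

```python
import json
from pathlib import Path

# Assumption: one JSON file per run, directly under the runs directory.
RUNS_DIR = Path.home() / ".bench-loop" / "runs"


def load_runs(runs_dir: Path = RUNS_DIR) -> list:
    """Return every readable run record under runs_dir, ordered by file name."""
    runs = []
    if not runs_dir.is_dir():
        return runs
    for path in sorted(runs_dir.glob("*.json")):
        try:
            runs.append(json.loads(path.read_text(encoding="utf-8")))
        except (OSError, json.JSONDecodeError):
            # Skip unreadable or partially written files rather than
            # failing the whole listing.
            continue
    return runs
```

Skipping unparseable files keeps a run that was interrupted mid-write from hiding every other result.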

Quick start (dev)

Two long-running processes:

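# 1. API (port 8877)
# The absolute paths below point at one developer's checkout; substitute your own bench-loop location.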
cd bench-loop-web/api
PYTHONPATH=/Users/aurora/.ocplatform/workspace/bench-loop \
BENCH_LOOP_DIR=/Users/aurora/.ocplatform/workspace/bench-loop \
  /Users/aurora/.ocplatform/workspace/bench-loop/.venv/bin/python \
  -m uvicorn main:app --host 127.0.0.1 --port 8877 --app-dir .

# 2. UI (port 5180)
cd bench-loop-web/ui
npm install
npx vite --host 127.0.0.1 --port 5180

Open http://127.0.0.1:5180/.

Pages

| Path | Purpose |
| --- | --- |
| `/`, `/models` | Auto-detect local providers, browse model catalog, jump to benchmark |
| `/chat` | Quick chat against any reachable model |
| `/benchmark` | Pick model + suites + harness, run with live progress |
| `/leaderboard` | Best run per model+harness, rank by overall/quality/speed/tok-s/efficiency. Click row for detail, hit Compare per row |
| `/runs/:runId` | Full per-suite scores, speed metrics, machine info, raw JSON |
| `/compare?a=&b=` | Two runs side-by-side with deltas across every metric |
| `/stacks` | Stack-oriented context-window leaderboard |
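
The leaderboard and compare pages boil down to two small operations: keeping the best run per model+harness pair, and subtracting one run's metrics from another's. The sketch below shows one way to do both. The field names (`model`, `harness`, `overall`, `speed`) are assumptions for illustration and may not match the real run schema.

```python
def best_per_model_harness(runs, metric="overall"):
    """Keep the highest-scoring run per (model, harness), ranked by metric."""
    best = {}
    for run in runs:
        key = (run["model"], run["harness"])  # hypothetical field names
        if key not in best or run[metric] > best[key][metric]:
            best[key] = run
    return sorted(best.values(), key=lambda r: r[metric], reverse=True)


def metric_deltas(a, b, metrics):
    """Return b - a for each metric present in both runs."""
    return {m: b[m] - a[m] for m in metrics if m in a and m in b}


runs = [
    {"model": "m1", "harness": "raw", "overall": 70.0, "speed": 50.0},
    {"model": "m1", "harness": "raw", "overall": 82.0, "speed": 40.0},
    {"model": "m1", "harness": "hermes", "overall": 75.0, "speed": 45.0},
]
board = best_per_model_harness(runs)
deltas = metric_deltas(runs[0], runs[1], ["overall", "speed"])
```

With this sample data, `board` holds two rows (raw at 82.0, then hermes at 75.0) and `deltas` is `{"overall": 12.0, "speed": -10.0}`.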

API endpoints

| Route | What |
| --- | --- |
| `GET /api/health` | Liveness |
| `GET /api/hardware` | Local machine info (CPU, GPU, memory) |
| `GET /api/models?endpoint=...` | List models. If endpoint omitted, auto-probe localhost for Ollama (11434), LM Studio (1234), oMLX/Osaurus (8000), Jan (1337), vLLM (8080) |
| `GET /api/models/preflight?endpoint=...&model=...` | Verify a model can actually load |
| `GET /api/models/search-hf?q=&limit=` | Search Hugging Face |
| `GET /api/models/hf-details?repo=` | HF repo metadata |
| `POST /api/models/pull` | Trigger a model pull |
| `GET /api/models/pull/active` | List in-flight pulls |
| `GET /api/models/pull/{id}/stream` | SSE for pull progress |
| `POST /api/benchmark/run` | Start a benchmark. Body: `{model, endpoint, provider, suites[], harness}` |
| `GET /api/benchmark/runs` | List persisted runs with v2 speed-score recompute |
| `GET /api/benchmark/runs/{runId}` | Run detail (active or persisted) |
| `GET /api/benchmark/stream/{runId}` | SSE for live progress |
| `POST /api/chat/generate` | Passthrough chat completion |
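
To make the benchmark routes concrete, here is a rough client-side sketch using only the standard library. It builds the `POST /api/benchmark/run` request from the body fields listed above and parses `data:` lines from an SSE stream. The model name `llama3` is a placeholder, and the response and event payload shapes are not documented here, so the parser yields raw strings.

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8877"  # dev API address from the quick start


def build_run_request(model, endpoint, provider, suites, harness="raw",
                      api_base=API_BASE):
    """Build (but do not send) the POST request that starts a benchmark."""
    body = {
        "model": model,
        "endpoint": endpoint,
        "provider": provider,
        "suites": list(suites),
        "harness": harness,
    }
    return urllib.request.Request(
        f"{api_base}/api/benchmark/run",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format strips one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            # A blank line ends the event; multi-line data is newline-joined.
            yield "\n".join(buffer)
            buffer = []
        # Comment lines (starting with ":") and other fields are ignored.


req = build_run_request("llama3", "http://localhost:11434", "ollama",
                        ["speed", "coding"])
```

Sending the request with `urllib.request.urlopen(req)` requires the API process to be running. `iter_sse_data` can then be fed the decoded lines of the `/api/benchmark/stream/{runId}` response.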

Providers

Provider type is auto-detected per model and passed to the runner:

ollama — Ollama's /api/chat (default for http://localhost:11434 and any tunnelled Ollama)
openai_compat — Any OpenAI-compatible /v1/chat/completions: LM Studio, vLLM, Osaurus/MLX, Jan, oMLX, hosted endpoints

The UI's BenchmarkTab picks the correct provider based on the chosen model's source — no manual selection needed.
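
The practical difference between the two provider types is the chat route each one exposes. The sketch below maps a provider to its chat URL. `chat_url` is a hypothetical helper for illustration, not the runner's actual function.

```python
# Chat routes per provider type, as listed above.
PROVIDER_CHAT_PATHS = {
    "ollama": "/api/chat",
    "openai_compat": "/v1/chat/completions",
}


def chat_url(provider, endpoint):
    """Join an endpoint base URL with the chat route for the given provider."""
    try:
        path = PROVIDER_CHAT_PATHS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    # Tolerate a trailing slash on the configured endpoint.
    return endpoint.rstrip("/") + path
```

For example, `chat_url("ollama", "http://localhost:11434")` gives `http://localhost:11434/api/chat`.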

Harnesses

Wrap the same task in different prompt/parse contracts so you can A/B "this model with raw tools" vs "this model with Hermes tags":

raw — vanilla OpenAI-style tools, no prompt rewriting
hermes — NousResearch <tool_call>{...}</tool_call> XML tags
qwen — Qwen3 <function_call>{...}</function_call> tags
pi — OpenClaw/Pi-style <think>...</think> + Hermes tags
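
To show what a parse contract means for the tag-based harnesses, here is a minimal sketch that pulls JSON tool calls out of model output. The tag names come from the list above. The JSON shape inside the tags (`name`, `arguments`) and both helper functions are assumptions, not the runner's real parser.

```python
import json
import re

# Tag wrapping each tool call, per harness. "pi" reuses the Hermes tag.
HARNESS_TAGS = {"hermes": "tool_call", "qwen": "function_call", "pi": "tool_call"}


def extract_tool_calls(text, harness):
    """Return the decoded JSON payload of every well-formed tool-call tag."""
    tag = HARNESS_TAGS[harness]
    pattern = re.compile(rf"<{tag}>\s*(.*?)\s*</{tag}>", re.DOTALL)
    calls = []
    for match in pattern.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Malformed JSON inside the tags is skipped, not raised.
            continue
    return calls


def strip_think(text):
    """Remove <think>...</think> blocks, as a pi-style parse would need."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


output = ('I will check.\n<tool_call>{"name": "get_weather", '
          '"arguments": {"city": "Oslo"}}</tool_call>')
calls = extract_tool_calls(output, "hermes")
```

Whether a malformed call is skipped or counted as a failure is a scoring decision. This sketch skips it.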

What ships in v1

✅ Seven fixed task suites, deterministic + reproducible
✅ Live SSE progress per task
✅ Provider auto-detect (Ollama + OpenAI-compatible)
✅ Run persistence + leaderboard from disk
✅ Per-run detail + side-by-side compare
✅ Speed-score v2 curve (anchored on real M-series/RTX reference points)
✅ Preflight model-load check with actionable diagnostics
⏳ True streaming TTFT (currently reported as 0 for openai_compat; needs a streaming pass)
⏳ Hosted leaderboard at bench-loop.com
⏳ Community submission flow

License

TBD before the public launch.
