feat(eval): online trace scorer — nuvel eval (v1) by Folken2 · Pull Request #35 · Folken2/nuvel

Folken2 · 2026-05-20T20:28:32Z

Summary

First end-to-end implementation of the eval harness per spec 2026-05-20-eval-harness-v1-design.md and plan 2026-05-20-eval-harness.md. Online trace scorer: heuristics-first, judge-on-pass, scored.jsonl siblings, full nuvel eval CLI, dashboard score column, drift detection.

Heuristics (8 flags): tool_error, llm_error, no_assistant_output, incomplete_trace, excessive_turns, tool_loop, cost_outlier, latency_outlier, token_bloat — deterministic, pure functions over Run.events.
Judge: litellm.acompletion via an injection seam (tests never make real network calls). Model resolves: rubric → EVAL_JUDGE_MODEL env → DEFAULT_FAST_MODEL. Tolerant JSON parse, one retry.
Scorer orchestrator: ScoreSession owns idempotency (skip on trace_id + scorer_version), concurrency (semaphore=5), and the cost budget (judges disable cleanly once spend crosses --max-cost-usd; heuristics keep running so nothing is dropped silently).
CLI: nuvel eval score | report | worst | drift. drift exits non-zero when any agent crosses threshold, ready for a future webhook layer.
Dashboard: loader side-loads scored.jsonl; home view shows a monochrome-friendly score pill (good/warn/bad/none) with the flag list in the tooltip.

Code preservation note

One change to existing code in nuvel/traces_cli.py: _iter_trace_files now skips reserved companion files (initially just scored.jsonl). Without this, the CLI/dashboard would re-ingest scored output as if it were trace data — a latent bug surfaced by the scorer's tests. Tested against the existing test_dashboard_watcher.py and test_dashboard_loader suites — no regressions.
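The reserved-file skip can be sketched as follows. This is a minimal illustration, assuming traces are stored as `*.jsonl` files in a single directory; `RESERVED_COMPANION_FILES` and `iter_trace_files` are hypothetical names, and the real `_iter_trace_files` in nuvel/traces_cli.py may be structured differently.

```python
from pathlib import Path
from typing import Iterator

# Hypothetical constant: files that sit next to traces but are not traces.
RESERVED_COMPANION_FILES = frozenset({"scored.jsonl"})


def iter_trace_files(trace_dir: Path) -> Iterator[Path]:
    """Yield trace files in sorted order, skipping reserved companion files."""
    for path in sorted(trace_dir.glob("*.jsonl")):
        if path.name in RESERVED_COMPANION_FILES:
            continue  # scored output must never be ingested as a run
        yield path
```

Keeping the reserved names in one set means a future companion file only needs one new entry rather than another special case in the loop.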

Behavior changes

New CLI subcommand nuvel eval — see docs/reference/cli.md.
New env var EVAL_JUDGE_MODEL — documented in docs/reference/env-vars.md.
Dashboard run cards gain a Score column. Unscored runs render as a muted em-dash; no visual regression when scored.jsonl doesn't exist.

Module layout

nuvel/eval/
  __init__.py     # public surface
  schema.py       # ScoredRun, JudgeResult, Flag, SCORER_VERSION
  heuristics.py   # 8 deterministic flag rules
  stats.py        # per-agent rolling percentiles (p95 cost/latency, p99 tokens)
  rubric.py       # default + YAML override loader; model priority chain
  judge.py        # litellm wrapper with retry + tolerant JSON parse
  scorer.py       # score_run + ScoreSession orchestrator
  writer.py       # append-only scored.jsonl + tolerant index loader
  drift.py        # rolling-window mean comparison per agent
  report.py       # render_report / render_worst / render_drift
  cli.py          # argparse subparsers
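
As an illustration of what stats.py is described as doing, here is a rough sketch of a per-agent rolling percentile with a window cap. The nearest-rank method, the `RollingStats` class, and the default window size are all assumptions made for this example, not details taken from the PR.

```python
import math
from collections import defaultdict, deque


def percentile(values, pct):
    """Nearest-rank percentile; returns None for an empty sample."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


class RollingStats:
    """Keeps the last `window` cost observations per agent (hypothetical helper)."""

    def __init__(self, window=200):
        # deque(maxlen=...) silently drops the oldest value once the cap is hit.
        self._costs = defaultdict(lambda: deque(maxlen=window))

    def observe(self, agent, cost_usd):
        self._costs[agent].append(cost_usd)

    def p95_cost(self, agent):
        return percentile(list(self._costs[agent]), 95)
```

A run would then be a cost outlier candidate when its cost exceeds `p95_cost(agent)` for its own agent, which keeps expensive agents from flagging every run.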

Test coverage

+88 new tests, all passing. Full suite: 471 passing.

test_eval_schema.py — round-trip, constants, judge ok property
test_eval_stats.py — percentile math, agent grouping, window cap
test_eval_heuristics.py — every flag rule + stacked-penalty floor
test_eval_rubric.py — default + YAML override + model priority chain + malformed YAML
test_eval_judge.py — prompt assembly, JSON-from-prose, retry, error
test_eval_writer.py — append/load round-trip, tolerance, last-wins
test_eval_scorer.py — orchestration paths (heuristic-floor, judge-disabled, force, version-bump, budget exhaust, idempotent)
test_eval_drift.py — window math + threshold edges + naive-tz handling
test_eval_cli.py — full argparse paths via top-level nuvel.cli.build_parser
test_dashboard_score_integration.py — loader join + view rendering + full FastAPI endpoint

What's NOT in v1 (deferred per spec)

Alert delivery (Slack/webhook) — drift exit code is the seam.
Score migration tooling for large scorer_version bumps.
Golden-set / regression CI gate (separate harness).
Replay with perturbed config (separate harness).
Storing scored output in Postgres (stays JSONL until OrgMemoryService justifies a shared DB).

Test plan

All new and existing unit tests pass locally (471/471).
CLI nuvel eval --help and subcommand helps render correctly.
Real-data smoke test: EVAL_JUDGE_MODEL=openrouter/moonshotai/kimi-k2.5 nuvel eval score --max-cost-usd 0.20 against existing generated-agents/*/traces. Not run overnight — judge calls cost money and require OPENROUTER_API_KEY.
Dashboard smoke: nuvel dashboard with at least one scored run — confirm Score pill renders and tooltip lists flags.

🤖 Generated with Claude Code

Hierarchical, scope-aware memory layer for nuvel agents. Defines the ADK-compatible abstraction, denormalized scope_chain data model on Postgres+pgvector, write-time scope ACLs, and inheritance reads with tier-weighted ranking. Promotion governance and peer ACLs deferred to v2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Spec defines the online trace scorer (heuristics-first, judge-on-pass) with scored.jsonl siblings, nuvel eval CLI, drift detection, and dashboard integration. Plan sequences the work into 7 phases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Plain dataclasses, no pydantic — matches the existing Run dataclass style. Flag is a class of string constants rather than StrEnum so a future scorer can write flag values this version doesn't know about without breaking deserialization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
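
A small sketch of why string constants keep deserialization forward-compatible. The field names on `ScoredRun` and the `from_dict` helper are illustrative assumptions, not the actual contents of schema.py.

```python
from dataclasses import dataclass, field


class Flag:
    """String constants rather than an enum, so unknown values still load."""
    TOOL_ERROR = "tool_error"
    LLM_ERROR = "llm_error"
    TOOL_LOOP = "tool_loop"


@dataclass
class ScoredRun:
    # Field names here are illustrative, not the real schema.
    trace_id: str
    score: float
    flags: list[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> "ScoredRun":
        # Flags stay plain strings: there is no enum lookup that could raise
        # on a value written by a newer scorer version.
        return cls(
            trace_id=data["trace_id"],
            score=float(data["score"]),
            flags=[str(f) for f in data.get("flags", [])],
        )
```

With an enum, loading a record that carries `"flag_from_v2"` would raise `ValueError`; here it is simply carried through as an unrecognized string.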

Eight flags: the spec's seven plus LLM_ERROR, which the spec conflates with TOOL_ERROR despite their different remediation paths. Heuristics consume Run.events (already populated when keep_events=True), so no lazy loader is needed. Stacked penalties floor at zero. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
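
A minimal sketch of pure-function heuristics with a floor-at-zero score. The event shapes (`type`, `error`, `text` keys), the penalty weights, and the stand-in `Run` class are assumptions for illustration only; the real rules live in nuvel/eval/heuristics.py.

```python
from dataclasses import dataclass, field

# Hypothetical penalty weights, chosen only so that stacking exceeds 1.0.
PENALTIES = {"tool_error": 0.4, "llm_error": 0.4, "no_assistant_output": 0.5}


@dataclass
class Run:
    # Stand-in for the real Run dataclass; only `events` matters here.
    events: list[dict] = field(default_factory=list)


def detect_flags(run: Run) -> list[str]:
    """Deterministic and side-effect free: same events in, same flags out."""
    flags = []
    if any(e.get("type") == "tool_result" and e.get("error") for e in run.events):
        flags.append("tool_error")
    if any(e.get("type") == "llm_response" and e.get("error") for e in run.events):
        flags.append("llm_error")
    if not any(e.get("type") == "assistant_message" and e.get("text") for e in run.events):
        flags.append("no_assistant_output")
    return flags


def heuristic_score(flags: list[str]) -> float:
    """Subtract each flag's penalty from 1.0, never going below zero."""
    penalty = sum(PENALTIES.get(f, 0.0) for f in flags)
    return max(0.0, 1.0 - penalty)
```

Keeping tool and LLM errors as separate flags lets a report distinguish a failing tool from a failing model call, even when both push the score to the same floor.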

litellm.acompletion behind an injection seam so tests never make real network calls. Tolerant JSON parse (handles models that wrap output in fences/prose). One retry on parse-fail or transient error. Model resolves per spec: rubric.judge_model → EVAL_JUDGE_MODEL env → DEFAULT_FAST_MODEL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
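
The model priority chain and the tolerant parse might look roughly like this. `resolve_judge_model`, `parse_judge_json`, and the placeholder value of `DEFAULT_FAST_MODEL` are assumptions for the sketch; only the resolution order and the fence/prose tolerance come from the commit message.

```python
import json
import os
import re

DEFAULT_FAST_MODEL = "example/fast-model"  # placeholder, not the real default

# Matches a fenced block, with or without a "json" language tag.
_FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def resolve_judge_model(rubric_model=None, env=None):
    """rubric.judge_model, then EVAL_JUDGE_MODEL, then DEFAULT_FAST_MODEL."""
    env = os.environ if env is None else env
    return rubric_model or env.get("EVAL_JUDGE_MODEL") or DEFAULT_FAST_MODEL


def parse_judge_json(text):
    """Return the first JSON object recoverable from `text`, or None."""
    candidates = [text]                      # 1. the reply is already pure JSON
    candidates += _FENCE.findall(text)       # 2. JSON wrapped in a code fence
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])  # 3. JSON embedded in prose
    for candidate in candidates:
        try:
            value = json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    return None
```

A `None` result is what would trigger the single retry; passing `env` explicitly keeps the resolver testable without touching the process environment.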

ScoreSession owns idempotency (skip on trace_id + scorer_version match, unless --force), concurrency (semaphore=5), and the cost budget (judges disabled cleanly once spend crosses --max-cost-usd; heuristics keep running). One scored.jsonl per trace directory; reserved as a non-trace sibling so traces_cli.py / dashboard never re-ingest scored output as runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
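
A compact sketch of the three responsibilities described above: the idempotent skip, the concurrency cap, and the budget gate. The constructor signature, the `judge` callable returning `(score, cost_usd)`, and the result dict are hypothetical; the real ScoreSession in scorer.py will differ.

```python
import asyncio

SCORER_VERSION = "v1"  # placeholder value


class ScoreSession:
    def __init__(self, judge, already_scored, max_cost_usd, concurrency=5, force=False):
        self._judge = judge                  # async callable: trace_id -> (score, cost_usd)
        self._seen = set(already_scored)     # {(trace_id, scorer_version)}
        self._budget = max_cost_usd
        self._sem = asyncio.Semaphore(concurrency)
        self._force = force
        self.spent = 0.0

    async def score(self, trace_id, heuristic_score):
        key = (trace_id, SCORER_VERSION)
        if key in self._seen and not self._force:
            return None  # idempotent skip: already scored at this version
        self._seen.add(key)
        # The heuristic result is always kept, even when the judge is disabled.
        result = {"trace_id": trace_id, "heuristic": heuristic_score, "judge": None}
        async with self._sem:
            # Checked before each call, so in-flight calls can overshoot slightly.
            if self.spent < self._budget:
                judge_score, cost = await self._judge(trace_id)
                self.spent += cost
                result["judge"] = judge_score
        return result
```

Because the budget is checked before each judge call rather than reserved up front, spend can exceed the limit by the cost of calls already in flight, which matches the "once spend crosses" wording.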

Four subcommands following the same register(sub) pattern as traces_cli. Drift exits non-zero (rc=2) when any agent crosses the threshold, so it can be wired into a future webhook/alert layer without further work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
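
One way the drift check could map onto an exit code is sketched here. Comparing the latest window of scores against the window immediately before it, and the default `window` and `threshold` values, are assumptions; only the rc=2 behavior is taken from the commit message.

```python
from statistics import mean

DRIFT_EXIT_CODE = 2  # rc=2 when any agent crosses the threshold


def drifted_agents(scores_by_agent, window=20, threshold=0.1):
    """Return {agent: drop} where the recent mean fell by more than `threshold`."""
    drifted = {}
    for agent, scores in scores_by_agent.items():
        if len(scores) < 2 * window:
            continue  # not enough history to compare two full windows
        baseline = mean(scores[-2 * window:-window])
        recent = mean(scores[-window:])
        drop = baseline - recent
        if drop > threshold:
            drifted[agent] = drop
    return drifted


def drift_exit_code(scores_by_agent, **kwargs):
    """0 when every agent is stable, DRIFT_EXIT_CODE otherwise."""
    return DRIFT_EXIT_CODE if drifted_agents(scores_by_agent, **kwargs) else 0
```

A CI job or cron wrapper can then branch on the process exit status alone, which is what makes the exit code usable as the seam for later alert delivery.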

TraceLoader side-loads scored.jsonl siblings into a {trace_id: ScoredRun} index; the home view, run_detail, and the SSE-feed endpoint join on it. Unscored runs render as a muted em-dash so the column reads cleanly when only some runs have been scored. Flags surface in the tooltip; a dedicated column can come in v1.1 if we need it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
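
The side-load and join can be sketched as below. `load_scored_index` and `score_label` are hypothetical helpers, and the record layout (`trace_id`, `score`) is assumed; the tolerant, last-wins behavior follows the writer tests listed in the PR description.

```python
import json
from pathlib import Path


def load_scored_index(trace_dir: Path) -> dict:
    """Build {trace_id: record} from scored.jsonl; later lines win."""
    index = {}
    path = trace_dir / "scored.jsonl"
    if not path.exists():
        return index  # nothing scored yet: every run renders as unscored
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a truncated or corrupt line
            if isinstance(record, dict) and "trace_id" in record:
                index[record["trace_id"]] = record  # append-only file, last wins
    return index


def score_label(index: dict, trace_id: str) -> str:
    """Text for the Score column; an em-dash marks an unscored run."""
    record = index.get(trace_id)
    return "\u2014" if record is None else f"{record['score']:.2f}"
```

Because the file is append-only, re-scoring a run adds a new line instead of rewriting an old one, and the loader resolves duplicates by keeping the latest entry.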

Adds the eval section to env-vars and cli reference, plus a one-line mention in the README command table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two real findings from the first smoke test against meta_agent traces:

1) Kimi K2.5 (via OpenRouter) returns empty content when asked for response_format=json_object; the failure rate went up, not down, when the hint was added. We now try with the hint first (modern models honor it) and transparently retry once without it on empty content. The costs of both calls are summed.
2) litellm.completion_cost() prints noisy "Provider List" banners to stderr when the model alias isn't in its pricing table (true for OpenRouter-prefixed Kimi and Anthropic ids). That output is now suppressed, and one structured WARN is logged per unrecognized model so the silent $0.00 cost-budget bypass is visible.

Verified on real traces: Kimi 5/5 ok (was 3/5), Haiku 5/5 ok first-shot with measurably sharper judge notes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
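
The hint-then-fallback behavior from the first finding can be sketched with an injected completion callable. `judge_with_fallback` and the `(content, cost_usd)` return shape of `complete` are assumptions made to keep the example self-contained; the real judge wraps litellm.acompletion, whose response object is richer than this.

```python
async def judge_with_fallback(complete, model, messages):
    """Try with the JSON hint first; retry once without it on empty content.

    `complete` is an injected async callable returning (content, cost_usd),
    a hypothetical stand-in for the real litellm wrapper.
    """
    content, cost = await complete(
        model=model,
        messages=messages,
        response_format={"type": "json_object"},
    )
    if content:
        return content, cost
    # Some models return empty content when given the hint: retry without it.
    retry_content, retry_cost = await complete(model=model, messages=messages)
    # Both attempts were billed, so both count toward the cost budget.
    return retry_content, cost + retry_cost
```

Summing both costs keeps the budget honest: a model that always needs the fallback effectively costs twice as much per judged run.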

Folken2 and others added 10 commits May 20, 2026 20:49

docs(eval): document nuvel eval CLI and EVAL_JUDGE_MODEL

d8472d0

Folken2 wants to merge 10 commits into main from feature/eval-harness
