feat(eval): online trace scorer — nuvel eval (v1)#35

Open. Folken2 wants to merge 10 commits into main from feature/eval-harness

@Folken2 Folken2 commented May 20, 2026

Summary

First end-to-end implementation of the eval harness per spec 2026-05-20-eval-harness-v1-design.md and plan 2026-05-20-eval-harness.md. Online trace scorer: heuristics-first, judge-on-pass, scored.jsonl siblings, full nuvel eval CLI, dashboard score column, drift detection.

  • Heuristics (8 flags): tool_error, llm_error, no_assistant_output, incomplete_trace, excessive_turns, tool_loop, cost_outlier, latency_outlier, token_bloat — deterministic, pure functions over Run.events.
  • Judge: litellm.acompletion via an injection seam (tests never make real network calls). Model resolves: rubric → EVAL_JUDGE_MODEL env → DEFAULT_FAST_MODEL. Tolerant JSON parse, one retry.
  • Scorer orchestrator: ScoreSession owns idempotency (skip on trace_id + scorer_version), concurrency (semaphore=5), and the cost budget (judges disable cleanly once spend crosses --max-cost-usd; heuristics keep running so nothing is dropped silently).
  • CLI: nuvel eval score | report | worst | drift. drift exits non-zero when any agent crosses threshold, ready for a future webhook layer.
  • Dashboard: loader side-loads scored.jsonl; home view shows a monochrome-friendly score pill (good/warn/bad/none) with the flag list in the tooltip.

Code preservation note

One change to existing code in nuvel/traces_cli.py: _iter_trace_files now skips reserved companion files (initially just scored.jsonl). Without this, the CLI/dashboard would re-ingest scored output as if it were trace data — a latent bug surfaced by the scorer's tests. Tested against the existing test_dashboard_watcher.py and test_dashboard_loader suites — no regressions.
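A minimal sketch of the reserved-companion filter described above, for reviewers who want the shape of the change without opening the diff. The function name `iter_trace_files`, the `*.jsonl` glob, and the `RESERVED_COMPANION_FILES` constant are assumptions for illustration; the real `_iter_trace_files` in nuvel/traces_cli.py may differ.

```python
from pathlib import Path
from typing import Iterator

# Assumption: reserved names live in one set so future companions can be added.
RESERVED_COMPANION_FILES = frozenset({"scored.jsonl"})


def iter_trace_files(trace_dir: Path) -> Iterator[Path]:
    """Yield trace files in trace_dir, skipping reserved companion files."""
    for path in sorted(trace_dir.glob("*.jsonl")):
        if path.name in RESERVED_COMPANION_FILES:
            # Scored output sits next to traces but is not trace data.
            continue
        yield path
```

Filtering by file name keeps the check cheap and leaves the trace parser untouched.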

Behavior changes

  • New CLI subcommand nuvel eval — see docs/reference/cli.md.
  • New env var EVAL_JUDGE_MODEL — documented in docs/reference/env-vars.md.
  • Dashboard run cards gain a Score column. Unscored runs render as a muted em-dash; no visual regression when scored.jsonl doesn't exist.

Module layout

nuvel/eval/
  __init__.py     # public surface
  schema.py       # ScoredRun, JudgeResult, Flag, SCORER_VERSION
  heuristics.py   # 8 deterministic flag rules
  stats.py        # per-agent rolling percentiles (p95 cost/latency, p99 tokens)
  rubric.py       # default + YAML override loader; model priority chain
  judge.py        # litellm wrapper with retry + tolerant JSON parse
  scorer.py       # score_run + ScoreSession orchestrator
  writer.py       # append-only scored.jsonl + tolerant index loader
  drift.py        # rolling-window mean comparison per agent
  report.py       # render_report / render_worst / render_drift
  cli.py          # argparse subparsers

Test coverage

+88 new tests, all passing. Full suite: 471 passing.

  • test_eval_schema.py — round-trip, constants, judge ok property
  • test_eval_stats.py — percentile math, agent grouping, window cap
  • test_eval_heuristics.py — every flag rule + stacked-penalty floor
  • test_eval_rubric.py — default + YAML override + model priority chain + malformed YAML
  • test_eval_judge.py — prompt assembly, JSON-from-prose, retry, error
  • test_eval_writer.py — append/load round-trip, tolerance, last-wins
  • test_eval_scorer.py — orchestration paths (heuristic-floor, judge-disabled, force, version-bump, budget exhaust, idempotent)
  • test_eval_drift.py — window math + threshold edges + naive-tz handling
  • test_eval_cli.py — full argparse paths via top-level nuvel.cli.build_parser
  • test_dashboard_score_integration.py — loader join + view rendering + full FastAPI endpoint

What's NOT in v1 (deferred per spec)

  • Alert delivery (Slack/webhook) — drift exit code is the seam.
  • Score migration tooling for large scorer_version bumps.
  • Golden-set / regression CI gate (separate harness).
  • Replay with perturbed config (separate harness).
  • Storing scored output in Postgres (stays JSONL until OrgMemoryService justifies a shared DB).

Test plan

  • All new and existing unit tests pass locally (471/471).
  • CLI nuvel eval --help and subcommand helps render correctly.
  • Real-data smoke test: EVAL_JUDGE_MODEL=openrouter/moonshotai/kimi-k2.5 nuvel eval score --max-cost-usd 0.20 against existing generated-agents/*/traces. Not run overnight — judge calls cost money and require OPENROUTER_API_KEY.
  • Dashboard smoke: nuvel dashboard with at least one scored run — confirm Score pill renders and tooltip lists flags.

🤖 Generated with Claude Code

Folken2 and others added 10 commits May 20, 2026 20:49
Hierarchical, scope-aware memory layer for nuvel agents. Defines the
ADK-compatible abstraction, denormalized scope_chain data model on
Postgres+pgvector, write-time scope ACLs, and inheritance reads with
tier-weighted ranking. Promotion governance and peer ACLs deferred to v2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec defines the online trace scorer (heuristics-first, judge-on-pass)
with scored.jsonl siblings, nuvel eval CLI, drift detection, and
dashboard integration. Plan sequences the work into 7 phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plain dataclasses, no pydantic — matches the existing Run dataclass
style. Flag is a class of string constants rather than StrEnum so a
future scorer can write flag values this version doesn't know about
without breaking deserialization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eight flags: the spec's seven plus LLM_ERROR, which the spec conflates
with TOOL_ERROR even though the two have different remediation paths. Heuristics consume
Run.events (already populated when keep_events=True) — no lazy loader
needed. Floor-at-zero on stacked penalties.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
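To illustrate "pure functions over Run.events" and the floor-at-zero rule, here is a hedged sketch covering three of the flags. The event shape (`type`, `error`, `text` keys) and the penalty weights are hypothetical; the PR does not state either.

```python
from typing import Any, Callable

Event = dict[str, Any]

# Hypothetical penalty weights, chosen only so the floor is visible.
PENALTIES = {"tool_error": 0.4, "llm_error": 0.4, "no_assistant_output": 0.5}


def flag_tool_error(events: list[Event]) -> bool:
    return any(e.get("type") == "tool_result" and e.get("error") for e in events)


def flag_llm_error(events: list[Event]) -> bool:
    return any(e.get("type") == "llm_response" and e.get("error") for e in events)


def flag_no_assistant_output(events: list[Event]) -> bool:
    return not any(e.get("type") == "assistant_message" and e.get("text") for e in events)


RULES: dict[str, Callable[[list[Event]], bool]] = {
    "tool_error": flag_tool_error,
    "llm_error": flag_llm_error,
    "no_assistant_output": flag_no_assistant_output,
}


def heuristic_score(events: list[Event]) -> tuple[float, list[str]]:
    """Return (score, flags); stacked penalties never push the score below 0."""
    flags = [name for name, rule in RULES.items() if rule(events)]
    score = 1.0 - sum(PENALTIES[name] for name in flags)
    return max(0.0, score), flags
```

Because each rule depends only on its input list, the rules can be tested one at a time with hand-built events.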
litellm.acompletion behind an injection seam so tests never make real
network calls. Tolerant JSON parse (handles models that wrap output in
fences/prose). One retry on parse-fail or transient error. Model
resolves per spec: rubric.judge_model → EVAL_JUDGE_MODEL env →
DEFAULT_FAST_MODEL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
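The model priority chain and the tolerant parse can be sketched as below. This is an illustration under assumptions, not the code in judge.py: `resolve_judge_model`, `parse_judge_json`, and the placeholder value of `DEFAULT_FAST_MODEL` are invented here. The parse scans for the first decodable JSON object, which covers output wrapped in fences or prose.

```python
import json
import os
from typing import Any, Mapping, Optional

DEFAULT_FAST_MODEL = "example/fast-model"  # placeholder, not the project's real default


def resolve_judge_model(rubric_model: Optional[str],
                        env: Mapping[str, str] = os.environ) -> str:
    """Priority: rubric value, then EVAL_JUDGE_MODEL, then the default."""
    return rubric_model or env.get("EVAL_JUDGE_MODEL") or DEFAULT_FAST_MODEL


def parse_judge_json(text: str) -> Optional[dict[str, Any]]:
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj  # decoding started at "{", so this is a dict
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return None
```

A `None` result is what would trigger the single retry mentioned in the commit message.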
ScoreSession owns idempotency (skip on trace_id + scorer_version match,
unless --force), concurrency (semaphore=5), and the cost budget (judges
disabled cleanly once spend crosses --max-cost-usd; heuristics keep
running). One scored.jsonl per trace directory; reserved as a non-trace
sibling so traces_cli.py / dashboard never re-ingest scored output as
runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
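A rough sketch of how the three ScoreSession responsibilities could fit together. `SessionSketch`, its fields, the injected `judge` callable, and the `SCORER_VERSION` value are all hypothetical. A `None` result stands for "heuristics only, judge disabled by budget".

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Optional

SCORER_VERSION = "v1"  # placeholder value

# Injected judge: trace_id -> (score, cost_usd). Hypothetical signature.
JudgeFn = Callable[[str], Awaitable[tuple[float, float]]]


@dataclass
class SessionSketch:
    judge: JudgeFn
    max_cost_usd: float
    already_scored: set[tuple[str, str]] = field(default_factory=set)
    concurrency: int = 5
    force: bool = False
    spent_usd: float = 0.0

    async def score_all(self, trace_ids: list[str]) -> dict[str, Optional[float]]:
        sem = asyncio.Semaphore(self.concurrency)
        results: dict[str, Optional[float]] = {}

        async def one(trace_id: str) -> None:
            key = (trace_id, SCORER_VERSION)
            if key in self.already_scored and not self.force:
                return  # idempotent skip: same trace, same scorer version
            async with sem:
                if self.spent_usd >= self.max_cost_usd:
                    results[trace_id] = None  # budget spent: skip the judge
                else:
                    score, cost = await self.judge(trace_id)
                    self.spent_usd += cost
                    results[trace_id] = score
            self.already_scored.add(key)

        await asyncio.gather(*(one(t) for t in trace_ids))
        return results
```

In this sketch the budget is checked before each judge call, so up to `concurrency` in-flight calls can overshoot the limit slightly. Whether the real session behaves the same way is not stated in the PR.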
Four subcommands following the same register(sub) pattern as
traces_cli. Drift exits non-zero (rc=2) when any agent crosses the
threshold, so it can be wired into a future webhook/alert layer
without further work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
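The exit-code seam can be pictured with the sketch below. It assumes count-based windows (the latest `window` scores against the `window` before them) and a drop-in-mean threshold; the real drift.py may use time-based windows instead. Only the rc=2 value comes from the commit message.

```python
from statistics import mean

DRIFT_EXIT_CODE = 2  # per the commit message: rc=2 when any agent crosses the threshold


def drifted_agents(scores_by_agent: dict[str, list[float]],
                   window: int, threshold: float) -> dict[str, float]:
    """Map each drifting agent to its drop in mean score between windows."""
    drifted: dict[str, float] = {}
    for agent, scores in scores_by_agent.items():
        if len(scores) < 2 * window:
            continue  # not enough history to compare two full windows
        baseline = mean(scores[-2 * window:-window])
        recent = mean(scores[-window:])
        drop = baseline - recent
        if drop >= threshold:
            drifted[agent] = drop
    return drifted


def drift_exit_code(scores_by_agent: dict[str, list[float]],
                    window: int, threshold: float) -> int:
    """Return 2 if any agent drifted, otherwise 0."""
    return DRIFT_EXIT_CODE if drifted_agents(scores_by_agent, window, threshold) else 0
```

A caller such as a cron job or CI step can branch on the return code without parsing the report text.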
TraceLoader side-loads scored.jsonl siblings into a {trace_id: ScoredRun}
index; the home view, run_detail, and the SSE-feed endpoint join on it.
Unscored runs render as a muted em-dash so the column reads cleanly
when only some runs have been scored. Flags surface in the tooltip; a
dedicated column can come in v1.1 if we need it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
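A sketch of the side-load step, assuming each scored.jsonl line is a JSON object with a `trace_id` field (the other fields are illustrative). It mirrors the "tolerance, last-wins" behavior named in the test list: malformed lines are skipped and a later record for the same trace replaces an earlier one. `load_scored_index` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import Any


def load_scored_index(trace_dir: Path) -> dict[str, dict[str, Any]]:
    """Build {trace_id: record} from trace_dir/scored.jsonl, if it exists."""
    index: dict[str, dict[str, Any]] = {}
    path = trace_dir / "scored.jsonl"
    if not path.exists():
        return index  # unscored directory: every run renders as "none"
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a truncated or corrupt line
            if isinstance(record, dict) and isinstance(record.get("trace_id"), str):
                index[record["trace_id"]] = record  # last record wins
    return index
```

Because the file is append-only, last-wins lets a rescore supersede an old entry without rewriting the file.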
Adds the eval section to env-vars and cli reference, plus a one-line
mention in the README command table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two real findings from the first smoke test against meta_agent traces:

1) Kimi K2.5 (via OpenRouter) returns empty content when asked for
   response_format=json_object — failure rate went UP, not down, when
   the hint was added. Now we try with the hint first (modern models
   honor it) and transparently retry once without it on empty content.
   Both calls' costs are summed.

2) litellm.completion_cost() prints noisy "Provider List" banners to
   stderr when the model alias isn't in its pricing table (true for
   OpenRouter-prefixed Kimi and Anthropic ids). Suppress that, and log
   one structured WARN per unrecognized model so the silent $0.00
   cost-budget bypass is visible.

Verified on real traces: Kimi 5/5 ok (was 3/5), Haiku 5/5 ok first-shot
with measurably sharper judge notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
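Finding 1 amounts to a two-step fallback, sketched here with an injected completion callable rather than a real litellm call. `judge_with_format_fallback` and the `CompleteFn` signature are assumptions; in the PR the injected function would wrap `litellm.acompletion`.

```python
from typing import Any, Awaitable, Callable, Optional

# Injected completion: (model, prompt, response_format) -> (content, cost_usd).
CompleteFn = Callable[[str, str, Optional[dict[str, Any]]], Awaitable[tuple[str, float]]]


async def judge_with_format_fallback(complete: CompleteFn, model: str,
                                     prompt: str) -> tuple[str, float]:
    """Try with the JSON hint first; on empty content retry once without it."""
    content, cost = await complete(model, prompt, {"type": "json_object"})
    if content.strip():
        return content, cost
    # Some models return empty content when given the hint, so drop it once.
    retry_content, retry_cost = await complete(model, prompt, None)
    return retry_content, cost + retry_cost  # both calls count against the budget
```

Summing both costs keeps the `--max-cost-usd` accounting accurate when the first call returns nothing usable.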