feat(eval): online trace scorer — nuvel eval (v1)#35
Open
Folken2 wants to merge 10 commits into
Hierarchical, scope-aware memory layer for nuvel agents. Defines the ADK-compatible abstraction, denormalized scope_chain data model on Postgres+pgvector, write-time scope ACLs, and inheritance reads with tier-weighted ranking. Promotion governance and peer ACLs deferred to v2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec defines the online trace scorer (heuristics-first, judge-on-pass) with scored.jsonl siblings, nuvel eval CLI, drift detection, and dashboard integration. Plan sequences the work into 7 phases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plain dataclasses, no pydantic — matches the existing Run dataclass style. Flag is a class of string constants rather than StrEnum so a future scorer can write flag values this version doesn't know about without breaking deserialization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
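A minimal sketch of why string constants deserialize more tolerantly than an enum. The ScoredRun field names and the two flag values shown are assumptions for illustration; the PR's actual schema is not reproduced here.

```python
import json
from dataclasses import asdict, dataclass, field


class Flag:
    """String constants, not an enum: unknown values remain plain strings."""
    TOOL_ERROR = "tool_error"
    LLM_ERROR = "llm_error"


@dataclass
class ScoredRun:
    # Hypothetical fields; only the dataclass style mirrors the commit message.
    trace_id: str
    scorer_version: str
    score: float
    flags: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, line: str) -> "ScoredRun":
        data = json.loads(line)
        # Flags are kept as raw strings, so a value this version has never
        # heard of passes through instead of raising as Enum(value) would.
        return cls(
            trace_id=data["trace_id"],
            scorer_version=data["scorer_version"],
            score=float(data["score"]),
            flags=list(data.get("flags", [])),
        )


# A record as a hypothetical newer scorer might write it.
newer_line = json.dumps(
    {"trace_id": "t1", "scorer_version": "2", "score": 0.4,
     "flags": [Flag.TOOL_ERROR, "future_flag"]}
)
run = ScoredRun.from_json(newer_line)
```

With a StrEnum, the equivalent of `"future_flag"` would fail at construction time; here it round-trips untouched.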
Eight flags: the spec's seven plus LLM_ERROR, which the spec conflates with TOOL_ERROR despite the different remediation paths. Heuristics consume Run.events (already populated when keep_events=True), so no lazy loader is needed. Stacked penalties floor at zero. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
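The heuristic pass could look roughly like the sketch below: pure functions over a list of events, with stacked penalties floored at zero. The event shape (dicts with a "type" key) and every penalty weight are assumptions; the commit does not state them.

```python
# Hypothetical penalty weights, chosen only to make the floor visible.
PENALTIES = {
    "tool_error": 0.4,
    "llm_error": 0.4,
    "no_assistant_output": 0.5,
    "tool_loop": 0.3,
}


def detect_flags(events: list[dict]) -> list[str]:
    """Deterministic flag rules over an event list (assumed event shape)."""
    flags = []
    if any(e.get("type") == "tool_error" for e in events):
        flags.append("tool_error")
    if any(e.get("type") == "llm_error" for e in events):
        flags.append("llm_error")
    if not any(e.get("type") == "assistant_message" for e in events):
        flags.append("no_assistant_output")
    return flags


def heuristic_score(flags: list[str], base: float = 1.0) -> float:
    """Subtract each flag's penalty from the base and never go below zero."""
    total_penalty = sum(PENALTIES.get(flag, 0.0) for flag in flags)
    return max(0.0, base - total_penalty)


events = [{"type": "tool_error"}, {"type": "llm_error"}]
flags = detect_flags(events)
score = heuristic_score(flags)
```

In this example three flags fire and their penalties sum to more than the base, so the score is clamped to 0.0 rather than going negative.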
litellm.acompletion behind an injection seam so tests never make real network calls. Tolerant JSON parse (handles models that wrap output in fences/prose). One retry on parse-fail or transient error. Model resolves per spec: rubric.judge_model → EVAL_JUDGE_MODEL env → DEFAULT_FAST_MODEL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
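A sketch of the two pieces that need no network: the model priority chain and a tolerant JSON parse. The function names and the DEFAULT_FAST_MODEL value are placeholders, and the scanning strategy is one plausible way to handle fenced or prose-wrapped output, not necessarily the one used in this PR.

```python
import json
import os
import re

DEFAULT_FAST_MODEL = "example/fast-model"  # placeholder, not the real default


def resolve_judge_model(rubric_model, env=None):
    """Priority: rubric.judge_model, then EVAL_JUDGE_MODEL, then the default."""
    env = os.environ if env is None else env
    return rubric_model or env.get("EVAL_JUDGE_MODEL") or DEFAULT_FAST_MODEL


def parse_judge_json(text: str) -> dict:
    """Return the first JSON object found in text, ignoring fences and prose."""
    try:
        obj = json.loads(text)
        if isinstance(obj, dict):
            return obj
    except json.JSONDecodeError:
        pass
    decoder = json.JSONDecoder()
    # Try decoding from every "{" until one yields a complete object.
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in judge output")


fence = "`" * 3  # built at runtime to keep this example fence-safe
wrapped = f'Here is my verdict:\n{fence}json\n{{"score": 0.8, "notes": "ok"}}\n{fence}'
verdict = parse_judge_json(wrapped)
```

A caller would treat the ValueError as a parse failure and spend its single retry on it.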
ScoreSession owns idempotency (skip on trace_id + scorer_version match, unless --force), concurrency (semaphore=5), and the cost budget (judges disabled cleanly once spend crosses --max-cost-usd; heuristics keep running). One scored.jsonl per trace directory; reserved as a non-trace sibling so traces_cli.py / dashboard never re-ingest scored output as runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
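The three responsibilities described above can be sketched with asyncio as follows. SessionSketch and its fields are hypothetical stand-ins, not the real ScoreSession, and the judge is injected as a coroutine returning (verdict, cost_usd).

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SessionSketch:
    scorer_version: str
    max_cost_usd: float
    force: bool = False
    concurrency: int = 5
    spent_usd: float = 0.0
    # trace_id -> scorer_version of the last recorded score
    already_scored: dict = field(default_factory=dict)

    def should_skip(self, trace_id: str) -> bool:
        """Idempotency: same trace and same scorer version, unless forced."""
        return (not self.force
                and self.already_scored.get(trace_id) == self.scorer_version)

    def judges_enabled(self) -> bool:
        return self.spent_usd < self.max_cost_usd

    async def score_all(self, trace_ids, heuristic, judge):
        sem = asyncio.Semaphore(self.concurrency)

        async def score_one(trace_id):
            if self.should_skip(trace_id):
                return None
            async with sem:
                result = {"trace_id": trace_id,
                          "heuristic": heuristic(trace_id),
                          "judge": None}
                # Heuristics always run; only the judge is budget-gated.
                if self.judges_enabled():
                    verdict, cost_usd = await judge(trace_id)
                    self.spent_usd += cost_usd
                    result["judge"] = verdict
                self.already_scored[trace_id] = self.scorer_version
                return result

        results = await asyncio.gather(*(score_one(t) for t in trace_ids))
        return [r for r in results if r is not None]
```

Because the budget is checked before each judge call, calls already in flight can overshoot the limit slightly; that trade-off is an assumption of this sketch, not a statement about the PR.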
Four subcommands following the same register(sub) pattern as traces_cli. Drift exits non-zero (rc=2) when any agent crosses the threshold, so it can be wired into a future webhook/alert layer without further work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
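A sketch of the register(sub) pattern with a drift handler that returns 2 when any agent crosses the threshold. The --threshold flag name, its default, the sample data, and the definition of drift as a drop in mean score are all assumptions for illustration.

```python
import argparse

DRIFT_EXIT_CODE = 2

# Hypothetical per-agent mean scores standing in for aggregated scored.jsonl data.
SAMPLE_BASELINE = {"planner": 0.9, "coder": 0.8}
SAMPLE_CURRENT = {"planner": 0.6, "coder": 0.78}


def find_drifted(baseline: dict, current: dict, threshold: float) -> list[str]:
    """Agents whose mean score dropped by more than threshold."""
    return sorted(
        agent for agent, base in baseline.items()
        if agent in current and base - current[agent] > threshold
    )


def cmd_drift(args) -> int:
    drifted = find_drifted(SAMPLE_BASELINE, SAMPLE_CURRENT, args.threshold)
    for agent in drifted:
        print(f"drift: {agent}")
    return DRIFT_EXIT_CODE if drifted else 0


def register(sub) -> None:
    """Attach the drift subcommand to an existing subparsers object."""
    parser = sub.add_parser("drift", help="report agents whose scores dropped")
    parser.add_argument("--threshold", type=float, default=0.1)  # assumed flag
    parser.set_defaults(func=cmd_drift)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="nuvel")
    sub = parser.add_subparsers(dest="command", required=True)
    eval_parser = sub.add_parser("eval")
    register(eval_parser.add_subparsers(dest="eval_command", required=True))
    return parser
```

A caller would end with `sys.exit(args.func(args))`, so a shell script or alerting hook can branch on the return code alone.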
TraceLoader side-loads scored.jsonl siblings into a {trace_id: ScoredRun}
index; the home view, run_detail, and the SSE-feed endpoint join on it.
Unscored runs render as a muted em-dash so the column reads cleanly
when only some runs have been scored. Flags surface in the tooltip; a
dedicated column can come in v1.1 if we need it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
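The side-load and join might be sketched as below. Records are kept as plain dicts here rather than ScoredRun objects, and the function names and record fields are hypothetical; the tolerant, last-wins reading follows the behavior named in the writer tests.

```python
import json
from pathlib import Path

UNSCORED_PLACEHOLDER = "\u2014"  # the muted dash shown for runs with no score


def load_score_index(trace_dir: Path) -> dict:
    """Build {trace_id: record} from scored.jsonl; later lines win."""
    index: dict = {}
    path = trace_dir / "scored.jsonl"
    if not path.exists():
        return index  # nothing scored yet, so callers see an empty index
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated or partially written line
        if isinstance(record, dict) and record.get("trace_id"):
            index[record["trace_id"]] = record
    return index


def score_cell(trace_id: str, index: dict) -> str:
    """Text for the score column: a formatted score or the placeholder."""
    record = index.get(trace_id)
    if record is None:
        return UNSCORED_PLACEHOLDER
    return f"{record['score']:.2f}"
```

Keeping the join as a dictionary lookup means views that never see a scored.jsonl file need no special casing.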
Adds the eval section to env-vars and cli reference, plus a one-line mention in the README command table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two real findings from the first smoke test against meta_agent traces: 1) Kimi K2.5 (via OpenRouter) returns empty content when asked for response_format=json_object — failure rate went UP, not down, when the hint was added. Now we try with the hint first (modern models honor it) and transparently retry once without it on empty content. Both calls' costs are summed. 2) litellm.completion_cost() prints noisy "Provider List" banners to stderr when the model alias isn't in its pricing table (true for OpenRouter-prefixed Kimi and Anthropic ids). Suppress that, and log one structured WARN per unrecognized model so the silent $0.00 cost-budget bypass is visible. Verified on real traces: Kimi 5/5 ok (was 3/5), Haiku 5/5 ok first-shot with measurably sharper judge notes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
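The first finding's fallback can be sketched with the model call injected, so no network is needed. The call_model signature is an assumption: it stands in for a wrapper around litellm.acompletion that sets response_format to json_object when json_hint is true and returns (content, cost_usd).

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple

# Assumed seam: (model, prompt, json_hint) -> (content, cost_usd)
CallFn = Callable[[str, str, bool], Awaitable[Tuple[Optional[str], float]]]


async def judge_with_fallback(call_model: CallFn, model: str,
                              prompt: str) -> Tuple[str, float]:
    """Try with the JSON hint; on empty content retry once without it."""
    content, cost_usd = await call_model(model, prompt, True)
    if content and content.strip():
        return content, cost_usd
    # Some models return empty content when given the hint, so drop it.
    retry_content, retry_cost_usd = await call_model(model, prompt, False)
    # Both attempts were billed, so both count toward the budget.
    return retry_content or "", cost_usd + retry_cost_usd


async def fake_model_that_rejects_hint(model, prompt, json_hint):
    """Test double mimicking a model that returns nothing under the hint."""
    if json_hint:
        return "", 0.01
    return '{"score": 1}', 0.02


content, total_cost = asyncio.run(
    judge_with_fallback(fake_model_that_rejects_hint, "some-model", "rate this")
)
```

Models that honor the hint return on the first call, so the extra request is only paid for when the first response is empty.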
Summary
First end-to-end implementation of the eval harness per spec 2026-05-20-eval-harness-v1-design.md and plan 2026-05-20-eval-harness.md. Online trace scorer: heuristics-first, judge-on-pass, scored.jsonl siblings, full nuvel eval CLI, dashboard score column, drift detection.
- tool_error, llm_error, no_assistant_output, incomplete_trace, excessive_turns, tool_loop, cost_outlier, latency_outlier, token_bloat: deterministic, pure functions over Run.events.
- litellm.acompletion via an injection seam (tests never make real network calls). Model resolves: rubric → EVAL_JUDGE_MODEL env → DEFAULT_FAST_MODEL. Tolerant JSON parse, one retry.
- ScoreSession owns idempotency (skip on trace_id + scorer_version), concurrency (semaphore=5), and the cost budget (judges disable cleanly once spend crosses --max-cost-usd; heuristics keep running so nothing is dropped silently).
- nuvel eval score | report | worst | drift. drift exits non-zero when any agent crosses threshold, ready for a future webhook layer.
- scored.jsonl; home view shows a monochrome-friendly score pill (good/warn/bad/none) with the flag list in the tooltip.
Code preservation note
One change to existing code in nuvel/traces_cli.py: _iter_trace_files now skips reserved companion files (initially just scored.jsonl). Without this, the CLI/dashboard would re-ingest scored output as if it were trace data, a latent bug surfaced by the scorer's tests. Tested against the existing test_dashboard_watcher.py and test_dashboard_loader suites: no regressions.
Behavior changes
- nuvel eval: see docs/reference/cli.md.
- EVAL_JUDGE_MODEL: documented in docs/reference/env-vars.md.
- scored.jsonl doesn't exist.
Module layout
Test coverage
+88 new tests, all passing. Full suite: 471 passing.
- test_eval_schema.py: round-trip, constants, judge ok property
- test_eval_stats.py: percentile math, agent grouping, window cap
- test_eval_heuristics.py: every flag rule + stacked-penalty floor
- test_eval_rubric.py: default + YAML override + model priority chain + malformed YAML
- test_eval_judge.py: prompt assembly, JSON-from-prose, retry, error
- test_eval_writer.py: append/load round-trip, tolerance, last-wins
- test_eval_scorer.py: orchestration paths (heuristic-floor, judge-disabled, force, version-bump, budget exhaust, idempotent)
- test_eval_drift.py: window math + threshold edges + naive-tz handling
- test_eval_cli.py: full argparse paths via top-level nuvel.cli.build_parser
- test_dashboard_score_integration.py: loader join + view rendering + full FastAPI endpoint
What's NOT in v1 (deferred per spec)
- scorer_version bumps.
- OrgMemoryService justifies a shared DB).
Test plan
- nuvel eval --help and subcommand helps render correctly.
- EVAL_JUDGE_MODEL=openrouter/moonshotai/kimi-k2.5 nuvel eval score --max-cost-usd 0.20 against existing generated-agents/*/traces. Not run overnight: judge calls cost money and require OPENROUTER_API_KEY.
- nuvel dashboard with at least one scored run: confirm Score pill renders and tooltip lists flags.
🤖 Generated with Claude Code