
feat(trend): multi-run score trend analyzer — detect behavioral drift across sequential eval runs #913

@nanookclaw

Description


Summary

AgentV's compare command does excellent pairwise A/B comparison between two runs. What's missing: detecting whether scores are trending up or down across N sequential runs.

A 10-run regression where each run scores slightly lower than the previous is completely invisible to pairwise comparison — but it's exactly the failure mode that matters in production agent deployment.

Concrete scenario

You're running weekly eval sweeps on evals/code-review.yaml against claude-sonnet. Each week's results live in .agentv/results/runs/. After 8 weeks:

week-1: 0.92  week-2: 0.91  week-3: 0.89  week-4: 0.88
week-5: 0.86  week-6: 0.85  week-7: 0.83  week-8: 0.82

agentv compare week-7/index.jsonl week-8/index.jsonl → tiny delta, within threshold → no alert.

There's no command that says: "over these 8 runs, slope = -0.014/run, direction: DEGRADING, regression detected."

Proposed: agentv trend command

```
agentv trend --last 10                     # most recent N runs
agentv trend --last 10 --eval code-review  # filter by eval
agentv trend run-01/ run-02/ run-03/       # explicit paths
```

Example output:

```
Trend Analysis — 8 runs (2026-03-01 → 2026-03-22)
Eval: code-review | Target: claude-sonnet

Slope:     -0.014/run
Direction: DEGRADING ⚠️
Verdict:   REGRESSION DETECTED (threshold: ±0.01/run)
```

Implementation sketch

The data is all there — ResultManifestRecord.score, .timestamp, .eval_id, .target are already in the JSONL manifest files. A TrendAnalyzer would:

  1. Load N run directories sorted by timestamp
  2. Compute mean score per run (already done in results summary)
  3. OLS via statistics.linear_regression (Python 3.10+ stdlib, zero new deps)
  4. Return { slope, direction, anyRegression } + rich table output

CI gate integration:

```
agentv trend --last 10 --threshold 0.01 || exit 1
```

This is the proactive version of compare's reactive A/B check. Pairwise comparison tells you "did this run get worse than last run?" Trend analysis tells you "has this agent been getting progressively worse for 10 runs?"

Related

This pattern applies directly to AgentV Studio's quality gates (#788) — a trend gate is more robust than a single-run threshold gate.

This cross-session behavioral-drift gap is documented across 65+ independent agent tools in PDR in Production v2.6.

Labels: core, enhancement
