
feat(trend): multi-run score trend analyzer — detect behavioral drift across sequential eval runs #913

@nanookclaw

Description


Summary

AgentV's compare command does excellent pairwise A/B comparison between two runs. What's missing: detecting whether scores are trending up or down across N sequential runs.

A 10-run regression where each run scores slightly lower than the previous is completely invisible to pairwise comparison — but it's exactly the failure mode that matters in production agent deployment.

Concrete scenario

You're running weekly eval sweeps on evals/code-review.yaml against claude-sonnet. Each week's results live in .agentv/results/runs/. After 8 weeks:

week-1: 0.92  week-2: 0.91  week-3: 0.89  week-4: 0.88
week-5: 0.86  week-6: 0.85  week-7: 0.83  week-8: 0.82

agentv compare week-7/index.jsonl week-8/index.jsonl → tiny delta, within threshold → no alert.

There's no command that says: "over these 8 runs, slope = -0.014/run, direction: DEGRADING, regression detected."

Proposed: agentv trend command

```
agentv trend --last 10                     # most recent N runs
agentv trend --last 10 --eval code-review  # filter by eval
agentv trend run-01/ run-02/ run-03/       # explicit paths
```

Example output:

```
Trend Analysis — 8 runs (2026-03-01 → 2026-03-22)
Eval: code-review | Target: claude-sonnet

Slope:     -0.014/run
Direction: DEGRADING ⚠️
Verdict:   REGRESSION DETECTED (threshold: ±0.01/run)
```

Implementation sketch

The data is all there — ResultManifestRecord.score, .timestamp, .eval_id, .target are already in the JSONL manifest files. A TrendAnalyzer would:

  1. Load N run directories sorted by timestamp
  2. Compute mean score per run (already done in results summary)
  3. OLS via statistics.linear_regression (Python 3.10+ stdlib, zero new deps)
  4. Return { slope, direction, anyRegression } + rich table output

CI gate integration:

```
agentv trend --last 10 --threshold 0.01 || exit 1
```

This is the proactive version of compare's reactive A/B check. Pairwise comparison tells you "did this run get worse than last run?" Trend analysis tells you "has this agent been getting progressively worse for 10 runs?"

Related

This pattern applies directly to AgentV Studio's quality gates (#788) — a trend gate is more robust than a single-run threshold gate.

This cross-session behavioral-drift gap is documented across 65+ independent agent tools in PDR in Production v2.6.

Labels: core, enhancement
