feat(bench): implement baseline comparison (--baseline flag) #2834

## Description
Run the same benchmark twice, once with `[memory] enabled=true` and once with `[memory] enabled=false`, and produce a delta comparison that quantifies the value of Zeph's memory.

Part of epic #2827. See spec: `.local/specs/zeph-bench/spec.md` FR-006, US-004.

## Scope
- `--baseline` flag on `zeph bench run`
- Runs the full scenario set twice: first pass with memory enabled, second pass with memory disabled (config override)
- Writes `baseline/memory-on/` and `baseline/memory-off/` result directories
- Top-level `summary.md` includes a delta table: per-scenario delta score, aggregate delta, interpretation note
- `BaselineComparison` struct serialized to top-level `comparison.json`
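
The two-pass layout in the scope above could be sketched roughly as follows. This is only an illustration: the `Pass` enum, its methods, and the `root` parameter are hypothetical names, not existing Zeph APIs. Only the directory names and the pass order come from this issue.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical marker for which of the two baseline passes is running.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Pass {
    MemoryOn,
    MemoryOff,
}

impl Pass {
    /// Pass order from the scope: memory enabled first, then disabled.
    const ORDER: [Pass; 2] = [Pass::MemoryOn, Pass::MemoryOff];

    /// Value to force for `[memory] enabled` during this pass
    /// (how the override is applied is left to the config layer).
    fn memory_enabled(self) -> bool {
        matches!(self, Pass::MemoryOn)
    }

    /// Result directory for this pass, relative to the run's output root.
    fn result_dir(self, root: &Path) -> PathBuf {
        let leaf = match self {
            Pass::MemoryOn => "memory-on",
            Pass::MemoryOff => "memory-off",
        };
        root.join("baseline").join(leaf)
    }
}
```

A runner would iterate over `Pass::ORDER`, apply `memory_enabled()` as the config override, and write that pass's results under `result_dir(root)`.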

## Acceptance Criteria
- [ ] Both memory-on and memory-off result files written to correct subdirectories
- [ ] Delta table present in top-level `summary.md`
- [ ] Aggregate delta = mean(memory-on scores) - mean(memory-off scores)
- [ ] Each pass uses the same isolation reset between scenarios
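
To make the aggregate-delta criterion concrete, here is a minimal sketch of the computation. The field names on `BaselineComparison`, the `ScenarioDelta` type, and the `compare` function are assumptions made for illustration; the spec may define them differently. Serialization to `comparison.json` is omitted to keep the sketch dependency-free.

```rust
/// Hypothetical row of the delta table for one scenario.
#[derive(Debug, Clone, PartialEq)]
struct ScenarioDelta {
    scenario: String,
    memory_on: f64,
    memory_off: f64,
    /// memory_on - memory_off
    delta: f64,
}

/// Hypothetical shape of the comparison; the real fields are not given in this issue.
#[derive(Debug, Clone, PartialEq)]
struct BaselineComparison {
    scenarios: Vec<ScenarioDelta>,
    /// mean(memory-on scores) - mean(memory-off scores)
    aggregate_delta: f64,
}

/// Arithmetic mean; callers must pass a non-empty slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Builds the comparison from two passes given as (scenario name, score) pairs.
/// Returns `None` if the passes are empty or do not cover the same scenarios
/// in the same order.
fn compare(on: &[(&str, f64)], off: &[(&str, f64)]) -> Option<BaselineComparison> {
    if on.is_empty() || on.len() != off.len() {
        return None;
    }
    let mut scenarios = Vec::with_capacity(on.len());
    for ((name_on, score_on), (name_off, score_off)) in on.iter().zip(off) {
        if name_on != name_off {
            return None;
        }
        scenarios.push(ScenarioDelta {
            scenario: name_on.to_string(),
            memory_on: *score_on,
            memory_off: *score_off,
            delta: score_on - score_off,
        });
    }
    let on_scores: Vec<f64> = on.iter().map(|(_, s)| *s).collect();
    let off_scores: Vec<f64> = off.iter().map(|(_, s)| *s).collect();
    Some(BaselineComparison {
        scenarios,
        aggregate_delta: mean(&on_scores) - mean(&off_scores),
    })
}
```

Rejecting mismatched scenario lists is a design choice of this sketch: it guarantees that both means are taken over the same scenarios, so the aggregate delta also equals the mean of the per-scenario deltas.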
