Commit dd77643

samrusani and Sami Rusani authored
P12-S4: ship public eval harness (#155)
Co-authored-by: Sami Rusani <sr@samirusani>
1 parent 6d10d1b commit dd77643

24 files changed

Lines changed: 3480 additions & 89 deletions

.ai/handoff/CURRENT_STATE.md

Lines changed: 8 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -8,7 +8,8 @@
88
- `v0.2.0` is released.
99
- Phase 12 Sprint 1 (`P12-S1`) is shipped.
1010
- Phase 12 Sprint 2 (`P12-S2`) is shipped.
11-
- Phase 12 Sprint 3 (`P12-S3`) is the active execution sprint.
11+
- Phase 12 Sprint 3 (`P12-S3`) is shipped.
12+
- Phase 12 Sprint 4 (`P12-S4`) is the active execution sprint.
1213

1314
## Current Baseline Truth
1415
- Alice has typed memory, provenance, trust classes, correction/supersession behavior, open loops, recall, resumption, and explainability.
@@ -17,17 +18,17 @@
1718
- The codebase also includes the shipped `P12-S2` memory mutation candidate and operation foundation.
1819

1920
## Not Yet First-Class In Repo
20-
- public multi-suite eval harness for recall/resumption/correction/contradiction/open-loops
2121
- task-adaptive brief compiler separated from current briefing surfaces
2222

2323
## Phase Transition Note
2424
- Phase 12 is active.
2525
- `P12-S1` is complete and establishes the retrieval baseline.
2626
- `P12-S2` is complete and establishes the mutation baseline.
27-
- `P12-S3` is the active sprint and should build contradiction and trust handling on top of shipped retrieval and mutation behavior.
28-
- The current `P12-S3` branch implements contradiction cases and trust-signal storage, pending Control Tower merge approval.
27+
- `P12-S3` is complete and establishes the contradiction/trust baseline.
28+
- `P12-S4` is the active sprint and should benchmark shipped retrieval, mutation, and contradiction behavior without reopening those systems.
29+
- The current `P12-S4` branch implements the public eval harness, fixture catalog, and checked-in baseline artifact, pending Control Tower merge approval.
2930

3031
## Immediate Control Tower Decisions Needed
31-
- Decide contradiction object attachment scope: continuity objects, memories, or both.
32-
- Decide trust-signal storage and ranking integration policy.
33-
- Decide the final contradiction and trust API surface shape for Phase 12.
32+
- Decide public eval suite taxonomy and baseline artifact format.
33+
- Decide what eval artifacts are committed versus generated locally.
34+
- Decide whether `P12-S4` stays CLI-first or keeps the current branch `/v1/evals/*` API surface.
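
Review note on the baseline artifact format decision: the build report in this commit describes "canonical report generation with stable digests". The sketch below shows one common way to get a digest that is stable across runs. It is not taken from `alicebot_api.public_evals`; the function names and the `generated_at` / `eval_run_id` field names are assumptions made for illustration.

```python
import hashlib
import json


def canonical_json(report: dict) -> str:
    """Serialize with sorted keys and fixed separators so equal reports yield equal text."""
    return json.dumps(report, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


def report_digest(report: dict) -> str:
    """SHA-256 hex digest over the canonical JSON text."""
    return hashlib.sha256(canonical_json(report).encode("utf-8")).hexdigest()


def with_digest(report: dict, volatile_keys=("generated_at", "eval_run_id")) -> dict:
    """Attach a digest computed over the report minus run-specific fields.

    The volatile key names are hypothetical; the real artifact may use others.
    """
    stable = {k: v for k, v in report.items() if k not in volatile_keys}
    return {**report, "digest": report_digest(stable)}
```

Excluding run-specific fields before hashing is what would let a locally regenerated report be compared against a committed baseline, whichever artifact format Control Tower settles on.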

ARCHITECTURE.md

Lines changed: 7 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
## Scope Boundary
44
- **Shipped baseline:** Phases 9-11 and Bridge `B1` through `B4`.
5-
- **Current repo execution posture:** `v0.2.0` is released; `P12-S1` and `P12-S2` are shipped; `P12-S3` is the active sprint.
5+
- **Current repo execution posture:** `v0.2.0` is released; `P12-S1`, `P12-S2`, and `P12-S3` are shipped; `P12-S4` is the active sprint.
66
- **Phase 12 delta:** retrieval quality, mutation explicitness, contradiction handling, public evals, and adaptive briefing.
77

88
## Current System Overview
@@ -113,13 +113,13 @@ Delivered additions:
113113
Important baseline note: `P12-S2` is now the mutation baseline for the rest of Phase 12 and should not be reopened except where later sprint integration requires it.
114114

115115
### P12-S3: Contradiction Detection + Trust Calibration
116-
Add first-class conflict records and auditable trust adjustments.
116+
Shipped in `P12-S3`:
117117

118-
Planned additions:
118+
Delivered additions:
119119
- `contradiction_cases`
120120
- `trust_signals`
121121

122-
Important baseline note: `P12-S3` should layer on top of shipped retrieval traces and shipped mutation operations rather than redesigning either subsystem.
122+
Important baseline note: `P12-S3` is now the contradiction/trust baseline for the rest of Phase 12 and should not be reopened except where later sprint integration requires it.
123123

124124
### P12-S4: Public Eval Harness
125125
Expand the current retrieval evaluation foundation into public multi-suite benchmark runs and checked-in baseline reports.
@@ -130,6 +130,9 @@ Planned additions:
130130
- `eval_runs`
131131
- `eval_results`
132132

133+
Important baseline note: `P12-S4` should measure shipped retrieval, mutation, and contradiction behavior rather than redesign those systems.
134+
Source-of-truth note: the checked-in fixture catalog defines the authoritative suite/case set and ordering; persisted eval suite/case rows are synchronized snapshots for execution and audit, not an independent planning surface.
135+
133136
### P12-S5: Task-Adaptive Briefing
134137
Separate durable memory from output-specific briefing layers.
135138

BUILD_REPORT.md

Lines changed: 48 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -2,72 +2,81 @@
22

33
## Sprint Objective
44

5-
Implement `P12-S3` contradiction detection and trust calibration so conflicting continuity state becomes reviewable, auditable, and visible in retrieval and explain flows.
5+
Implement `P12-S4` public eval harness so Alice can run reproducible local eval suites, persist suite/case/run/result records, emit stable baseline report artifacts, and document what the measured quality surface means.
66

77
## Completed Work
88

9-
- Added contradiction and trust persistence with `contradiction_cases` and `trust_signals`.
10-
- Added contradiction detection for direct fact, preference, temporal, and source-hierarchy conflicts.
11-
- Added contradiction syncing on continuity create, review, explain, and recall paths.
12-
- Added contradiction-aware retrieval penalties and exposed contradiction counts and penalty scores in recall ordering metadata.
13-
- Added contradiction visibility and active trust-signal counts in continuity explain output.
14-
- Added current-branch contradiction case inspection and resolution flows in API, CLI, and MCP.
15-
- Added current-branch trust signal inspection in API, CLI, and MCP.
16-
- Added focused sprint documentation in `docs/memory/p12-s3-contradictions-trust-calibration.md`, explicitly framed as branch behavior where Control Tower decisions are still pending.
17-
- Added sprint-owned unit and integration coverage for detection, trust persistence, retrieval penalty behavior, explain visibility, CLI smoke, MCP smoke, and migration shape.
9+
- Added public eval persistence tables for `eval_suites`, `eval_cases`, `eval_runs`, and `eval_results`.
10+
- Added `alicebot_api.public_evals` with:
11+
- fixture-catalog loading
12+
- suite/case syncing into the database
13+
- fixture-backed recall, resumption, correction, contradiction, and open-loop evaluators
14+
- canonical report generation with stable digests
15+
- report writing helper for checked-in baseline artifacts
16+
- Added current-branch public eval API surfaces:
17+
- `GET /v1/evals/suites`
18+
- `POST /v1/evals/runs`
19+
- `GET /v1/evals/runs`
20+
- `GET /v1/evals/runs/{eval_run_id}`
21+
- Made the checked-in fixture catalog authoritative for suite listing and run selection.
22+
- Added pruning for persisted suite/case rows so removed catalog entries do not survive as stale runtime state.
23+
- Added explicit validation for unknown `suite_key` filters instead of silently returning partial or empty runs.
24+
- Added CLI surfaces:
25+
- `alicebot evals suites`
26+
- `alicebot evals run`
27+
- `alicebot evals runs`
28+
- `alicebot evals show`
29+
- Added public fixture definitions in `eval/fixtures/public_eval_suites.json`.
30+
- Added checked-in current-branch baseline report artifact in `eval/baselines/public_eval_harness_v1.json`, with final committed artifact format still pending Control Tower confirmation.
31+
- Added sprint-owned docs in `docs/evals/public_eval_harness.md`, explicitly framed as current branch behavior where API and artifact decisions are still pending.
32+
- Added focused unit and integration coverage for the runner, migration, API, CLI, and baseline reproduction path.
1833

1934
## Incomplete Work
2035

21-
- None within the sprint packet scope.
36+
- None inside the sprint packet scope.
2237

2338
## Files Changed
2439

25-
- `.ai/handoff/CURRENT_STATE.md`
26-
- `ARCHITECTURE.md`
2740
- `BUILD_REPORT.md`
41+
- `RULES.md`
42+
- `ARCHITECTURE.md`
2843
- `CURRENT_STATE.md`
44+
- `.ai/handoff/CURRENT_STATE.md`
2945
- `PRODUCT_BRIEF.md`
30-
- `REVIEW_REPORT.md`
3146
- `ROADMAP.md`
32-
- `apps/api/alembic/versions/20260414_0059_phase12_contradictions_trust_calibration.py`
33-
- `apps/api/src/alicebot_api/continuity_contradictions.py`
34-
- `apps/api/src/alicebot_api/continuity_trust.py`
35-
- `apps/api/src/alicebot_api/store.py`
47+
- `REVIEW_REPORT.md`
48+
- `apps/api/alembic/versions/20260414_0060_phase12_public_eval_harness.py`
49+
- `apps/api/src/alicebot_api/cli.py`
3650
- `apps/api/src/alicebot_api/contracts.py`
37-
- `apps/api/src/alicebot_api/continuity_recall.py`
38-
- `apps/api/src/alicebot_api/continuity_explainability.py`
39-
- `apps/api/src/alicebot_api/continuity_evidence.py`
40-
- `apps/api/src/alicebot_api/continuity_objects.py`
41-
- `apps/api/src/alicebot_api/continuity_review.py`
4251
- `apps/api/src/alicebot_api/main.py`
43-
- `apps/api/src/alicebot_api/cli.py`
44-
- `apps/api/src/alicebot_api/cli_formatting.py`
45-
- `apps/api/src/alicebot_api/mcp_tools.py`
46-
- `tests/unit/test_20260414_0059_phase12_contradictions_trust_calibration.py`
47-
- `tests/unit/test_continuity_contradictions.py`
52+
- `apps/api/src/alicebot_api/public_evals.py`
53+
- `apps/api/src/alicebot_api/store.py`
54+
- `scripts/check_control_doc_truth.py`
55+
- `docs/evals/public_eval_harness.md`
56+
- `eval/baselines/public_eval_harness_v1.json`
57+
- `eval/fixtures/public_eval_suites.json`
58+
- `tests/integration/test_cli_integration.py`
59+
- `tests/integration/test_public_evals_api.py`
60+
- `tests/unit/test_20260414_0060_phase12_public_eval_harness.py`
4861
- `tests/unit/test_cli.py`
49-
- `tests/unit/test_mcp.py`
5062
- `tests/unit/test_main.py`
51-
- `tests/integration/test_contradictions_api.py`
52-
- `tests/integration/test_cli_integration.py`
53-
- `tests/integration/test_mcp_cli_parity.py`
54-
- `docs/memory/p12-s3-contradictions-trust-calibration.md`
55-
- `scripts/check_control_doc_truth.py`
63+
- `tests/unit/test_public_evals.py`
5664

5765
## Tests Run
5866

59-
- `./.venv/bin/pytest tests/unit/test_continuity_contradictions.py tests/unit/test_20260414_0059_phase12_contradictions_trust_calibration.py tests/unit/test_continuity_recall.py tests/unit/test_continuity_review.py tests/unit/test_cli.py tests/unit/test_mcp.py tests/unit/test_main.py tests/integration/test_contradictions_api.py tests/integration/test_cli_integration.py tests/integration/test_mcp_cli_parity.py -q`
60-
- Result: PASS (`104 passed`)
67+
- `./.venv/bin/pytest tests/unit/test_public_evals.py tests/unit/test_20260414_0060_phase12_public_eval_harness.py tests/unit/test_cli.py tests/unit/test_main.py tests/integration/test_public_evals_api.py tests/integration/test_cli_integration.py tests/integration/test_retrieval_evaluation_api.py -q`
68+
- Result: PASS (`83 passed`)
6169
- `./.venv/bin/python scripts/check_control_doc_truth.py`
6270
- Result: PASS
63-
- `rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/memory`
71+
- `rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/evals eval/fixtures eval/baselines`
6472
- Result: PASS (no matches)
6573

6674
## Blockers/Issues
6775

6876
- No sprint blocker remains.
69-
- Final product policy is still pending for the Control Tower decisions called out in the sprint packet, including contradiction attachment scope, long-term API shape, and the durable trust-signal policy boundary.
77+
- The recall suite keeps one non-gating coverage snapshot for entity-edge expansion. It records the current shipped output with `score=0.0` while the suite still passes because the catalog marks that case as observational rather than a strict gate.
78+
- Final product policy is still pending for the Control Tower decisions called out in the sprint packet, including the committed artifact format and whether `/v1/evals/*` remains part of the accepted Phase 12 surface.
7079

7180
## Recommended Next Step
7281

73-
Request Control Tower merge review against the current `P12-S3` branch head.
82+
Request Control Tower merge review against the current `P12-S4` branch head.
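
Review note on the catalog behavior listed under Completed Work: the report says the fixture catalog is authoritative, that removed entries are pruned from persisted rows, and that unknown `suite_key` filters fail explicitly. A minimal sketch of that behavior follows, assuming an in-memory stand-in for the database. `InMemoryStore`, `sync_catalog`, and `select_suites` are hypothetical names, not the functions in `public_evals.py`.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the persisted suite/case rows."""
    suites: dict = field(default_factory=dict)  # suite_key -> list of case keys


def sync_catalog(catalog: dict, store: InMemoryStore) -> list:
    """Upsert every catalog suite, then prune rows the catalog no longer defines.

    Returns the pruned suite keys so the caller can log or audit them.
    """
    for suite_key, case_keys in catalog.items():
        store.suites[suite_key] = list(case_keys)
    pruned = sorted(key for key in store.suites if key not in catalog)
    for key in pruned:
        del store.suites[key]
    return pruned


def select_suites(catalog: dict, requested=None) -> list:
    """Return suite keys in catalog order, rejecting unknown filters outright."""
    if requested is None:
        return list(catalog)
    unknown = [key for key in requested if key not in catalog]
    if unknown:
        raise ValueError(f"unknown suite_key(s): {', '.join(unknown)}")
    wanted = set(requested)
    return [key for key in catalog if key in wanted]
```

Selecting in catalog order rather than request order mirrors the statement in `ARCHITECTURE.md` that the catalog defines both the case set and its ordering.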

CURRENT_STATE.md

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,8 @@ Canonical handoff state lives at [.ai/handoff/CURRENT_STATE.md](.ai/handoff/CURR
1111
- `v0.2.0` is released.
1212
- Phase 12 Sprint 1 (`P12-S1`) is shipped.
1313
- Phase 12 Sprint 2 (`P12-S2`) is shipped.
14-
- Phase 12 Sprint 3 (`P12-S3`) is the active execution sprint.
14+
- Phase 12 Sprint 3 (`P12-S3`) is shipped.
15+
- Phase 12 Sprint 4 (`P12-S4`) is the active execution sprint.
1516

1617
## Current Baseline Truth
1718
- Alice has typed memory, provenance, trust classes, correction/supersession behavior, open loops, recall, resumption, and explainability.
@@ -20,17 +21,17 @@ Canonical handoff state lives at [.ai/handoff/CURRENT_STATE.md](.ai/handoff/CURR
2021
- The codebase also includes the shipped `P12-S2` memory mutation candidate and operation foundation.
2122

2223
## Not Yet First-Class In Repo
23-
- public multi-suite eval harness for recall/resumption/correction/contradiction/open-loops
2424
- task-adaptive brief compiler separated from current briefing surfaces
2525

2626
## Phase Transition Note
2727
- Phase 12 is active.
2828
- `P12-S1` is complete and establishes the retrieval baseline.
2929
- `P12-S2` is complete and establishes the mutation baseline.
30-
- `P12-S3` is the active sprint and should build contradiction and trust handling on top of shipped retrieval and mutation behavior.
31-
- The current `P12-S3` branch implements contradiction cases and trust-signal storage, pending Control Tower merge approval.
30+
- `P12-S3` is complete and establishes the contradiction/trust baseline.
31+
- `P12-S4` is the active sprint and should benchmark shipped retrieval, mutation, and contradiction behavior without reopening those systems.
32+
- The current `P12-S4` branch implements the public eval harness, fixture catalog, and checked-in baseline artifact, pending Control Tower merge approval.
3233

3334
## Immediate Control Tower Decisions Needed
34-
- Decide contradiction object attachment scope: continuity objects, memories, or both.
35-
- Decide trust-signal storage and ranking integration policy.
36-
- Decide the final contradiction and trust API surface shape for Phase 12.
35+
- Decide public eval suite taxonomy and baseline artifact format.
36+
- Decide what eval artifacts are committed versus generated locally.
37+
- Decide whether `P12-S4` stays CLI-first or keeps the current branch `/v1/evals/*` API surface.

PRODUCT_BRIEF.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,8 @@ Alice is a pre-1.0 continuity platform for AI agents and agent-assisted workflow
1515
- Phase 12 is active.
1616
- `P12-S1` Hybrid Retrieval + Reranking is shipped.
1717
- `P12-S2` Automated Memory Operations is shipped.
18-
- `P12-S3` Contradiction Detection + Trust Calibration is the active sprint.
18+
- `P12-S3` Contradiction Detection + Trust Calibration is shipped.
19+
- `P12-S4` Public Eval Harness is the active sprint.
1920

2021
## Next Phase
2122
### Phase 12: Retrieval Quality + Adaptive Continuity

REVIEW_REPORT.md

Lines changed: 32 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -1,50 +1,58 @@
1-
# REVIEW REPORT
1+
# REVIEW_REPORT
22

33
## verdict
4+
45
PASS
56

67
## criteria met
7-
- Obvious contradictory facts are flagged automatically and persisted as contradiction cases.
8-
- Contradiction status appears in explain output, including open/resolved counts and penalty score.
9-
- Unresolved contradictions reduce retrieval confidence/rank through contradiction penalty integration.
10-
- Trust changes are stored and inspectable through API, CLI, and MCP trust-signal surfaces.
11-
- Contradiction review actions are auditable with stored resolution action, note, and timestamps.
12-
- `P12-S3` layers onto shipped retrieval and correction/mutation behavior without reopening those systems.
13-
- No local workstation paths, usernames, or machine-specific identifiers were found in the reviewed changed files or sprint docs.
8+
9+
- The sprint now runs the public eval harness end to end locally through both CLI and API surfaces.
10+
- The checked-in fixture catalog is authoritative for suite listing and run selection.
11+
- Eval runs now prune removed suite/case definitions from persisted sync state, so the effective run set stays aligned with the checked-in catalog.
12+
- Unknown `suite_key` filters now fail fast instead of silently producing partial or empty runs.
13+
- The repo includes the checked-in baseline report artifact and sprint-owned eval docs.
14+
- The sprint still measures shipped retrieval, mutation, and contradiction behavior without reopening those systems.
15+
- Verification passed:
16+
- `./.venv/bin/pytest tests/unit/test_public_evals.py tests/unit/test_20260414_0060_phase12_public_eval_harness.py tests/unit/test_cli.py tests/unit/test_main.py tests/integration/test_public_evals_api.py tests/integration/test_cli_integration.py tests/integration/test_retrieval_evaluation_api.py -q`
17+
- `./.venv/bin/python scripts/check_control_doc_truth.py`
18+
- I re-checked the changed files and did not find local workstation paths, usernames, or similar machine-specific identifiers in the sprint-owned docs or artifacts.
1419

1520
## criteria missed
21+
1622
- None.
1723

1824
## quality issues
19-
- Minor scope spill remains in control-doc churn outside the sprint-owned runtime surface, but it is not causing behavioral or acceptance risk.
25+
26+
- No blocking quality issue remains in the sprint scope.
27+
- The recall suite still carries one observational snapshot case for entity-edge expansion with `score=0.0` while the suite passes by design. That is documented and not a regression, but it remains a product decision for a later sprint.
2028

2129
## regression risks
22-
- Low after the follow-up fix.
23-
- The main previously identified risks are now covered by regression tests:
24-
- superseded history no longer reopens active contradictions
25-
- naive temporal ISO values are normalized before overlap detection
30+
31+
- Low. The main prior risk around stale catalog state persisting across runs is addressed by pruning and by reading suite definitions directly from the checked-in fixture catalog for listing.
2632

2733
## docs issues
34+
2835
- None blocking.
29-
- Sprint docs now clarify that contradiction detection only uses live continuity objects (`active` and `stale`) and normalizes temporal bounds to UTC.
30-
- Sprint docs frame contradiction attachment, trust-ledger durability, and API surface choices as current branch behavior where Control Tower decisions are still pending, rather than as permanently settled product policy.
31-
- The local-path scrub command excludes `BUILD_REPORT.md` and `REVIEW_REPORT.md` because the reports now carry the literal search pattern as part of their recorded verification steps.
36+
- `BUILD_REPORT.md` now reflects the control-doc updates and the follow-up verification run.
37+
- Sprint docs now frame `/v1/evals/*` and the checked-in JSON baseline report as current branch behavior where the final Control Tower contract is still pending, rather than silently treating those choices as permanently settled product policy.
3238

3339
## should anything be added to RULES.md?
34-
- No required update.
40+
41+
- Already addressed in this branch. `RULES.md` now states that canonical fixture catalogs must either be pruned into runtime state or used directly as the source of truth.
3542

3643
## should anything update ARCHITECTURE.md?
37-
- No required update for sprint acceptance.
38-
- When `P12-S3` becomes shipped baseline truth, the data-model summary should explicitly include `contradiction_cases` and `trust_signals`.
44+
45+
- Already addressed in this branch. `ARCHITECTURE.md` now states that the checked-in fixture catalog is authoritative and persisted suite/case rows are synchronized snapshots.
3946

4047
## recommended next action
41-
- Proceed with merge review for `P12-S3`.
42-
- Carry the new superseded-history and naive-temporal contradiction cases forward in future regression suites.
48+
49+
- Pass the sprint to merge review.
4350

4451
## reviewer verification
45-
- `./.venv/bin/pytest tests/unit/test_continuity_contradictions.py tests/unit/test_20260414_0059_phase12_contradictions_trust_calibration.py tests/unit/test_continuity_recall.py tests/unit/test_continuity_review.py tests/unit/test_cli.py tests/unit/test_mcp.py tests/unit/test_main.py tests/integration/test_contradictions_api.py tests/integration/test_cli_integration.py tests/integration/test_mcp_cli_parity.py -q`
46-
- Result: PASS (`104 passed`)
52+
53+
- `./.venv/bin/pytest tests/unit/test_public_evals.py tests/unit/test_20260414_0060_phase12_public_eval_harness.py tests/unit/test_cli.py tests/unit/test_main.py tests/integration/test_public_evals_api.py tests/integration/test_cli_integration.py tests/integration/test_retrieval_evaluation_api.py -q`
54+
- Result: PASS (`83 passed`)
4755
- `./.venv/bin/python scripts/check_control_doc_truth.py`
4856
- Result: PASS
49-
- `rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/memory`
57+
- `rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/evals eval/fixtures eval/baselines`
5058
- Result: PASS (no matches)
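
Review note on the observational recall case: both reports say the suite passes while one entity-edge case records `score=0.0`, because the catalog marks that case as observational rather than a strict gate. The sketch below illustrates that rule under assumed names. The `gating` flag and the `1.0` pass threshold are assumptions, not the catalog's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_key: str
    score: float
    gating: bool = True  # False marks an observational snapshot (assumed field name)


def suite_passed(results, threshold: float = 1.0) -> bool:
    """A suite passes when every gating case meets the threshold.

    Observational cases are recorded but never affect the verdict.
    """
    return all(r.score >= threshold for r in results if r.gating)


def observational_misses(results, threshold: float = 1.0) -> list:
    """List observational cases below threshold so they stay visible in reports."""
    return [r.case_key for r in results if not r.gating and r.score < threshold]
```

Reporting observational misses separately keeps the `score=0.0` snapshot visible for the later product decision without turning it into a regression signal.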
