samrusani
diff --git a/.ai/handoff/CURRENT_STATE.md b/.ai/handoff/CURRENT_STATE.md
Lines changed: 8 additions & 7 deletions
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
Lines changed: 7 additions & 4 deletions
diff --git a/BUILD_REPORT.md b/BUILD_REPORT.md
Lines changed: 48 additions & 39 deletions
diff --git a/CURRENT_STATE.md b/CURRENT_STATE.md
Lines changed: 8 additions & 7 deletions
diff --git a/PRODUCT_BRIEF.md b/PRODUCT_BRIEF.md
Lines changed: 2 additions & 1 deletion
diff --git a/REVIEW_REPORT.md b/REVIEW_REPORT.md
Lines changed: 32 additions & 24 deletions
@@ -8,7 +8,8 @@
 - `v0.2.0` is released.
 - Phase 12 Sprint 1 (`P12-S1`) is shipped.
 - Phase 12 Sprint 2 (`P12-S2`) is shipped.
-- Phase 12 Sprint 3 (`P12-S3`) is the active execution sprint.
+- Phase 12 Sprint 3 (`P12-S3`) is shipped.
+- Phase 12 Sprint 4 (`P12-S4`) is the active execution sprint.
 
 ## Current Baseline Truth
 - Alice has typed memory, provenance, trust classes, correction/supersession behavior, open loops, recall, resumption, and explainability.
@@ -17,17 +18,17 @@
 - The codebase also includes the shipped `P12-S2` memory mutation candidate and operation foundation.
 
 ## Not Yet First-Class In Repo
-- public multi-suite eval harness for recall/resumption/correction/contradiction/open-loops
 - task-adaptive brief compiler separated from current briefing surfaces
 
 ## Phase Transition Note
 - Phase 12 is active.
 - `P12-S1` is complete and establishes the retrieval baseline.
 - `P12-S2` is complete and establishes the mutation baseline.
-- `P12-S3` is the active sprint and should build contradiction and trust handling on top of shipped retrieval and mutation behavior.
-- The current `P12-S3` branch implements contradiction cases and trust-signal storage, pending Control Tower merge approval.
+- `P12-S3` is complete and establishes the contradiction/trust baseline.
+- `P12-S4` is the active sprint and should benchmark shipped retrieval, mutation, and contradiction behavior without reopening those systems.
+- The current `P12-S4` branch implements the public eval harness, fixture catalog, and checked-in baseline artifact, pending Control Tower merge approval.
 
 ## Immediate Control Tower Decisions Needed
-- Decide contradiction object attachment scope: continuity objects, memories, or both.
-- Decide trust-signal storage and ranking integration policy.
-- Decide the final contradiction and trust API surface shape for Phase 12.
+- Decide public eval suite taxonomy and baseline artifact format.
+- Decide what eval artifacts are committed versus generated locally.
+- Decide whether `P12-S4` stays CLI-first or keeps the current branch `/v1/evals/*` API surface.
@@ -2,7 +2,7 @@
 
 ## Scope Boundary
 - **Shipped baseline:** Phases 9-11 and Bridge `B1` through `B4`.
-- **Current repo execution posture:** `v0.2.0` is released; `P12-S1` and `P12-S2` are shipped; `P12-S3` is the active sprint.
+- **Current repo execution posture:** `v0.2.0` is released; `P12-S1`, `P12-S2`, and `P12-S3` are shipped; `P12-S4` is the active sprint.
 - **Phase 12 delta:** retrieval quality, mutation explicitness, contradiction handling, public evals, and adaptive briefing.
 
 ## Current System Overview
@@ -113,13 +113,13 @@ Delivered additions:
 Important baseline note: `P12-S2` is now the mutation baseline for the rest of Phase 12 and should not be reopened except where later sprint integration requires it.
 
 ### P12-S3: Contradiction Detection + Trust Calibration
-Add first-class conflict records and auditable trust adjustments.
+Shipped in `P12-S3`:
 
-Planned additions:
+Delivered additions:
 - `contradiction_cases`
 - `trust_signals`
 
-Important baseline note: `P12-S3` should layer on top of shipped retrieval traces and shipped mutation operations rather than redesigning either subsystem.
+Important baseline note: `P12-S3` is now the contradiction/trust baseline for the rest of Phase 12 and should not be reopened except where later sprint integration requires it.
 
 ### P12-S4: Public Eval Harness
 Expand the current retrieval evaluation foundation into public multi-suite benchmark runs and checked-in baseline reports.
@@ -130,6 +130,9 @@ Planned additions:
 - `eval_runs`
 - `eval_results`
 
+Important baseline note: `P12-S4` should measure shipped retrieval, mutation, and contradiction behavior rather than redesign those systems.
+Source-of-truth note: the checked-in fixture catalog defines the authoritative suite/case set and ordering; persisted eval suite/case rows are synchronized snapshots for execution and audit, not an independent planning surface.
+
 ### P12-S5: Task-Adaptive Briefing
 Separate durable memory from output-specific briefing layers.
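
Reviewer sketch (not part of the diff): the source-of-truth note in the `P12-S4` section above says the checked-in fixture catalog is authoritative and persisted suite/case rows are synchronized snapshots. One way that relationship could look is shown below. This is a minimal illustration under stated assumptions, not the code in `alicebot_api.public_evals`: `CatalogSuite`, `sync_catalog`, `select_suites`, and the in-memory `persisted` dict standing in for the `eval_suites`/`eval_cases` tables are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogSuite:
    """Hypothetical in-memory form of one suite entry from the fixture catalog."""
    suite_key: str
    case_keys: tuple[str, ...]


def sync_catalog(catalog, persisted):
    """Make persisted rows mirror the catalog and return the pruned suite keys.

    `persisted` is a dict of suite_key -> list of case keys, standing in for
    database rows. Suites missing from the catalog are deleted so they cannot
    survive as stale runtime state.
    """
    catalog_keys = {suite.suite_key for suite in catalog}
    pruned = sorted(key for key in persisted if key not in catalog_keys)
    for key in pruned:
        del persisted[key]
    for suite in catalog:
        # Overwrite, so removed cases inside a surviving suite are pruned too.
        persisted[suite.suite_key] = list(suite.case_keys)
    return pruned


def select_suites(catalog, suite_keys=None):
    """Select suites in catalog order; fail fast on unknown keys."""
    if suite_keys is None:
        return list(catalog)
    known = {suite.suite_key for suite in catalog}
    unknown = sorted(set(suite_keys) - known)
    if unknown:
        raise ValueError(f"unknown suite_key(s): {', '.join(unknown)}")
    wanted = set(suite_keys)
    return [suite for suite in catalog if suite.suite_key in wanted]
```

Selecting from the catalog rather than from persisted rows keeps ordering and membership tied to the checked-in file, which is the property the note describes.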
 
 
@@ -2,72 +2,81 @@
 
 ## Sprint Objective
 
-Implement `P12-S3` contradiction detection and trust calibration so conflicting continuity state becomes reviewable, auditable, and visible in retrieval and explain flows.
+Implement `P12-S4` public eval harness so Alice can run reproducible local eval suites, persist suite/case/run/result records, emit stable baseline report artifacts, and document what the measured quality surface means.
 
 ## Completed Work
 
-- Added contradiction and trust persistence with `contradiction_cases` and `trust_signals`.
-- Added contradiction detection for direct fact, preference, temporal, and source-hierarchy conflicts.
-- Added contradiction syncing on continuity create, review, explain, and recall paths.
-- Added contradiction-aware retrieval penalties and exposed contradiction counts and penalty scores in recall ordering metadata.
-- Added contradiction visibility and active trust-signal counts in continuity explain output.
-- Added current-branch contradiction case inspection and resolution flows in API, CLI, and MCP.
-- Added current-branch trust signal inspection in API, CLI, and MCP.
-- Added focused sprint documentation in `docs/memory/p12-s3-contradictions-trust-calibration.md`, explicitly framed as branch behavior where Control Tower decisions are still pending.
-- Added sprint-owned unit and integration coverage for detection, trust persistence, retrieval penalty behavior, explain visibility, CLI smoke, MCP smoke, and migration shape.
+- Added public eval persistence tables for `eval_suites`, `eval_cases`, `eval_runs`, and `eval_results`.
+- Added `alicebot_api.public_evals` with:
+  - fixture-catalog loading
+  - suite/case syncing into the database
+  - fixture-backed recall, resumption, correction, contradiction, and open-loop evaluators
+  - canonical report generation with stable digests
+  - report writing helper for checked-in baseline artifacts
+- Added current-branch public eval API surfaces:
+  - `GET /v1/evals/suites`
+  - `POST /v1/evals/runs`
+  - `GET /v1/evals/runs`
+  - `GET /v1/evals/runs/{eval_run_id}`
+- Made the checked-in fixture catalog authoritative for suite listing and run selection.
+- Added pruning for persisted suite/case rows so removed catalog entries do not survive as stale runtime state.
+- Added explicit validation for unknown `suite_key` filters instead of silently returning partial or empty runs.
+- Added CLI surfaces:
+  - `alicebot evals suites`
+  - `alicebot evals run`
+  - `alicebot evals runs`
+  - `alicebot evals show`
+- Added public fixture definitions in `eval/fixtures/public_eval_suites.json`.
+- Added checked-in current-branch baseline report artifact in `eval/baselines/public_eval_harness_v1.json`, with final committed artifact format still pending Control Tower confirmation.
+- Added sprint-owned docs in `docs/evals/public_eval_harness.md`, explicitly framed as current branch behavior where API and artifact decisions are still pending.
+- Added focused unit and integration coverage for the runner, migration, API, CLI, and baseline reproduction path.
 
 ## Incomplete Work
 
-- None within the sprint packet scope.
+- None inside the sprint packet scope.
 
 ## Files Changed
 
-- `.ai/handoff/CURRENT_STATE.md`
-- `ARCHITECTURE.md`
 - `BUILD_REPORT.md`
+- `RULES.md`
+- `ARCHITECTURE.md`
 - `CURRENT_STATE.md`
+- `.ai/handoff/CURRENT_STATE.md`
 - `PRODUCT_BRIEF.md`
-- `REVIEW_REPORT.md`
 - `ROADMAP.md`
-- `apps/api/alembic/versions/20260414_0059_phase12_contradictions_trust_calibration.py`
-- `apps/api/src/alicebot_api/continuity_contradictions.py`
-- `apps/api/src/alicebot_api/continuity_trust.py`
-- `apps/api/src/alicebot_api/store.py`
+- `REVIEW_REPORT.md`
+- `apps/api/alembic/versions/20260414_0060_phase12_public_eval_harness.py`
+- `apps/api/src/alicebot_api/cli.py`
 - `apps/api/src/alicebot_api/contracts.py`
-- `apps/api/src/alicebot_api/continuity_recall.py`
-- `apps/api/src/alicebot_api/continuity_explainability.py`
-- `apps/api/src/alicebot_api/continuity_evidence.py`
-- `apps/api/src/alicebot_api/continuity_objects.py`
-- `apps/api/src/alicebot_api/continuity_review.py`
 - `apps/api/src/alicebot_api/main.py`
-- `apps/api/src/alicebot_api/cli.py`
-- `apps/api/src/alicebot_api/cli_formatting.py`
-- `apps/api/src/alicebot_api/mcp_tools.py`
-- `tests/unit/test_20260414_0059_phase12_contradictions_trust_calibration.py`
-- `tests/unit/test_continuity_contradictions.py`
+- `apps/api/src/alicebot_api/public_evals.py`
+- `apps/api/src/alicebot_api/store.py`
+- `scripts/check_control_doc_truth.py`
+- `docs/evals/public_eval_harness.md`
+- `eval/baselines/public_eval_harness_v1.json`
+- `eval/fixtures/public_eval_suites.json`
+- `tests/integration/test_cli_integration.py`
+- `tests/integration/test_public_evals_api.py`
+- `tests/unit/test_20260414_0060_phase12_public_eval_harness.py`
 - `tests/unit/test_cli.py`
-- `tests/unit/test_mcp.py`
 - `tests/unit/test_main.py`
-- `tests/integration/test_contradictions_api.py`
-- `tests/integration/test_cli_integration.py`
-- `tests/integration/test_mcp_cli_parity.py`
-- `docs/memory/p12-s3-contradictions-trust-calibration.md`
-- `scripts/check_control_doc_truth.py`
+- `tests/unit/test_public_evals.py`
 
 ## Tests Run
 
-- `./.venv/bin/pytest tests/unit/test_continuity_contradictions.py tests/unit/test_20260414_0059_phase12_contradictions_trust_calibration.py tests/unit/test_continuity_recall.py tests/unit/test_continuity_review.py tests/unit/test_cli.py tests/unit/test_mcp.py tests/unit/test_main.py tests/integration/test_contradictions_api.py tests/integration/test_cli_integration.py tests/integration/test_mcp_cli_parity.py -q`
-  - Result: PASS (`104 passed`)
+- `./.venv/bin/pytest tests/unit/test_public_evals.py tests/unit/test_20260414_0060_phase12_public_eval_harness.py tests/unit/test_cli.py tests/unit/test_main.py tests/integration/test_public_evals_api.py tests/integration/test_cli_integration.py tests/integration/test_retrieval_evaluation_api.py -q`
+  - Result: PASS (`83 passed`)
 - `./.venv/bin/python scripts/check_control_doc_truth.py`
   - Result: PASS
-- `rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/memory`
+- `rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/evals eval/fixtures eval/baselines`
   - Result: PASS (no matches)
 
 ## Blockers/Issues
 
 - No sprint blocker remains.
-- Final product policy is still pending for the Control Tower decisions called out in the sprint packet, including contradiction attachment scope, long-term API shape, and the durable trust-signal policy boundary.
+- The recall suite keeps one non-gating coverage snapshot for entity-edge expansion. It records the current shipped output with `score=0.0` while the suite still passes because the catalog marks that case as observational rather than a strict gate.
+- Final product policy is still pending for the Control Tower decisions called out in the sprint packet, including the committed artifact format and whether `/v1/evals/*` remains part of the accepted Phase 12 surface.
 
 ## Recommended Next Step
 
-Request Control Tower merge review against the current `P12-S3` branch head.
+Request Control Tower merge review against the current `P12-S4` branch head.
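
Reviewer sketch (not part of the diff): the build report above mentions "canonical report generation with stable digests". A common way to get a stable digest is to serialize the report to canonical JSON (sorted keys, fixed separators) and hash the bytes. The sketch below assumes SHA-256 and assumes the report is a plain JSON-serializable dict; the actual canonicalization and hash used by the branch are not shown in this diff, and both function names are hypothetical.

```python
import hashlib
import json


def canonical_report_bytes(report):
    """Serialize a report dict deterministically (assumed canonical form)."""
    return json.dumps(
        report,
        sort_keys=True,          # key order no longer depends on insertion order
        separators=(",", ":"),   # no whitespace variation between runs
        ensure_ascii=True,
    ).encode("utf-8")


def report_digest(report):
    """Return a hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_report_bytes(report)).hexdigest()
```

Under this approach, volatile fields such as timestamps or generated run IDs would need to be left out of the hashed payload for a checked-in baseline to reproduce byte for byte; whether the branch does this is an assumption.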
@@ -11,7 +11,8 @@ Canonical handoff state lives at [.ai/handoff/CURRENT_STATE.md](.ai/handoff/CURR
 - `v0.2.0` is released.
 - Phase 12 Sprint 1 (`P12-S1`) is shipped.
 - Phase 12 Sprint 2 (`P12-S2`) is shipped.
-- Phase 12 Sprint 3 (`P12-S3`) is the active execution sprint.
+- Phase 12 Sprint 3 (`P12-S3`) is shipped.
+- Phase 12 Sprint 4 (`P12-S4`) is the active execution sprint.
 
 ## Current Baseline Truth
 - Alice has typed memory, provenance, trust classes, correction/supersession behavior, open loops, recall, resumption, and explainability.
@@ -20,17 +21,17 @@ Canonical handoff state lives at [.ai/handoff/CURRENT_STATE.md](.ai/handoff/CURR
 - The codebase also includes the shipped `P12-S2` memory mutation candidate and operation foundation.
 
 ## Not Yet First-Class In Repo
-- public multi-suite eval harness for recall/resumption/correction/contradiction/open-loops
 - task-adaptive brief compiler separated from current briefing surfaces
 
 ## Phase Transition Note
 - Phase 12 is active.
 - `P12-S1` is complete and establishes the retrieval baseline.
 - `P12-S2` is complete and establishes the mutation baseline.
-- `P12-S3` is the active sprint and should build contradiction and trust handling on top of shipped retrieval and mutation behavior.
-- The current `P12-S3` branch implements contradiction cases and trust-signal storage, pending Control Tower merge approval.
+- `P12-S3` is complete and establishes the contradiction/trust baseline.
+- `P12-S4` is the active sprint and should benchmark shipped retrieval, mutation, and contradiction behavior without reopening those systems.
+- The current `P12-S4` branch implements the public eval harness, fixture catalog, and checked-in baseline artifact, pending Control Tower merge approval.
 
 ## Immediate Control Tower Decisions Needed
-- Decide contradiction object attachment scope: continuity objects, memories, or both.
-- Decide trust-signal storage and ranking integration policy.
-- Decide the final contradiction and trust API surface shape for Phase 12.
+- Decide public eval suite taxonomy and baseline artifact format.
+- Decide what eval artifacts are committed versus generated locally.
+- Decide whether `P12-S4` stays CLI-first or keeps the current branch `/v1/evals/*` API surface.
@@ -15,7 +15,8 @@ Alice is a pre-1.0 continuity platform for AI agents and agent-assisted workflow
 - Phase 12 is active.
 - `P12-S1` Hybrid Retrieval + Reranking is shipped.
 - `P12-S2` Automated Memory Operations is shipped.
-- `P12-S3` Contradiction Detection + Trust Calibration is the active sprint.
+- `P12-S3` Contradiction Detection + Trust Calibration is shipped.
+- `P12-S4` Public Eval Harness is the active sprint.
 
 ## Next Phase
 ### Phase 12: Retrieval Quality + Adaptive Continuity
 
@@ -1,50 +1,58 @@
-# REVIEW REPORT
+# REVIEW_REPORT
 
 ## verdict
+
 PASS
 
 ## criteria met
-- Obvious contradictory facts are flagged automatically and persisted as contradiction cases.
-- Contradiction status appears in explain output, including open/resolved counts and penalty score.
-- Unresolved contradictions reduce retrieval confidence/rank through contradiction penalty integration.
-- Trust changes are stored and inspectable through API, CLI, and MCP trust-signal surfaces.
-- Contradiction review actions are auditable with stored resolution action, note, and timestamps.
-- `P12-S3` layers onto shipped retrieval and correction/mutation behavior without reopening those systems.
-- No local workstation paths, usernames, or machine-specific identifiers were found in the reviewed changed files or sprint docs.
+
+- The sprint now runs the public eval harness end to end locally through both CLI and API surfaces.
+- The checked-in fixture catalog is authoritative for suite listing and run selection.
+- Eval runs now prune removed suite/case definitions from persisted sync state, so the effective run set stays aligned with the checked-in catalog.
+- Unknown `suite_key` filters now fail fast instead of silently producing partial or empty runs.
+- The repo includes the checked-in baseline report artifact and sprint-owned eval docs.
+- The sprint still measures shipped retrieval, mutation, and contradiction behavior without reopening those systems.
+- Verification passed:
+  - `./.venv/bin/pytest tests/unit/test_public_evals.py tests/unit/test_20260414_0060_phase12_public_eval_harness.py tests/unit/test_cli.py tests/unit/test_main.py tests/integration/test_public_evals_api.py tests/integration/test_cli_integration.py tests/integration/test_retrieval_evaluation_api.py -q`
+  - `./.venv/bin/python scripts/check_control_doc_truth.py`
+- I re-checked the changed files and did not find local workstation paths, usernames, or similar machine-specific identifiers in the sprint-owned docs or artifacts.
 
 ## criteria missed
+
 - None.
 
 ## quality issues
-- Minor scope spill remains in control-doc churn outside the sprint-owned runtime surface, but it is not causing behavioral or acceptance risk.
+
+- No blocking quality issue remains in the sprint scope.
+- The recall suite still carries one observational snapshot case for entity-edge expansion with `score=0.0` while the suite passes by design. That is documented and not a regression, but it remains a product decision for a later sprint.
 
 ## regression risks
-- Low after the follow-up fix.
-- The main previously identified risks are now covered by regression tests:
-  - superseded history no longer reopens active contradictions
-  - naive temporal ISO values are normalized before overlap detection
+
+- Low. The main prior risk around stale catalog state persisting across runs is addressed by pruning and by reading suite definitions directly from the checked-in fixture catalog for listing.
 
 ## docs issues
+
 - None blocking.
-- Sprint docs now clarify that contradiction detection only uses live continuity objects (`active` and `stale`) and normalizes temporal bounds to UTC.
-- Sprint docs frame contradiction attachment, trust-ledger durability, and API surface choices as current branch behavior where Control Tower decisions are still pending, rather than as permanently settled product policy.
-- The local-path scrub command excludes `BUILD_REPORT.md` and `REVIEW_REPORT.md` because the reports now carry the literal search pattern as part of their recorded verification steps.
+- `BUILD_REPORT.md` now reflects the control-doc updates and the follow-up verification run.
+- Sprint docs now frame `/v1/evals/*` and the checked-in JSON baseline report as current branch behavior where the final Control Tower contract is still pending, rather than silently treating those choices as permanently settled product policy.
 
 ## should anything be added to RULES.md?
-- No required update.
+
+- Already addressed in this branch. `RULES.md` now states that canonical fixture catalogs must either be pruned into runtime state or used directly as the source of truth.
 
 ## should anything update ARCHITECTURE.md?
-- No required update for sprint acceptance.
-- When `P12-S3` becomes shipped baseline truth, the data-model summary should explicitly include `contradiction_cases` and `trust_signals`.
+
+- Already addressed in this branch. `ARCHITECTURE.md` now states that the checked-in fixture catalog is authoritative and persisted suite/case rows are synchronized snapshots.
 
 ## recommended next action
-- Proceed with merge review for `P12-S3`.
-- Carry the new superseded-history and naive-temporal contradiction cases forward in future regression suites.
+
+- Pass the sprint to merge review.
 
 ## reviewer verification
-- `./.venv/bin/pytest tests/unit/test_continuity_contradictions.py tests/unit/test_20260414_0059_phase12_contradictions_trust_calibration.py tests/unit/test_continuity_recall.py tests/unit/test_continuity_review.py tests/unit/test_cli.py tests/unit/test_mcp.py tests/unit/test_main.py tests/integration/test_contradictions_api.py tests/integration/test_cli_integration.py tests/integration/test_mcp_cli_parity.py -q`
-  - Result: PASS (`104 passed`)
+
+- `./.venv/bin/pytest tests/unit/test_public_evals.py tests/unit/test_20260414_0060_phase12_public_eval_harness.py tests/unit/test_cli.py tests/unit/test_main.py tests/integration/test_public_evals_api.py tests/integration/test_cli_integration.py tests/integration/test_retrieval_evaluation_api.py -q`
+  - Result: PASS (`83 passed`)
 - `./.venv/bin/python scripts/check_control_doc_truth.py`
   - Result: PASS
-- `rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/memory`
+- `rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/evals eval/fixtures eval/baselines`
   - Result: PASS (no matches)
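
Reviewer sketch (not part of the diff): both reports note that the recall suite passes while one observational case records `score=0.0`. The gating rule implied by that description could be expressed as below. The `CaseResult` type, the `gating` flag, and the `threshold` default are hypothetical; the fixture catalog's real field names and pass criteria are not visible in this diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical per-case outcome."""
    case_key: str
    score: float
    gating: bool = True  # False marks an observational snapshot case


def suite_passes(results, threshold=1.0):
    """A suite passes when every gating case meets the threshold.

    Observational cases are still recorded in `results` but never affect the
    verdict. A suite with no gating cases passes vacuously.
    """
    return all(r.score >= threshold for r in results if r.gating)
```

Keeping the observational score in the persisted results preserves the coverage snapshot for later comparison without turning a known gap into a failing gate.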