|
2 | 2 |
|
3 | 3 | ## Sprint Objective |
4 | 4 |
|
5 | | -Implement `P12-S3` contradiction detection and trust calibration so conflicting continuity state becomes reviewable, auditable, and visible in retrieval and explain flows. |
| 5 | +Implement `P12-S4` public eval harness so Alice can run reproducible local eval suites, persist suite/case/run/result records, emit stable baseline report artifacts, and document what the measured quality surface means. |
6 | 6 |
|
7 | 7 | ## Completed Work |
8 | 8 |
|
9 | | -- Added contradiction and trust persistence with `contradiction_cases` and `trust_signals`. |
10 | | -- Added contradiction detection for direct fact, preference, temporal, and source-hierarchy conflicts. |
11 | | -- Added contradiction syncing on continuity create, review, explain, and recall paths. |
12 | | -- Added contradiction-aware retrieval penalties and exposed contradiction counts and penalty scores in recall ordering metadata. |
13 | | -- Added contradiction visibility and active trust-signal counts in continuity explain output. |
14 | | -- Added current-branch contradiction case inspection and resolution flows in API, CLI, and MCP. |
15 | | -- Added current-branch trust signal inspection in API, CLI, and MCP. |
16 | | -- Added focused sprint documentation in `docs/memory/p12-s3-contradictions-trust-calibration.md`, explicitly framed as branch behavior where Control Tower decisions are still pending. |
17 | | -- Added sprint-owned unit and integration coverage for detection, trust persistence, retrieval penalty behavior, explain visibility, CLI smoke, MCP smoke, and migration shape. |
| 9 | +- Added public eval persistence tables for `eval_suites`, `eval_cases`, `eval_runs`, and `eval_results`. |
| 10 | +- Added `alicebot_api.public_evals` with: |
| 11 | + - fixture-catalog loading |
| 12 | + - suite/case syncing into the database |
| 13 | + - fixture-backed recall, resumption, correction, contradiction, and open-loop evaluators |
| 14 | + - canonical report generation with stable digests |
| 15 | + - report writing helper for checked-in baseline artifacts |
| 16 | +- Added current-branch public eval API surfaces: |
| 17 | + - `GET /v1/evals/suites` |
| 18 | + - `POST /v1/evals/runs` |
| 19 | + - `GET /v1/evals/runs` |
| 20 | + - `GET /v1/evals/runs/{eval_run_id}` |
| 21 | +- Made the checked-in fixture catalog authoritative for suite listing and run selection. |
| 22 | +- Added pruning for persisted suite/case rows so removed catalog entries do not survive as stale runtime state. |
| 23 | +- Added explicit validation for unknown `suite_key` filters instead of silently returning partial or empty runs. |
| 24 | +- Added CLI surfaces: |
| 25 | + - `alicebot evals suites` |
| 26 | + - `alicebot evals run` |
| 27 | + - `alicebot evals runs` |
| 28 | + - `alicebot evals show` |
| 29 | +- Added public fixture definitions in `eval/fixtures/public_eval_suites.json`. |
| 30 | +- Added checked-in current-branch baseline report artifact in `eval/baselines/public_eval_harness_v1.json`, with final committed artifact format still pending Control Tower confirmation. |
| 31 | +- Added sprint-owned docs in `docs/evals/public_eval_harness.md`, explicitly framed as current branch behavior where API and artifact decisions are still pending. |
| 32 | +- Added focused unit and integration coverage for the runner, migration, API, CLI, and baseline reproduction path. |
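For reviewers, the catalog-authoritative behavior described above (pruning removed suite/case rows and rejecting unknown `suite_key` filters) can be sketched roughly as follows. This is a minimal illustration, not the code in `alicebot_api.public_evals`: the function names `plan_suite_sync` and `validate_suite_filter` are hypothetical, and it assumes suites are identified by plain string keys.

```python
from typing import Iterable


def plan_suite_sync(
    catalog_keys: Iterable[str], persisted_keys: Iterable[str]
) -> dict[str, list[str]]:
    """Compare the checked-in catalog against persisted rows.

    Hypothetical helper: the catalog is treated as the source of truth, so
    persisted keys that are missing from it are scheduled for pruning.
    """
    catalog = set(catalog_keys)
    persisted = set(persisted_keys)
    return {
        "insert": sorted(catalog - persisted),  # new catalog entries
        "prune": sorted(persisted - catalog),   # stale runtime state
    }


def validate_suite_filter(
    requested: Iterable[str], catalog_keys: Iterable[str]
) -> list[str]:
    """Reject unknown suite keys instead of returning a partial or empty run."""
    requested_list = list(requested)
    unknown = sorted(set(requested_list) - set(catalog_keys))
    if unknown:
        raise ValueError(f"unknown suite_key values: {', '.join(unknown)}")
    return requested_list
```

Failing loudly on an unknown key keeps a mistyped filter from being mistaken for a passing run over zero cases.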
18 | 33 |
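The "canonical report generation with stable digests" item above can be approximated like this. It is a sketch under stated assumptions: the report is assumed to be a JSON-serializable dict, SHA-256 is an assumed choice of hash, and `canonical_report_bytes` and `report_digest` are hypothetical names rather than the harness's actual helpers.

```python
import hashlib
import json


def canonical_report_bytes(report: dict) -> bytes:
    """Serialize a report deterministically.

    Sorting keys and fixing separators means two reports with the same
    content produce the same bytes regardless of dict insertion order.
    """
    text = json.dumps(
        report, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return text.encode("utf-8")


def report_digest(report: dict) -> str:
    """Return a hex digest of the canonical serialization (SHA-256 assumed)."""
    return hashlib.sha256(canonical_report_bytes(report)).hexdigest()
```

A digest of this kind is only stable if volatile fields such as timestamps or run IDs are excluded from the hashed payload. Whether the harness does that is not stated in this report.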
|
19 | 34 | ## Incomplete Work |
20 | 35 |
|
21 | | -- None within the sprint packet scope. |
| 36 | +- None inside the sprint packet scope. |
22 | 37 |
|
23 | 38 | ## Files Changed |
24 | 39 |
|
25 | | -- `.ai/handoff/CURRENT_STATE.md` |
26 | | -- `ARCHITECTURE.md` |
27 | 40 | - `BUILD_REPORT.md` |
| 41 | +- `RULES.md` |
| 42 | +- `ARCHITECTURE.md` |
28 | 43 | - `CURRENT_STATE.md` |
| 44 | +- `.ai/handoff/CURRENT_STATE.md` |
29 | 45 | - `PRODUCT_BRIEF.md` |
30 | | -- `REVIEW_REPORT.md` |
31 | 46 | - `ROADMAP.md` |
32 | | -- `apps/api/alembic/versions/20260414_0059_phase12_contradictions_trust_calibration.py` |
33 | | -- `apps/api/src/alicebot_api/continuity_contradictions.py` |
34 | | -- `apps/api/src/alicebot_api/continuity_trust.py` |
35 | | -- `apps/api/src/alicebot_api/store.py` |
| 47 | +- `REVIEW_REPORT.md` |
| 48 | +- `apps/api/alembic/versions/20260414_0060_phase12_public_eval_harness.py` |
| 49 | +- `apps/api/src/alicebot_api/cli.py` |
36 | 50 | - `apps/api/src/alicebot_api/contracts.py` |
37 | | -- `apps/api/src/alicebot_api/continuity_recall.py` |
38 | | -- `apps/api/src/alicebot_api/continuity_explainability.py` |
39 | | -- `apps/api/src/alicebot_api/continuity_evidence.py` |
40 | | -- `apps/api/src/alicebot_api/continuity_objects.py` |
41 | | -- `apps/api/src/alicebot_api/continuity_review.py` |
42 | 51 | - `apps/api/src/alicebot_api/main.py` |
43 | | -- `apps/api/src/alicebot_api/cli.py` |
44 | | -- `apps/api/src/alicebot_api/cli_formatting.py` |
45 | | -- `apps/api/src/alicebot_api/mcp_tools.py` |
46 | | -- `tests/unit/test_20260414_0059_phase12_contradictions_trust_calibration.py` |
47 | | -- `tests/unit/test_continuity_contradictions.py` |
| 52 | +- `apps/api/src/alicebot_api/public_evals.py` |
| 53 | +- `apps/api/src/alicebot_api/store.py` |
| 54 | +- `scripts/check_control_doc_truth.py` |
| 55 | +- `docs/evals/public_eval_harness.md` |
| 56 | +- `eval/baselines/public_eval_harness_v1.json` |
| 57 | +- `eval/fixtures/public_eval_suites.json` |
| 58 | +- `tests/integration/test_cli_integration.py` |
| 59 | +- `tests/integration/test_public_evals_api.py` |
| 60 | +- `tests/unit/test_20260414_0060_phase12_public_eval_harness.py` |
48 | 61 | - `tests/unit/test_cli.py` |
49 | | -- `tests/unit/test_mcp.py` |
50 | 62 | - `tests/unit/test_main.py` |
51 | | -- `tests/integration/test_contradictions_api.py` |
52 | | -- `tests/integration/test_cli_integration.py` |
53 | | -- `tests/integration/test_mcp_cli_parity.py` |
54 | | -- `docs/memory/p12-s3-contradictions-trust-calibration.md` |
55 | | -- `scripts/check_control_doc_truth.py` |
| 63 | +- `tests/unit/test_public_evals.py` |
56 | 64 |
|
57 | 65 | ## Tests Run |
58 | 66 |
|
59 | | -- `./.venv/bin/pytest tests/unit/test_continuity_contradictions.py tests/unit/test_20260414_0059_phase12_contradictions_trust_calibration.py tests/unit/test_continuity_recall.py tests/unit/test_continuity_review.py tests/unit/test_cli.py tests/unit/test_mcp.py tests/unit/test_main.py tests/integration/test_contradictions_api.py tests/integration/test_cli_integration.py tests/integration/test_mcp_cli_parity.py -q` |
60 | | - - Result: PASS (`104 passed`) |
| 67 | +- `./.venv/bin/pytest tests/unit/test_public_evals.py tests/unit/test_20260414_0060_phase12_public_eval_harness.py tests/unit/test_cli.py tests/unit/test_main.py tests/integration/test_public_evals_api.py tests/integration/test_cli_integration.py tests/integration/test_retrieval_evaluation_api.py -q` |
| 68 | + - Result: PASS (`83 passed`) |
61 | 69 | - `./.venv/bin/python scripts/check_control_doc_truth.py` |
62 | 70 | - Result: PASS |
63 | | -- `rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/memory` |
| 71 | +- `rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/evals eval/fixtures eval/baselines` |
64 | 72 | - Result: PASS (no matches) |
65 | 73 |
|
66 | 74 | ## Blockers/Issues |
67 | 75 |
|
68 | 76 | - No sprint blocker remains. |
69 | | -- Final product policy is still pending for the Control Tower decisions called out in the sprint packet, including contradiction attachment scope, long-term API shape, and the durable trust-signal policy boundary. |
| 77 | +- The recall suite keeps one non-gating coverage snapshot for entity-edge expansion. It records the current shipped output with `score=0.0` while the suite still passes because the catalog marks that case as observational rather than a strict gate. |
| 78 | +- Final product policy is still pending for the Control Tower decisions called out in the sprint packet, including the committed artifact format and whether `/v1/evals/*` remains part of the accepted Phase 12 surface. |
70 | 79 |
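To make the non-gating case above concrete, the sketch below shows how a suite can pass while one case records `score=0.0`. The `CaseResult` fields (`gating`, `threshold`) and the `suite_passes` function are hypothetical and chosen for illustration. The real catalog schema and pass rule are not shown in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_key: str
    score: float
    gating: bool            # False marks an observational (non-gating) case
    threshold: float = 1.0  # assumed minimum score for a gating case


def suite_passes(results: list[CaseResult]) -> bool:
    """A suite passes when every gating case meets its threshold.

    Observational cases are still recorded in the results, but their
    scores never cause the suite to fail.
    """
    return all(r.score >= r.threshold for r in results if r.gating)


results = [
    CaseResult("recall-basic", score=1.0, gating=True),
    CaseResult("entity-edge-expansion", score=0.0, gating=False),
]
print(suite_passes(results))
```

Keeping the observational case in the persisted results preserves the coverage snapshot, so a later change in its score shows up in the baseline report even though it does not affect the suite verdict.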
|
71 | 80 | ## Recommended Next Step |
72 | 81 |
|
73 | | -Request Control Tower merge review against the current `P12-S3` branch head. |
| 82 | +Request Control Tower merge review against the current `P12-S4` branch head. |