|
1 | 1 | # BUILD_REPORT |
2 | 2 |
|
3 | 3 | ## Sprint Objective |
4 | | - |
5 | | -Implement `P12-S4` public eval harness so Alice can run reproducible local eval suites, persist suite/case/run/result records, emit stable baseline report artifacts, and document what the measured quality surface means. |
| 4 | +Implement `P12-S5` task-adaptive briefing so the system can generate deterministic, explainable, role-specific context packs for `user_recall`, `resume`, `worker_subtask`, and `agent_handoff`, while preserving shipped retrieval, mutation, contradiction, trust, and eval behavior. |
6 | 5 |
|
7 | 6 | ## Completed Work |
8 | | - |
9 | | -- Added public eval persistence tables for `eval_suites`, `eval_cases`, `eval_runs`, and `eval_results`. |
10 | | -- Added `alicebot_api.public_evals` with: |
11 | | - - fixture-catalog loading |
12 | | - - suite/case syncing into the database |
13 | | - - fixture-backed recall, resumption, correction, contradiction, and open-loop evaluators |
14 | | - - canonical report generation with stable digests |
15 | | - - report writing helper for checked-in baseline artifacts |
16 | | -- Added current-branch public eval API surfaces: |
17 | | - - `GET /v1/evals/suites` |
18 | | - - `POST /v1/evals/runs` |
19 | | - - `GET /v1/evals/runs` |
20 | | - - `GET /v1/evals/runs/{eval_run_id}` |
21 | | -- Made the checked-in fixture catalog authoritative for suite listing and run selection. |
22 | | -- Added pruning for persisted suite/case rows so removed catalog entries do not survive as stale runtime state. |
23 | | -- Added explicit validation for unknown `suite_key` filters instead of silently returning partial or empty runs. |
24 | | -- Added CLI surfaces: |
25 | | - - `alicebot evals suites` |
26 | | - - `alicebot evals run` |
27 | | - - `alicebot evals runs` |
28 | | - - `alicebot evals show` |
29 | | -- Added public fixture definitions in `eval/fixtures/public_eval_suites.json`. |
30 | | -- Added checked-in current-branch baseline report artifact in `eval/baselines/public_eval_harness_v1.json`, with final committed artifact format still pending Control Tower confirmation. |
31 | | -- Added sprint-owned docs in `docs/evals/public_eval_harness.md`, explicitly framed as current branch behavior where API and artifact decisions are still pending. |
32 | | -- Added focused unit and integration coverage for the runner, migration, API, CLI, and baseline reproduction path. |
| 7 | +- Added a dedicated task briefing compiler with four briefing modes. |
| 8 | +- Added deterministic briefing summaries, selection rules, truncation metadata, token budgeting, and comparison output. |
| 9 | +- Added task brief persistence through a new `task_briefs` table. |
| 10 | +- Added current-branch API surfaces for task-brief compile, inspect, and compare. |
| 11 | +- Added CLI surfaces for task-brief compile, inspect, and compare. |
| 12 | +- Added MCP tools for task-brief compile, inspect, and compare. |
| 13 | +- Added model-pack briefing defaults through `briefing_strategy` and `briefing_max_tokens`, and task-brief compilation now resolves those defaults when a workspace-selected model pack is available. |
| 14 | +- Added focused docs under `docs/briefing/`, explicitly framed as current branch behavior where briefing payload and surface-shape decisions are still pending. |
| 15 | +- Added unit and integration coverage for determinism, size reduction, persistence, CLI smoke, MCP smoke, API behavior, migration shape, and model-pack strategy fields. |
33 | 16 |
|
34 | 17 | ## Incomplete Work |
35 | | - |
36 | | -- None inside the sprint packet scope. |
| 18 | +- None within the sprint packet scope. |
37 | 19 |
|
38 | 20 | ## Files Changed |
39 | | - |
40 | | -- `BUILD_REPORT.md` |
41 | | -- `RULES.md` |
| 21 | +- `.ai/handoff/CURRENT_STATE.md` |
42 | 22 | - `ARCHITECTURE.md` |
| 23 | +- `BUILD_REPORT.md` |
43 | 24 | - `CURRENT_STATE.md` |
44 | | -- `.ai/handoff/CURRENT_STATE.md` |
45 | 25 | - `PRODUCT_BRIEF.md` |
46 | | -- `ROADMAP.md` |
47 | 26 | - `REVIEW_REPORT.md` |
48 | | -- `apps/api/alembic/versions/20260414_0060_phase12_public_eval_harness.py` |
49 | | -- `apps/api/src/alicebot_api/cli.py` |
| 27 | +- `ROADMAP.md` |
| 28 | +- `RULES.md` |
| 29 | +- `apps/api/src/alicebot_api/task_briefing.py` |
50 | 30 | - `apps/api/src/alicebot_api/contracts.py` |
51 | | -- `apps/api/src/alicebot_api/main.py` |
52 | | -- `apps/api/src/alicebot_api/public_evals.py` |
53 | 31 | - `apps/api/src/alicebot_api/store.py` |
54 | | -- `scripts/check_control_doc_truth.py` |
55 | | -- `docs/evals/public_eval_harness.md` |
56 | | -- `eval/baselines/public_eval_harness_v1.json` |
57 | | -- `eval/fixtures/public_eval_suites.json` |
58 | | -- `tests/integration/test_cli_integration.py` |
59 | | -- `tests/integration/test_public_evals_api.py` |
60 | | -- `tests/unit/test_20260414_0060_phase12_public_eval_harness.py` |
| 32 | +- `apps/api/src/alicebot_api/model_packs.py` |
| 33 | +- `apps/api/src/alicebot_api/main.py` |
| 34 | +- `apps/api/src/alicebot_api/cli.py` |
| 35 | +- `apps/api/src/alicebot_api/cli_formatting.py` |
| 36 | +- `apps/api/src/alicebot_api/mcp_tools.py` |
| 37 | +- `apps/api/alembic/versions/20260414_0061_phase12_task_adaptive_briefing.py` |
| 38 | +- `docs/briefing/task-adaptive-briefing.md` |
| 39 | +- `tests/unit/test_task_briefing.py` |
| 40 | +- `tests/unit/test_model_packs.py` |
61 | 41 | - `tests/unit/test_cli.py` |
62 | | -- `tests/unit/test_main.py` |
63 | | -- `tests/unit/test_public_evals.py` |
| 42 | +- `tests/unit/test_mcp.py` |
| 43 | +- `tests/unit/test_20260414_0061_phase12_task_adaptive_briefing.py` |
| 44 | +- `tests/integration/test_task_briefing_api.py` |
| 45 | +- `tests/integration/test_cli_integration.py` |
| 46 | +- `tests/integration/test_mcp_cli_parity.py` |
| 47 | +- `tests/integration/test_mcp_server.py` |
| 48 | +- `tests/integration/test_phase11_model_packs_api.py` |
| 49 | +- `scripts/check_control_doc_truth.py` |
64 | 50 |
|
65 | 51 | ## Tests Run |
66 | | - |
67 | | -- `./.venv/bin/pytest tests/unit/test_public_evals.py tests/unit/test_20260414_0060_phase12_public_eval_harness.py tests/unit/test_cli.py tests/unit/test_main.py tests/integration/test_public_evals_api.py tests/integration/test_cli_integration.py tests/integration/test_retrieval_evaluation_api.py -q` |
68 | | - - Result: PASS (`83 passed`) |
| 52 | +- `./.venv/bin/pytest tests/unit/test_task_briefing.py tests/unit/test_model_packs.py tests/unit/test_cli.py tests/unit/test_mcp.py tests/unit/test_20260414_0061_phase12_task_adaptive_briefing.py tests/unit/test_continuity_resumption.py tests/unit/test_continuity_recall.py tests/unit/test_public_evals.py tests/integration/test_task_briefing_api.py tests/integration/test_cli_integration.py tests/integration/test_mcp_cli_parity.py tests/integration/test_phase11_model_packs_api.py tests/integration/test_mcp_server.py tests/integration/test_public_evals_api.py tests/integration/test_continuity_resumption_api.py tests/integration/test_retrieval_evaluation_api.py -q` |
| 53 | + - Result: PASS (`73 passed`) |
69 | 54 | - `./.venv/bin/python scripts/check_control_doc_truth.py` |
70 | 55 | - Result: PASS |
71 | | -- `rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/evals eval/fixtures eval/baselines` |
| 56 | +- `rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/briefing` |
72 | 57 | - Result: PASS (no matches) |
73 | 58 |
|
74 | 59 | ## Blockers/Issues |
75 | | - |
76 | | -- No sprint blocker remains. |
77 | | -- The recall suite keeps one non-gating coverage snapshot for entity-edge expansion. It records the current shipped output with `score=0.0` while the suite still passes because the catalog marks that case as observational rather than a strict gate. |
78 | | -- Final product policy is still pending for the Control Tower decisions called out in the sprint packet, including the committed artifact format and whether `/v1/evals/*` remains part of the accepted Phase 12 surface. |
| 60 | +- No remaining blockers. |
| 61 | +- Final product policy is still pending for the Control Tower decisions called out in the sprint packet, including the canonical persisted briefing payload shape, required model-pack briefing fields, and whether generation and comparison APIs should both ship in `P12-S5`. |
79 | 62 |
|
80 | 63 | ## Recommended Next Step |
81 | | - |
82 | | -Request Control Tower merge review against the current `P12-S4` branch head. |
| 64 | +Request Control Tower merge review against the current `P12-S5` branch head. |
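
The "deterministic briefing summaries, selection rules, truncation metadata, token budgeting" item in Completed Work can be illustrated with a minimal sketch. This is not the code in `task_briefing.py`: the `Candidate` type, the `compile_brief` and `estimate_tokens` helpers, the whitespace token estimate, and the payload field names are all hypothetical, chosen only to show how a fixed sort order plus a canonical JSON digest could make a brief reproducible under a token budget.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical context item considered for a brief."""
    item_id: str
    text: str
    score: float


def estimate_tokens(text: str) -> int:
    # Assumption: a crude whitespace count stands in for a real tokenizer.
    return len(text.split())


def compile_brief(mode: str, candidates: list[Candidate], max_tokens: int) -> dict:
    """Select candidates under a token budget in a deterministic order."""
    # Sort by score (descending), then by id, so input order never matters.
    ordered = sorted(candidates, key=lambda c: (-c.score, c.item_id))

    selected: list[Candidate] = []
    omitted: list[str] = []
    used = 0
    for cand in ordered:
        cost = estimate_tokens(cand.text)
        if used + cost <= max_tokens:
            selected.append(cand)
            used += cost
        else:
            # Record what was dropped so the truncation is explainable.
            omitted.append(cand.item_id)

    payload = {
        "mode": mode,
        "items": [{"id": c.item_id, "text": c.text} for c in selected],
        "truncation": {
            "max_tokens": max_tokens,
            "used_tokens": used,
            "omitted_ids": omitted,
        },
    }
    # Canonical JSON (sorted keys, fixed separators) gives a stable digest.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    payload["digest"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return payload
```

Under these assumptions, two briefs built from the same candidates can be compared by digest alone, and the `truncation` block shows which items the budget excluded.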