Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .agent-plan.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family

### Phase 3 — Release validation hardening
- [x] PR 3.1: `leadforge/validation/leakage_probes.py` (new) — unified leakage taxonomy. Subsumes the PR 2.1 `relational_leakage` module and broadens it to the full design-doc / acceptance-gates taxonomy: direct (banned columns / banned tables, generalised to accept caller-supplied banned sets), time-window (`probe_snapshot_window`, generalised over `(table, ts_col)` pairs), relational (`probe_deterministic_reconstruction`, `deterministic_relational_reconstruction`), split (`probe_split_id_overlap` for G6.1/G6.2, `probe_split_near_duplicates` via deterministic rounded-vector hashing for G6.3, `probe_split_label_drift` opt-in), model-realism (`probe_bonus_model_auc` opt-in, new opt-in `probe_id_only_baseline` for G5.3, `probe_feature_subset_baseline` for G5.1/G5.2). `PROBE_REGISTRY` is the single source of truth (probe → taxonomy / opt-in flag); meta-test asserts every module-level `probe_*` is registered. Two orchestrators: `run_all_probes` / `run_all_probes_on_dataframes` (structural, kept stable for `validate_bundle`) and new `run_split_probes` (split-level over `{split_name: DataFrame}`). `relational_leakage.py` deleted; every internal call site updated (`leadforge/validation/{bundle_checks,invariants}.py`, `leadforge/render/{manifests,relational_snapshot_safe}.py`, `leadforge/exposure/filters.py` doc, `scripts/probe_relational_leakage.py`); test file renamed `test_relational_leakage.py` → `test_leakage_probes.py` and grew 24 new tests for the new probes + meta-coverage. `RelationalLeakageError` retained (now spans every taxonomy) with `LeakageError` alias for the new umbrella name. `BUNDLE_SCHEMA_VERSION` unchanged (purely additive on the validator side); 1067/1067 tests pass; hash-determinism preserved (67/67 files identical); `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier.
- [ ] PR 3.2: `leadforge/validation/{release_quality,reporting}.py` (new)
- [x] PR 3.2: `leadforge/validation/release_quality.py` + `leadforge/validation/reporting.py` (new). `release_quality.py` produces a structured `ReleaseQualityReport` (JSON-primitive `TierMetrics` / `CrossSeedTierMetrics` / `CohortShiftMetrics` / `CrossTierOrdering` dataclasses) covering G7.* (per-tier ROC-AUC, PR-AUC, log loss, Brier, calibration bins, P@K / R@K, lift@{1,5,10}%, top-decile rate, expected-ACV capture, LR-vs-HistGBM delta, source/engagement/stage/post-snapshot/ID-only baseline AUCs), G8.1 (cross-seed median + spread bands), G6.4 (random-vs-chronological cohort-shift split with HistGBM), and G7.4.* (cross-tier ordering booleans + descending rankings). `TierBuildSpec.from_bundle` + idempotent `regenerate_tier_for_seeds(spec, seeds, workdir)` orchestrate cross-seed rebuilds via `Generator.from_recipe`. `reporting.py` ships `render_report(report, output_dir)` writing `validation_report.json` (deterministic `dataclasses.asdict` + sorted-keys `json.dumps`, NaN→null), `validation_report.md` (every metric cell carries a `$.tiers.<tier>.medians.<field>` JSON-path citation per G10.6), and the pinned figure set (`lift_curve_{intro,intermediate,advanced}.png`, `calibration_intermediate.png`, `leakage_delta.png`, `cohort_shift.png`, `value_capture.png`) under the Agg backend. New deps: `matplotlib>=3.7` added to `[scripts]` and `[dev]` extras (mypy override too). `pyproject.toml` mypy override added. 28 new tests across `tests/validation/test_release_quality.py`, `tests/validation/test_reporting.py`, and `tests/integration/test_release_quality_round_trip.py` (synthetic minimal bundles + N=2 round-trip via `Generator.from_recipe(...).generate(_SMALL).save(...)`); 1095/1095 tests pass; ruff + mypy clean; hash-determinism preserved (67/67 files identical); `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` still exits 0 on every public tier; `BUNDLE_SCHEMA_VERSION` unchanged (purely additive layer on top of the validator/reporting stack).
- [ ] PR 3.3: `scripts/validate_release_candidate.py` (new); resolve numeric `TBD-*` bands in `v1_acceptance_gates.md`; `release/validation/validation_report.{json,md}` + figures auto-generated

### Phase 4 — Channel-signal audit + dataset card hardening
Expand Down
Loading
Loading