Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .agent-plan.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family

### Phase 6 — Notebook sequence + adversarial framing
- [x] PR 6.1: `release/notebooks/01_baseline_lead_scoring.ipynb` refreshed and `release/notebooks/02_relational_feature_engineering.ipynb` added. Notebook 01 trains LR + HistGBM on the public `intermediate` bundle using the **same feature set as the validation report** (drops only IDs and the label, mirrors `release_quality._partition_columns`), so the G13.2 reproduction gate compares apples to apples. This means notebook 01 **keeps** `total_touches_all` (the documented leakage trap) — narrative cell calls it out explicitly and forward-points to notebook 03 (PR 6.2) which dissects what dropping the trap does to performance. Notebook 02 by contrast **drops** the trap from the flat baseline so the relational lift attribution stays clean (its goal is teaching feature engineering, not reproducing the report). Targets are loaded at runtime from `release/notebooks/_release_targets.json` (audit-synced against `release/validation/validation_report.json` by `tests/release/notebooks/test_release_targets_match_report.py`); per-metric tolerances replace the original flat ±0.05 (AUC/Brier ±0.02, AP / top-decile ±0.05). Notebook 02 loads the seven snapshot-safe public tables, asserts every event-table `timestamp <= lead_created_at + snapshot_day` inline (with real min-headroom-under-cutoff readings, not a hardcoded literal), demonstrates four legal joins (touch-channel breakdown, account-level density fit on **train leads only**, sales-activity recency, train-only industry target encoding), trains LR + GBM on flat-baseline-only and flat+relational features, prints a 4-row metric panel + delta panel, and pins the four model AUCs and the headline `GBM(eng) − GBM(flat)` lift via `assert_within_tolerance` (sign-aware `assert lift > 0` on top of the absolute tolerance). Honest takeaway cell frames the +0.0147 AUC lift as suggestive, not conclusive (the cross-seed `gbm_auc` spread on this bundle is ~0.027); seed-sweep harness lands in PR 6.2's notebook 04. Both notebooks ship inside the public release bundle alongside the parquet tables (Kaggle/HF consumers download them together) so they import a sibling `release/notebooks/_notebook_utils.py` rather than rely on the `leadforge` package — `precision_at_k` and `top_decile_rate` mirror `release_quality._precision_at_k` / `_top_decile_rate` (locked in by mirror tests), and `assert_within_tolerance` is hardened against silent passes on non-finite metrics or incomplete per-metric tolerance maps. G13.1 acceptance gate wired: new `[notebooks]` extra (`nbclient`, `nbformat`, `scikit-learn`, `matplotlib`) and a dedicated `notebooks` CI job that regenerates the intermediate bundle via `python scripts/build_public_release.py release --tier intermediate` (only tier the notebooks need) then nbclient-executes both notebooks end-to-end (`tests/release/notebooks/test_execute_notebooks.py`, parametrised, gated on bundles-present). G13.3 path discipline enforced inline: notebook 01 hard-codes `BUNDLE = Path("../intermediate")` and asserts `manifest.exposure_mode == "student_public"`; notebook 02 explicitly excludes `customers`/`subscriptions` per `BANNED_TABLES`. Builders (`scripts/build_release_notebook_{01,02}.py`, sharing `scripts/_release_notebook_common.py`) emit deterministic byte-for-byte notebook JSON via explicit `cell_NNN` IDs (audit-artifact-sync pattern from PR 4.1 / 5.1 / 5.2, locked in by `tests/scripts/test_release_notebook_builders.py` which builds twice into `tmp_path` via the new `--out PATH` flag and diffs against the committed file without ever touching the working tree) and shell out to `ruff format` on the emitted file so builder output and pre-commit hook agree. Net: 1250/1250 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
- [ ] `release/notebooks/{03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb`
- [x] PR 6.2: `release/notebooks/03_leakage_and_time_windows.ipynb` and `release/notebooks/04_lift_calibration_value_ranking.ipynb` added. Notebook 03 turns the documented `total_touches_all` trap into a teaching moment: reads the trap label off `feature_dictionary.csv`, proves the trap by construction via a same-table comparison of `total_touches_all` (full-horizon) vs `touch_count` (snapshot-safe) — the post-snapshot delta sums to ~3.2 touches/lead and 82 % of leads have a positive delta — and then runs a standalone-AUC probe on the trap (~0.53 AUC, looks innocuous) followed by a side-by-side full-panel ± trap ablation that shows HistGBM extracts ~+0.032 AUC from the same column LR can only squeeze ~+0.009 from. The reframed pedagogy (vs the prompt's original "trap dominates a thin firmographic set" framing) is empirically driven: firmographic-only is at chance AUC even with the trap, but the GBM-vs-LR asymmetry on the strong panel is a real and useful finding — *standalone AUC probes undersell tree-friendly leakage*. Sign-aware tolerance gate pins each AUC ±0.02 and asserts `gbm_lift > 0.015` so a future regeneration that erases the trap or accidentally amplifies it breaks CI. Notebook 04 covers the four extra ranking lenses AUC alone misses: calibration / reliability diagram (max bin error ≈ 0.13), lift + cumulative gains (top-decile lift 2.75×), value-aware ranking via `expected_acv × P(convert)` (top-50 ACV-capture jumps from 0.16 to 0.40), threshold selection for fixed top-K capacity, cohort-shift evaluation (HistGBM on the first 85 % chronologically → score the last 15 %, mirrors `release_quality.measure_cohort_shift_from_bundle` down to `COHORT_TRAIN_FRAC=0.85` and `model_random_state=0`, **reproduces the report's `cohort_shift.intermediate` block exactly**: 0.8754 / 0.8908 / −0.0155), and a 200-iter bootstrap of the test-set AUC/AP as the within-bundle confidence band that public-bundle consumers (Kaggle / HF) can run without `leadforge` installed (the prompt's "seed-sweep harness" with bootstrap honestly acknowledged as the proxy for true cross-seed sweep, since rebuilding bundles isn't an option for downstream users). Cohort-shift values are pinned via a new `cohort_shift.intermediate` block in `release/notebooks/_release_targets.json` (audit-synced against `validation_report.cohort_shift.intermediate` by a new `test_cohort_shift_targets_match_validation_report` extension to the existing audit-sync test). Headline LR/GBM panel **drops** `total_touches_all` (matches notebook 02's posture, gives honest production numbers); cohort-shift section deliberately **keeps** the trap to reproduce the report's published cohort-shift numbers exactly — divergent posture explained inline. Both new builders (`scripts/build_release_notebook_{03,04}.py`) inherit the deterministic-cell-ID + `--out` byte-stability pattern from PR 6.1 and are added to `_BUILDERS` / `_NOTEBOOKS` in `tests/scripts/test_release_notebook_builders.py` and `tests/release/notebooks/test_execute_notebooks.py`. Both notebooks execute end-to-end in <10s each (well under G13.1's 3-min budget), assert `manifest.exposure_mode == "student_public"` (G13.3), and load only from `release/intermediate/`. Forward-pointer to `docs/release/break_me_guide.md` left as plain backtick-wrapped text — file lands in PR 6.3, no dead Markdown link. Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
- [ ] `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml`
- [ ] `docs/release/{break_me_guide,v2_decision_log}.md`

Expand Down
Loading
Loading