From 88c1c8f0bdaabbf2e9b52d897af0741c498f6c4e Mon Sep 17 00:00:00 2001
From: Shay Palachy
Date: Thu, 7 May 2026 12:07:31 +0300
Subject: [PATCH 1/3] PR 6.2: notebooks 03 (leakage + time windows) + 04 (lift / calibration / value / cohort) + bootstrap robustness harness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds the second pair of release notebooks: a leakage walkthrough that turns the documented `total_touches_all` trap into a teachable contrast, and a value-aware ranking notebook that covers calibration, lift, expected_acv-aware top-K policies, threshold selection, cohort shift, and a within-bundle bootstrap.

Notebook 03 — leakage and time windows
- Reads the trap label off `feature_dictionary.csv`; proves the trap by construction via a same-table comparison of `total_touches_all` (full-horizon) vs `touch_count` (snapshot-safe). Mean post-snapshot delta is 3.2 touches/lead; 82 % of leads have a positive delta.
- Standalone-AUC probe on the trap (~0.53 — looks innocuous) ⇒ full-panel ± trap ablation showing HistGBM extracts +0.032 AUC where LR only squeezes +0.009. Pedagogical headline: *standalone AUC probes undersell tree-friendly leakage*.
- Sign-aware tolerance gate pins each AUC ±0.02 and asserts `gbm_lift > 0.015` so a future regeneration that erases or amplifies the trap breaks CI.

Notebook 04 — lift, calibration, value, cohort, bootstrap
- Calibration / reliability diagram (LR max bin error ≈ 0.13).
- Lift @ 1/5/10 % + cumulative gains curve.
- `expected_acv × P(convert)` value-aware ranking: top-50 ACV-capture jumps from 0.16 to 0.40 vs P-only ranking.
- Threshold selection for fixed top-K capacity (e.g. 50/week).
- Cohort-shift evaluation that **reproduces the validation report's `cohort_shift.intermediate` block exactly** (0.8754 / 0.8908 / −0.0155). Mirrors `release_quality.measure_cohort_shift_from_bundle` down to `COHORT_TRAIN_FRAC=0.85` and `model_random_state=0`.
- 200-iter bootstrap of the test set as the within-bundle confidence band — explicitly framed as the proxy for true cross-seed sweep, since public-bundle consumers can't rebuild bundles without `leadforge` installed.
- Headline LR/GBM panel drops the trap (matches notebook 02); cohort-shift section keeps the trap to reproduce the report.

Audit-sync wiring
- `release/notebooks/_release_targets.json` gains a `cohort_shift.intermediate` block sourced from `validation_report.cohort_shift.intermediate`.
- `tests/release/notebooks/test_release_targets_match_report.py` adds `test_cohort_shift_targets_match_validation_report` to audit-sync the new block. Existing test now skips the `cohort_shift` key during the per-tier-medians loop.

Builders + tests
- `scripts/build_release_notebook_{03,04}.py` inherit the deterministic-cell-ID + `--out` byte-stability pattern from PR 6.1.
- Added to `_BUILDERS` / `_NOTEBOOKS` in `tests/scripts/test_release_notebook_builders.py` and `tests/release/notebooks/test_execute_notebooks.py`.
- Both notebooks execute end-to-end in <10s each (well under G13.1's 3-min budget), assert `manifest.exposure_mode == "student_public"` (G13.3), and load only from `release/intermediate/`.

Forward-pointer to `docs/release/break_me_guide.md` left as plain backtick-wrapped text; file lands in PR 6.3. `.agent-plan.md` Phase 6 entry updated to mark PR 6.2 complete.

Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
Co-Authored-By: Claude Opus 4.7
---
 .agent-plan.md                                |   2 +-
 .../03_leakage_and_time_windows.ipynb         | 394 +++++++++
 .../04_lift_calibration_value_ranking.ipynb   | 643 ++++++++++++++
 release/notebooks/_release_targets.json       |   8 +
 scripts/build_release_notebook_03.py          | 542 ++++++++++++
 scripts/build_release_notebook_04.py          | 822 ++++++++++++++++++
 .../notebooks/test_execute_notebooks.py       |   2 +
 .../test_release_targets_match_report.py      |  43 +-
 .../scripts/test_release_notebook_builders.py |   2 +
 9 files changed, 2455 insertions(+), 3 deletions(-)
 create mode 100644 release/notebooks/03_leakage_and_time_windows.ipynb
 create mode 100644 release/notebooks/04_lift_calibration_value_ranking.ipynb
 create mode 100644 scripts/build_release_notebook_03.py
 create mode 100644 scripts/build_release_notebook_04.py

diff --git a/.agent-plan.md b/.agent-plan.md
index f715f35..99821f3 100644
--- a/.agent-plan.md
+++ b/.agent-plan.md
@@ -59,7 +59,7 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
 ### Phase 6 — Notebook sequence + adversarial framing
 - [x] PR 6.1: `release/notebooks/01_baseline_lead_scoring.ipynb` refreshed and `release/notebooks/02_relational_feature_engineering.ipynb` added. Notebook 01 trains LR + HistGBM on the public `intermediate` bundle using the **same feature set as the validation report** (drops only IDs and the label, mirrors `release_quality._partition_columns`), so the G13.2 reproduction gate compares apples to apples. This means notebook 01 **keeps** `total_touches_all` (the documented leakage trap) — narrative cell calls it out explicitly and forward-points to notebook 03 (PR 6.2) which dissects what dropping the trap does to performance. Notebook 02 by contrast **drops** the trap from the flat baseline so the relational lift attribution stays clean (its goal is teaching feature engineering, not reproducing the report).
Targets are loaded at runtime from `release/notebooks/_release_targets.json` (audit-synced against `release/validation/validation_report.json` by `tests/release/notebooks/test_release_targets_match_report.py`); per-metric tolerances replace the original flat ±0.05 (AUC/Brier ±0.02, AP / top-decile ±0.05). Notebook 02 loads the seven snapshot-safe public tables, asserts every event-table `timestamp <= lead_created_at + snapshot_day` inline (with real min-headroom-under-cutoff readings, not a hardcoded literal), demonstrates four legal joins (touch-channel breakdown, account-level density fit on **train leads only**, sales-activity recency, train-only industry target encoding), trains LR + GBM on flat-baseline-only and flat+relational features, prints a 4-row metric panel + delta panel, and pins the four model AUCs and the headline `GBM(eng) − GBM(flat)` lift via `assert_within_tolerance` (sign-aware `assert lift > 0` on top of the absolute tolerance). Honest takeaway cell frames the +0.0147 AUC lift as suggestive, not conclusive (the cross-seed `gbm_auc` spread on this bundle is ~0.027); seed-sweep harness lands in PR 6.2's notebook 04. Both notebooks ship inside the public release bundle alongside the parquet tables (Kaggle/HF consumers download them together) so they import a sibling `release/notebooks/_notebook_utils.py` rather than rely on the `leadforge` package — `precision_at_k` and `top_decile_rate` mirror `release_quality._precision_at_k` / `_top_decile_rate` (locked in by mirror tests), and `assert_within_tolerance` is hardened against silent passes on non-finite metrics or incomplete per-metric tolerance maps. 
G13.1 acceptance gate wired: new `[notebooks]` extra (`nbclient`, `nbformat`, `scikit-learn`, `matplotlib`) and a dedicated `notebooks` CI job that regenerates the intermediate bundle via `python scripts/build_public_release.py release --tier intermediate` (only tier the notebooks need) then nbclient-executes both notebooks end-to-end (`tests/release/notebooks/test_execute_notebooks.py`, parametrised, gated on bundles-present). G13.3 path discipline enforced inline: notebook 01 hard-codes `BUNDLE = Path("../intermediate")` and asserts `manifest.exposure_mode == "student_public"`; notebook 02 explicitly excludes `customers`/`subscriptions` per `BANNED_TABLES`. Builders (`scripts/build_release_notebook_{01,02}.py`, sharing `scripts/_release_notebook_common.py`) emit deterministic byte-for-byte notebook JSON via explicit `cell_NNN` IDs (audit-artifact-sync pattern from PR 4.1 / 5.1 / 5.2, locked in by `tests/scripts/test_release_notebook_builders.py` which builds twice into `tmp_path` via the new `--out PATH` flag and diffs against the committed file without ever touching the working tree) and shell out to `ruff format` on the emitted file so builder output and pre-commit hook agree. Net: 1250/1250 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5. -- [ ] `release/notebooks/{03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb` +- [x] PR 6.2: `release/notebooks/03_leakage_and_time_windows.ipynb` and `release/notebooks/04_lift_calibration_value_ranking.ipynb` added. 
Notebook 03 turns the documented `total_touches_all` trap into a teaching moment: reads the trap label off `feature_dictionary.csv`, proves the trap by construction via a same-table comparison of `total_touches_all` (full-horizon) vs `touch_count` (snapshot-safe) — the post-snapshot delta averages ~3.2 touches/lead and 82 % of leads have a positive delta — and then runs a standalone-AUC probe on the trap (~0.53 AUC, looks innocuous) followed by a side-by-side full-panel ± trap ablation that shows HistGBM extracts ~+0.032 AUC from the same column LR can only squeeze ~+0.009 from. The reframed pedagogy (vs the prompt's original "trap dominates a thin firmographic set" framing) is empirically driven: firmographic-only is at chance AUC even with the trap, but the GBM-vs-LR asymmetry on the strong panel is a real and useful finding — *standalone AUC probes undersell tree-friendly leakage*. Sign-aware tolerance gate pins each AUC ±0.02 and asserts `gbm_lift > 0.015` so a future regeneration that erases the trap or accidentally amplifies it breaks CI. 
Notebook 04 covers the four extra ranking lenses AUC alone misses: calibration / reliability diagram (max bin error ≈ 0.13), lift + cumulative gains (top-decile lift 2.75×), value-aware ranking via `expected_acv × P(convert)` (top-50 ACV-capture jumps from 0.16 to 0.40), threshold selection for fixed top-K capacity, cohort-shift evaluation (HistGBM on the first 85 % chronologically → score the last 15 %, mirrors `release_quality.measure_cohort_shift_from_bundle` down to `COHORT_TRAIN_FRAC=0.85` and `model_random_state=0`, **reproduces the report's `cohort_shift.intermediate` block exactly**: 0.8754 / 0.8908 / −0.0155), and a 200-iter bootstrap of the test-set AUC/AP as the within-bundle confidence band that public-bundle consumers (Kaggle / HF) can run without `leadforge` installed (the prompt's "seed-sweep harness" with bootstrap honestly acknowledged as the proxy for true cross-seed sweep, since rebuilding bundles isn't an option for downstream users). Cohort-shift values are pinned via a new `cohort_shift.intermediate` block in `release/notebooks/_release_targets.json` (audit-synced against `validation_report.cohort_shift.intermediate` by a new `test_cohort_shift_targets_match_validation_report` extension to the existing audit-sync test). Headline LR/GBM panel **drops** `total_touches_all` (matches notebook 02's posture, gives honest production numbers); cohort-shift section deliberately **keeps** the trap to reproduce the report's published cohort-shift numbers exactly — divergent posture explained inline. Both new builders (`scripts/build_release_notebook_{03,04}.py`) inherit the deterministic-cell-ID + `--out` byte-stability pattern from PR 6.1 and are added to `_BUILDERS` / `_NOTEBOOKS` in `tests/scripts/test_release_notebook_builders.py` and `tests/release/notebooks/test_execute_notebooks.py`. 
Both notebooks execute end-to-end in <10s each (well under G13.1's 3-min budget), assert `manifest.exposure_mode == "student_public"` (G13.3), and load only from `release/intermediate/`. Forward-pointer to `docs/release/break_me_guide.md` left as plain backtick-wrapped text — file lands in PR 6.3, no dead Markdown link. Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5. - [ ] `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml` - [ ] `docs/release/{break_me_guide,v2_decision_log}.md` diff --git a/release/notebooks/03_leakage_and_time_windows.ipynb b/release/notebooks/03_leakage_and_time_windows.ipynb new file mode 100644 index 0000000..0be6d36 --- /dev/null +++ b/release/notebooks/03_leakage_and_time_windows.ipynb @@ -0,0 +1,394 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cell_000", + "metadata": {}, + "source": "# Notebook 03 — Leakage and Time Windows\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier.\n\nThe bundle ships with one **deliberate leakage trap**:\n`total_touches_all`. The feature dictionary marks it\n`leakage_risk = True`; the dataset card calls it out;\nnotebook 01 keeps it (matching the validation report's\npanel) while notebook 02 drops it. This notebook turns the\ntrap into a teaching moment.\n\nWe do four things:\n\n1. **Read the receipts.** The trap is documented in the\n feature dictionary. We surface that label.\n2. **Time-window proof.** Quantify how much\n `total_touches_all` differs from its snapshot-safe\n sibling `touch_count` — the difference is post-snapshot\n information by construction, regardless of how\n predictive that information turns out to be.\n3. **The lesson.** Run a single-column standalone-AUC probe\n on the trap (it looks innocuous, ~0.53). 
Then run the\n full-panel ± trap comparison: HistGBM extracts a\n substantial AUC lift (+0.03) from a column whose\n standalone AUC is barely above chance. Standalone\n probes undersell tree-friendly leakage.\n4. **Pin the deltas.** Sign-aware tolerance gates so a\n future regeneration that neutralises the trap (or\n accidentally amplifies it) breaks CI.\n\n**Public path discipline (G13.3).** This notebook reads only\nfrom `release/intermediate/` (the public student bundle). The\ninstructor companion is **not** loaded — leakage detection\nhas to work from the public artefact alone, since that's all\na downstream consumer ever has." + }, + { + "cell_type": "markdown", + "id": "cell_001", + "metadata": {}, + "source": "## 1. Setup" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_002", + "metadata": {}, + "outputs": [], + "source": [ + "from __future__ import annotations\n", + "\n", + "import json\n", + "import sys\n", + "from pathlib import Path\n", + "\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn.compose import ColumnTransformer\n", + "from sklearn.ensemble import HistGradientBoostingClassifier\n", + "from sklearn.impute import SimpleImputer\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.metrics import roc_auc_score\n", + "from sklearn.pipeline import Pipeline\n", + "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", + "\n", + "sys.path.insert(0, str(Path.cwd()))\n", + "from _notebook_utils import assert_within_tolerance, precision_at_k\n", + "\n", + "SEED = 42\n", + "BUNDLE = Path(\"../intermediate\") # public student bundle\n", + "TASK = \"converted_within_90_days\"\n", + "TRAP = \"total_touches_all\"\n", + "\n", + "with (BUNDLE / \"manifest.json\").open() as fh:\n", + " manifest = json.load(fh)\n", + "assert manifest[\"exposure_mode\"] == \"student_public\"\n", + "assert manifest[\"relational_snapshot_safe\"] is True\n", + 
"SNAPSHOT_DAY = int(manifest[\"snapshot_day\"])\n", + "HORIZON_DAYS = int(manifest[\"horizon_days\"])\n", + "print(f\"snapshot_day = {SNAPSHOT_DAY} horizon_days = {HORIZON_DAYS}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_003", + "metadata": {}, + "source": "## 2. The trap, as the feature dictionary calls it out\n\nThe release ships a `feature_dictionary.csv` next to the\ndata. Any column with `leakage_risk = True` is flagged as a\n**deliberate teaching trap** — included so users can practise\ndetecting it, with the trap's nature documented inline.\n\nTreat the feature dictionary as the first place you look on\nany new dataset. A column named `total_touches_all` is not\nobviously bad until the dictionary tells you it counts\ntouches over the full 90-day horizon, well past the\n30-day snapshot anchor that defines the prediction time." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_004", + "metadata": {}, + "outputs": [], + "source": [ + "feat_dict = pd.read_csv(BUNDLE / \"feature_dictionary.csv\")\n", + "traps = feat_dict[feat_dict[\"leakage_risk\"].astype(bool)]\n", + "print(f\"trap columns flagged in feature_dictionary.csv: {len(traps)}\")\n", + "for _, row in traps.iterrows():\n", + " print(f\" {row['name']}: {row['description']}\")\n", + "assert TRAP in set(traps[\"name\"]), f\"{TRAP} expected to be flagged in dictionary\"" + ] + }, + { + "cell_type": "markdown", + "id": "cell_005", + "metadata": {}, + "source": "## 3. Time-window proof — the trap *by construction*\n\nThe dictionary *says* `total_touches_all` uses post-snapshot\ndata. We verify that on the same row that carries the trap:\nthe task table also carries `touch_count`, the\n**snapshot-safe** touch aggregate (filtered to\n`touch_timestamp <= lead_created_at + snapshot_day`). 
Their\ndifference is the **post-snapshot delta** — by construction,\ninformation from days 31–90 that the model should never see\nwhen scoring at day 30.\n\nThe pedagogical point is independent of how predictive that\ndifference turns out to be. **A column that uses\npost-snapshot data is invalid at scoring time even when it\nlooks unpredictive in isolation.** Section 4 measures that\n\"looks unpredictive in isolation\" claim directly, then\nsection 5 shows it can be misleading.\n\nWe pool all three task splits so the receipt covers every\nlead in the bundle." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_006", + "metadata": {}, + "outputs": [], + "source": [ + "train = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"train.parquet\")\n", + "valid = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"valid.parquet\")\n", + "test = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"test.parquet\")\n", + "\n", + "all_leads = pd.concat([train, valid, test], ignore_index=True)\n", + "assert all_leads[\"lead_id\"].is_unique, \"expected one row per lead across train/valid/test\"\n", + "\n", + "window = all_leads[[\"lead_id\", TRAP, \"touch_count\", TASK]].copy()\n", + "window[TRAP] = pd.to_numeric(window[TRAP], errors=\"coerce\")\n", + "window[\"touch_count\"] = pd.to_numeric(window[\"touch_count\"], errors=\"coerce\")\n", + "window = window.dropna(subset=[TRAP, \"touch_count\"]).copy()\n", + "window[\"post_snapshot_touches\"] = window[TRAP] - window[\"touch_count\"]\n", + "window[TASK] = window[TASK].astype(\"boolean\").fillna(False).astype(int)\n", + "\n", + "print(f\"leads used in this section: {len(window):,}\")\n", + "print(f\" {TRAP:<22s} mean={window[TRAP].mean():6.2f} max={int(window[TRAP].max()):>4d}\")\n", + "print(\n", + " f\" {'touch_count (snapshot-safe)':<22s} \"\n", + " f\"mean={window['touch_count'].mean():6.2f} \"\n", + " f\"max={int(window['touch_count'].max()):>4d}\"\n", + ")\n", + "mean_delta = 
float(window[\"post_snapshot_touches\"].mean())\n", + "n_post = int((window[\"post_snapshot_touches\"] > 0).sum())\n", + "print(\n", + " f\" {'post-snapshot delta':<22s} \"\n", + " f\"mean={mean_delta:6.2f} \"\n", + " f\"max={int(window['post_snapshot_touches'].max()):>4d}\"\n", + ")\n", + "print(\n", + " f\" → {n_post:,} of {len(window):,} leads \"\n", + " f\"({n_post / len(window):.1%}) have a positive post-snapshot delta\"\n", + ")\n", + "assert mean_delta > 0, (\n", + " \"expected a positive mean post-snapshot delta — if zero, the trap may \"\n", + " \"have been silently rebuilt as a snapshot-safe aggregate\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_007", + "metadata": {}, + "source": "### 3.1 The post-snapshot delta is uncorrelated with the label *on this dataset*\n\nOn the v1 procurement world, the count of touches between\nday 30 and day 90 turns out to be roughly the same for\nconverted and non-converted leads — sales reps keep working\nboth groups for a while before the funnel settles. A\nstronger world (more aggressive sales follow-up on hot\nleads) would split these apart; this one doesn't.\n\nThe plot below makes that lack-of-split visible. The trap\nis *still a trap* — we just can't tell that from the\npost-snapshot delta alone, which is why the validation\nreport's `post_snapshot_aggregates` baseline (a single-\ncolumn probe) gives an AUC of only ~0.55. The real damage\nshows up when a tree model gets to combine the trap with\nother columns; section 5 measures that." 
+ }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_008", + "metadata": {}, + "outputs": [], + "source": [ + "grouped = window.groupby(TASK)[\"post_snapshot_touches\"].agg([\"mean\", \"median\", \"count\"])\n", + "grouped.index = grouped.index.map({0: \"non-converted\", 1: \"converted\"})\n", + "print(grouped)\n", + "\n", + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "data = [\n", + " window.loc[window[TASK] == 0, \"post_snapshot_touches\"],\n", + " window.loc[window[TASK] == 1, \"post_snapshot_touches\"],\n", + "]\n", + "ax.boxplot(data, tick_labels=[\"non-converted\", \"converted\"], showfliers=False)\n", + "ax.set_ylabel(\"post-snapshot touches (total_touches_all − touch_count)\")\n", + "ax.set_title(\n", + " \"Post-snapshot delta by label\\n(roughly the same — section 5 explains why this is misleading)\"\n", + ")\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "cell_009", + "metadata": {}, + "source": "## 4. Standalone-AUC probe (the audit that almost lets the trap pass)\n\nA common leakage audit is to fit a one-feature classifier on\neach suspect column and report the standalone AUC. The\nvalidation report does this at scale — its\n`post_snapshot_aggregates` baseline trains a model on the\nsingle column `total_touches_all` and reports an AUC around\n0.55. That sounds tame, and on a busy schedule it's tempting\nto clear the column on those grounds.\n\nWe re-run the probe here so you've seen the number with your\nown eyes: ~0.53. If that's all you measure, the trap looks\nbarely worth mentioning. Section 5 shows what that audit\nmisses." 
+ }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_010", + "metadata": {}, + "outputs": [], + "source": [ + "standalone_window = window.dropna(subset=[TRAP, \"touch_count\"]).copy()\n", + "y = standalone_window[TASK].astype(int).to_numpy()\n", + "standalone = {\n", + " TRAP: float(roc_auc_score(y, standalone_window[TRAP].to_numpy())),\n", + " \"touch_count (snapshot-safe)\": float(\n", + " roc_auc_score(y, standalone_window[\"touch_count\"].to_numpy())\n", + " ),\n", + " \"post-snapshot delta\": float(\n", + " roc_auc_score(y, standalone_window[\"post_snapshot_touches\"].to_numpy())\n", + " ),\n", + "}\n", + "print(f\"{'feature':<32s} {'standalone AUC':>16s}\")\n", + "for name, auc in standalone.items():\n", + " print(f\" {name:<30s} {auc:>16.4f}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_011", + "metadata": {}, + "source": "## 5. Side-by-side AUC: full panel ± trap\n\nTrain two HistGBM and two Logistic Regression baselines on\nthe **same train/test split, same model, same seed** —\nthe only thing that varies is whether `total_touches_all`\nis in the column list.\n\nWe use the full as-shipped feature panel (every public\nsnapshot column except IDs / label) as the baseline. This\nmirrors notebook 01 / the validation report's setup, so the\nwith-trap AUC reproduces the report's published number and\nthe without-trap AUC is what notebook 02 starts from." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_012", + "metadata": {}, + "outputs": [], + "source": [ + "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n", + "EXCLUDE = set(ID_COLS + [TASK])\n", + "\n", + "full_cols = [c for c in train.columns if c not in EXCLUDE]\n", + "full_cols_no_trap = [c for c in full_cols if c != TRAP]\n", + "print(f\"full panel: {len(full_cols)} cols (incl. 
{TRAP})\")\n", + "print(f\"full panel no trap: {len(full_cols_no_trap)} cols\")\n", + "\n", + "\n", + "def _split_cols(df: pd.DataFrame, cols: list[str]) -> tuple[list[str], list[str]]:\n", + " cat = [\n", + " c\n", + " for c in cols\n", + " if not (pd.api.types.is_bool_dtype(df[c]) or pd.api.types.is_numeric_dtype(df[c]))\n", + " ]\n", + " num = [c for c in cols if c not in cat]\n", + " return num, cat\n", + "\n", + "\n", + "def _sanitize(df: pd.DataFrame, cat_cols: list[str]) -> pd.DataFrame:\n", + " out = df.copy()\n", + " for c in cat_cols:\n", + " out[c] = out[c].astype(object).where(out[c].notna(), None)\n", + " return out\n", + "\n", + "\n", + "def _build_pipeline(num_cols: list[str], cat_cols: list[str], *, model: str) -> Pipeline:\n", + " num_t = Pipeline(\n", + " [\n", + " (\"imputer\", SimpleImputer(strategy=\"median\")),\n", + " (\"scaler\", StandardScaler()),\n", + " ]\n", + " )\n", + " cat_t = Pipeline(\n", + " [\n", + " (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n", + " (\"encoder\", OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)),\n", + " ]\n", + " )\n", + " pre = ColumnTransformer(\n", + " [(\"num\", num_t, num_cols), (\"cat\", cat_t, cat_cols)],\n", + " remainder=\"drop\",\n", + " )\n", + " if model == \"lr\":\n", + " clf = LogisticRegression(max_iter=1000, solver=\"lbfgs\", random_state=SEED)\n", + " else:\n", + " clf = HistGradientBoostingClassifier(random_state=SEED)\n", + " return Pipeline([(\"preprocessor\", pre), (\"classifier\", clf)])\n", + "\n", + "\n", + "def fit_score(cols: list[str], *, model: str) -> np.ndarray:\n", + " num_cols, cat_cols = _split_cols(train, cols)\n", + " pipe = _build_pipeline(num_cols, cat_cols, model=model)\n", + " pipe.fit(_sanitize(train[cols], cat_cols), y_train)\n", + " return pipe.predict_proba(_sanitize(test[cols], cat_cols))[:, 1]\n", + "\n", + "\n", + "y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", + "y_test = 
test[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", + "base_rate = float(y_test.mean())\n", + "print(f\"test base rate: {base_rate:.3f}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_013", + "metadata": {}, + "outputs": [], + "source": [ + "results: dict[str, dict[str, float]] = {}\n", + "for model in (\"lr\", \"gbm\"):\n", + " p_with = fit_score(full_cols, model=model)\n", + " p_without = fit_score(full_cols_no_trap, model=model)\n", + " results[model] = {\n", + " \"with_trap_auc\": float(roc_auc_score(y_test, p_with)),\n", + " \"without_trap_auc\": float(roc_auc_score(y_test, p_without)),\n", + " \"with_trap_p100\": precision_at_k(p_with, y_test, 100),\n", + " \"without_trap_p100\": precision_at_k(p_without, y_test, 100),\n", + " }\n", + "\n", + "print(f\"{'model':<5s} {'with trap':>10s} {'without trap':>13s} {'Δ AUC':>8s}\")\n", + "for m, r in results.items():\n", + " d = r[\"with_trap_auc\"] - r[\"without_trap_auc\"]\n", + " print(f\"{m:<5s} {r['with_trap_auc']:>10.4f} {r['without_trap_auc']:>13.4f} {d:+8.4f}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_014", + "metadata": {}, + "source": "### 5.1 The lesson — standalone AUC underestimates trap impact\n\nSection 4 says `total_touches_all` is barely above chance\n(~0.53 AUC) on its own. Section 5 says HistGBM extracts a\nsizeable lift (~+0.03 AUC) from the same column once it can\ncombine it with the rest of the feature panel. 
Both\nmeasurements are correct; they just measure different things.\n\n**Why the gap?** A standalone-AUC probe asks *can this\ncolumn rank leads when it's the only signal you have?* A\ntree model with the rest of the panel already in scope asks\n*can this column refine my existing splits?* The trap's\npost-snapshot information correlates with other columns\nnon-linearly — a few late touches by an outbound rep on an\nengaged-but-not-yet-converted lead is a very different\nsignal from the same touches on a cold lead — and the\ntree can carve the join, while a single-feature probe\nnever sees it. The Logistic Regression gain is much smaller\n(~+0.01) for the same reason: it cannot represent that\ninteraction structure.\n\nBar chart below highlights the asymmetry." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_015", + "metadata": {}, + "outputs": [], + "source": [ + "labels = [\"GBM full\", \"LR full\"]\n", + "deltas = [\n", + " results[\"gbm\"][\"with_trap_auc\"] - results[\"gbm\"][\"without_trap_auc\"],\n", + " results[\"lr\"][\"with_trap_auc\"] - results[\"lr\"][\"without_trap_auc\"],\n", + "]\n", + "colors = [\"#3b82f6\", \"#9ca3af\"]\n", + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "ax.bar(range(len(labels)), deltas, color=colors)\n", + "ax.axhline(0.0, color=\"#1f2937\", linewidth=0.8)\n", + "ax.axhline(\n", + " standalone[TRAP] - 0.5,\n", + " color=\"#ef4444\",\n", + " linestyle=\"--\",\n", + " label=f\"standalone-AUC excess ({standalone[TRAP] - 0.5:+.3f})\",\n", + ")\n", + "ax.set_xticks(range(len(labels)))\n", + "ax.set_xticklabels(labels)\n", + "ax.set_ylabel(\"ΔAUC = with_trap − without_trap\")\n", + "ax.set_title(\"Trap impact — tree models extract more than the probe predicts\")\n", + "ax.legend(loc=\"best\", fontsize=8)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "cell_016", + "metadata": {}, + "source": "## 6. 
Tolerance gate (G13.2)\n\nSingle-seed (seed=42) AUCs and trap deltas observed on the\nas-shipped intermediate bundle. Tolerances pin each AUC to\nwithin ±0.02 (well outside numerical jitter, well inside the\nband that would hide a regression). The sign-aware\nassertion below makes the pedagogical claim load-bearing:\nif a regeneration ever neutralises the GBM trap-delta, this\nfails — even if the absolute AUCs stay inside their bands." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_017", + "metadata": {}, + "outputs": [], + "source": [ + "NB03_TARGETS = {\n", + " \"lr_with_trap_auc\": 0.8827,\n", + " \"lr_without_trap_auc\": 0.8737,\n", + " \"gbm_with_trap_auc\": 0.8754,\n", + " \"gbm_without_trap_auc\": 0.8432,\n", + " \"trap_standalone_auc\": 0.5310,\n", + "}\n", + "NB03_TOLERANCES = dict.fromkeys(NB03_TARGETS, 0.02)\n", + "\n", + "observed = {\n", + " \"lr_with_trap_auc\": results[\"lr\"][\"with_trap_auc\"],\n", + " \"lr_without_trap_auc\": results[\"lr\"][\"without_trap_auc\"],\n", + " \"gbm_with_trap_auc\": results[\"gbm\"][\"with_trap_auc\"],\n", + " \"gbm_without_trap_auc\": results[\"gbm\"][\"without_trap_auc\"],\n", + " \"trap_standalone_auc\": standalone[TRAP],\n", + "}\n", + "assert_within_tolerance(\n", + " observed=observed,\n", + " target=NB03_TARGETS,\n", + " tolerances=NB03_TOLERANCES,\n", + " label=\"notebook 03 trap-panel AUCs (seed 42, intermediate)\",\n", + ")\n", + "\n", + "# Sign-aware: GBM must extract a meaningful lift from the\n", + "# trap. 
Threshold sits well below the seed-42 observation\n", + "# (~+0.032) but well above LR's +0.009, so it specifically\n", + "# guards the tree-model lift the section-5 narrative claims.\n", + "MIN_GBM_LIFT = 0.015\n", + "gbm_lift = results[\"gbm\"][\"with_trap_auc\"] - results[\"gbm\"][\"without_trap_auc\"]\n", + "assert gbm_lift > MIN_GBM_LIFT, (\n", + " f\"GBM trap-lift collapsed: {gbm_lift:+.4f} <= {MIN_GBM_LIFT:.4f} — \"\n", + " \"the trap is no longer carrying the pedagogical lesson in section 5\"\n", + ")\n", + "print(\"OK — trap-panel AUCs in tolerance and GBM lift positive.\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_018", + "metadata": {}, + "source": "## 7. A detection recipe you can run on any dataset\n\nThe trap was easy to spot here because the dataset\n*advertises* it. On a third-party dataset you don't get\nthat courtesy. The same recipe still works:\n\n1. **Read any feature dictionary you have.** Any column\n whose description references a window longer than the\n prediction horizon is suspicious. Even when no\n dictionary ships, an obvious naming smell (`*_total`,\n `*_all`, `*_lifetime`) on a 30-day-snapshot dataset is a\n flag.\n2. **Probe the standalone AUC** *and* **the contribution to\n a tree model.** A standalone probe alone undersells\n tree-friendly leakage (sections 4 and 5 demonstrate why\n on this dataset). Train a model with the column, train\n another without, and compare. The ablation captures\n interactions the standalone probe can't.\n3. **Inspect the time window.** Cross-check the suspect\n column against any time-stamped event tables. If the\n column's value can only be explained by events past the\n snapshot anchor, you've found a trap. 
Section 3 makes\n this concrete here — the same technique generalises\n anywhere there's an event table to corroborate.\n\nA walkthrough of additional detection patterns\n(column-name heuristics, isolation-via-residuals,\ntarget-encoding leakage on test) lives in\n`docs/release/break_me_guide.md` (coming in PR 6.3) — pair\nit with this notebook for a more complete playbook.\n\n## Next\n\n- **Notebook 04** — value-aware ranking\n (`expected_acv` × P(convert)), calibration plots,\n threshold selection for top-K capacity, and a\n cohort-shift / bootstrap robustness harness." + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/release/notebooks/04_lift_calibration_value_ranking.ipynb b/release/notebooks/04_lift_calibration_value_ranking.ipynb new file mode 100644 index 0000000..a94d64b --- /dev/null +++ b/release/notebooks/04_lift_calibration_value_ranking.ipynb @@ -0,0 +1,643 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cell_000", + "metadata": {}, + "source": "# Notebook 04 — Lift, Calibration, Value-Aware Ranking\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier.\n\nAUC ranks well; that doesn't mean it ranks *for the right\nthing*. Sales teams care about three additional concerns\nAUC alone never tells you about:\n\n1. **Calibration.** Are predicted probabilities trustworthy\n as point estimates, or just as a ranking?\n2. **Value-aware ranking.** A 30 %-likely lead worth $200K\n is more valuable than a 60 %-likely one worth $20K.\n Ranking by P(convert) wastes ACV; ranking by\n P(convert) × `expected_acv` doesn't.\n3. **Robustness.** Does the model still work next quarter\n (cohort shift)? 
How tight is the metric on the test set\n you have (bootstrap)?\n\nWe answer all three on the public bundle, plus a threshold-\nselection walkthrough that maps a fixed sales-capacity\nconstraint to an operating point. The notebook closes with\na tolerance gate that pins the cohort-shift result to the\npublished validation report — if a regeneration ever\nsilently changes the cohort-degradation behaviour, CI\ncatches it.\n\n**Public path discipline (G13.3).** Loads only\n`release/intermediate/` (the public student bundle).\nInstructor-only artefacts (the latent registry, full-horizon\nevent tables, hidden DAG) are never read.\n\n**Trap discipline.** The headline LR / GBM panel drops\n`total_touches_all` (per notebook 02's leakage discipline)\nso the metrics it reports are honest production numbers.\nThe cohort-shift section deliberately *keeps* the trap to\nreproduce the validation report's cohort-shift block — the\nreport's panel is the as-shipped one, and we want a\ncomparable number, not a cleaner one." + }, + { + "cell_type": "markdown", + "id": "cell_001", + "metadata": {}, + "source": "## 1. 
Setup" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_002", + "metadata": {}, + "outputs": [], + "source": [ + "from __future__ import annotations\n", + "\n", + "import json\n", + "import sys\n", + "from pathlib import Path\n", + "\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn.compose import ColumnTransformer\n", + "from sklearn.ensemble import HistGradientBoostingClassifier\n", + "from sklearn.impute import SimpleImputer\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.metrics import (\n", + " average_precision_score,\n", + " brier_score_loss,\n", + " roc_auc_score,\n", + ")\n", + "from sklearn.pipeline import Pipeline\n", + "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", + "\n", + "sys.path.insert(0, str(Path.cwd()))\n", + "from _notebook_utils import assert_within_tolerance\n", + "\n", + "SEED = 42\n", + "BUNDLE = Path(\"../intermediate\") # public student bundle\n", + "TASK = \"converted_within_90_days\"\n", + "TRAP = \"total_touches_all\"\n", + "\n", + "with (BUNDLE / \"manifest.json\").open() as fh:\n", + " manifest = json.load(fh)\n", + "assert manifest[\"exposure_mode\"] == \"student_public\"\n", + "assert manifest[\"relational_snapshot_safe\"] is True\n", + "\n", + "train = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"train.parquet\")\n", + "test = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"test.parquet\")\n", + "print(f\"train rows: {len(train):,}\")\n", + "print(f\"test rows: {len(test):,}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_003", + "metadata": {}, + "source": "## 2. Train the headline LR + GBM panel\n\nSame preprocessing as notebooks 01 / 02 (mirrors\n`leadforge.validation.release_quality._build_pipeline`).\nWe drop the documented leakage trap `total_touches_all`\nhere so the calibration / lift / value plots in sections\n3–6 reflect an honest production model. 
The cohort-shift\nsection in section 7 uses the validator's full-panel\nposture (trap kept) so its number is comparable to the\npublished validation report." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_004", + "metadata": {}, + "outputs": [], + "source": [ + "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n", + "EXCLUDE_HEADLINE = set(ID_COLS + [TASK, TRAP])\n", + "headline_cols = [c for c in train.columns if c not in EXCLUDE_HEADLINE]\n", + "cat_cols = [\n", + " c\n", + " for c in headline_cols\n", + " if not (pd.api.types.is_bool_dtype(train[c]) or pd.api.types.is_numeric_dtype(train[c]))\n", + "]\n", + "num_cols = [c for c in headline_cols if c not in cat_cols]\n", + "print(f\"headline panel: {len(headline_cols)} cols (trap dropped)\")\n", + "\n", + "\n", + "def _sanitize(df: pd.DataFrame, cats: list[str]) -> pd.DataFrame:\n", + " out = df.copy()\n", + " for c in cats:\n", + " out[c] = out[c].astype(object).where(out[c].notna(), None)\n", + " return out\n", + "\n", + "\n", + "def build_pipeline(num: list[str], cat: list[str], *, model: str) -> Pipeline:\n", + " pre = ColumnTransformer(\n", + " [\n", + " (\n", + " \"num\",\n", + " Pipeline(\n", + " [\n", + " (\"imputer\", SimpleImputer(strategy=\"median\")),\n", + " (\"scaler\", StandardScaler()),\n", + " ]\n", + " ),\n", + " num,\n", + " ),\n", + " (\n", + " \"cat\",\n", + " Pipeline(\n", + " [\n", + " (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n", + " (\n", + " \"encoder\",\n", + " OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False),\n", + " ),\n", + " ]\n", + " ),\n", + " cat,\n", + " ),\n", + " ],\n", + " remainder=\"drop\",\n", + " )\n", + " clf = (\n", + " LogisticRegression(max_iter=1000, solver=\"lbfgs\", random_state=SEED)\n", + " if model == \"lr\"\n", + " else HistGradientBoostingClassifier(random_state=SEED)\n", + " )\n", + " return Pipeline([(\"preprocessor\", pre), (\"classifier\", clf)])\n", + "\n", + "\n", + 
"y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", + "y_test = test[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", + "base_rate = float(y_test.mean())\n", + "\n", + "x_train = _sanitize(train[headline_cols], cat_cols)\n", + "x_test = _sanitize(test[headline_cols], cat_cols)\n", + "\n", + "lr_pipe = build_pipeline(num_cols, cat_cols, model=\"lr\").fit(x_train, y_train)\n", + "gbm_pipe = build_pipeline(num_cols, cat_cols, model=\"gbm\").fit(x_train, y_train)\n", + "lr_probs = lr_pipe.predict_proba(x_test)[:, 1]\n", + "gbm_probs = gbm_pipe.predict_proba(x_test)[:, 1]\n", + "\n", + "print(f\" base rate: {base_rate:.3f}\")\n", + "print(\n", + " f\" LR AUC: {roc_auc_score(y_test, lr_probs):.4f} \"\n", + " f\"AP: {average_precision_score(y_test, lr_probs):.4f} \"\n", + " f\"Brier: {brier_score_loss(y_test, lr_probs):.4f}\"\n", + ")\n", + "print(\n", + " f\" GBM AUC: {roc_auc_score(y_test, gbm_probs):.4f} \"\n", + " f\"AP: {average_precision_score(y_test, gbm_probs):.4f} \"\n", + " f\"Brier: {brier_score_loss(y_test, gbm_probs):.4f}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_005", + "metadata": {}, + "source": "## 3. Calibration / reliability diagram\n\nBin LR's predicted probabilities into ten equal-width\nbuckets, plot mean predicted vs mean observed. A perfectly\ncalibrated model lies on the diagonal; LR after\n`StandardScaler + LogisticRegression` is usually close.\nWe also surface `max_bin_error` — the worst gap across\nnon-empty bins — which the validation report tracks\n(`tiers.intermediate.medians.calibration_max_bin_error`)." 
+ }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_006", + "metadata": {}, + "outputs": [], + "source": [ + "edges = np.linspace(0.0, 1.0, 11)\n", + "mean_pred: list[float] = []\n", + "mean_actual: list[float] = []\n", + "bin_n: list[int] = []\n", + "for i in range(10):\n", + " lo, hi = edges[i], edges[i + 1]\n", + " mask = (lr_probs >= lo) & ((lr_probs <= hi) if i == 9 else (lr_probs < hi))\n", + " if mask.sum() == 0:\n", + " continue\n", + " mean_pred.append(float(lr_probs[mask].mean()))\n", + " mean_actual.append(float(y_test[mask].mean()))\n", + " bin_n.append(int(mask.sum()))\n", + "\n", + "max_bin_err = max(abs(p - a) for p, a in zip(mean_pred, mean_actual, strict=False))\n", + "print(f\"max bin error (LR): {max_bin_err:.4f}\")\n", + "for p, a, n in zip(mean_pred, mean_actual, bin_n, strict=False):\n", + " print(f\" pred={p:.3f} actual={a:.3f} n={n:>4d}\")\n", + "\n", + "fig, ax = plt.subplots(figsize=(5, 5))\n", + "ax.plot([0, 1], [0, 1], color=\"#9ca3af\", linestyle=\"--\", label=\"perfect calibration\")\n", + "ax.plot(mean_pred, mean_actual, marker=\"o\", color=\"#3b82f6\", label=\"LR\")\n", + "ax.set_xlim(0, 1)\n", + "ax.set_ylim(0, 1)\n", + "ax.set_xlabel(\"Mean predicted probability\")\n", + "ax.set_ylabel(\"Observed conversion rate\")\n", + "ax.set_title(\"Calibration — LR, intermediate tier (seed 42)\")\n", + "ax.legend(loc=\"upper left\")\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "cell_007", + "metadata": {}, + "source": "## 4. Lift and cumulative gains\n\nTwo complementary curves:\n\n* **Cumulative gains** — fraction of positives captured as\n you sweep the score threshold. 
Top 10 % of the ranked\n list captures ~26 % of converted leads on this seed (vs\n the 10 % a random ranker would catch).\n* **Lift at *k* %** — `top_k_conversion_rate / base_rate`.\n Lift = 2 means \"the top *k* % of leads convert at twice\n the base rate.\"\n\nBoth metrics are in `release/validation/validation_report.json`\n(`per_seed[0].cumulative_gains` and `per_seed[0].lift_at_pct`)\nso the reproduction is auditable." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_008", + "metadata": {}, + "outputs": [], + "source": [ + "order = np.argsort(-lr_probs, kind=\"stable\")\n", + "y_sorted = y_test[order]\n", + "n = len(y_test)\n", + "n_pos = int(y_test.sum())\n", + "\n", + "# Cumulative gains: fraction of positives captured by top-pct.\n", + "pcts = np.arange(0, 101, 10)\n", + "gains = []\n", + "for pct in pcts:\n", + " k = max(1, int(round(n * pct / 100.0)))\n", + " if pct == 0:\n", + " gains.append(0.0)\n", + " else:\n", + " gains.append(float(y_sorted[:k].sum() / n_pos))\n", + "\n", + "# Lift at 1 / 5 / 10 %.\n", + "lifts = {}\n", + "for pct in [1.0, 5.0, 10.0]:\n", + " k = max(1, int(round(n * pct / 100.0)))\n", + " lifts[pct] = float(y_sorted[:k].mean() / base_rate)\n", + "\n", + "for pct, lift in lifts.items():\n", + " print(f\" lift @ top {pct:>4.0f}%: {lift:.3f}x\")\n", + "\n", + "fig, axes = plt.subplots(1, 2, figsize=(11, 4))\n", + "axes[0].plot(pcts, gains, marker=\"o\", color=\"#3b82f6\", label=\"LR\")\n", + "axes[0].plot([0, 100], [0, 1], color=\"#9ca3af\", linestyle=\"--\", label=\"random\")\n", + "axes[0].set_xlabel(\"Top-pct of ranked leads\")\n", + "axes[0].set_ylabel(\"Cumulative conversion capture\")\n", + "axes[0].set_title(\"Cumulative gains\")\n", + "axes[0].legend(loc=\"lower right\")\n", + "\n", + "axes[1].bar(\n", + " [str(int(p)) for p in lifts],\n", + " list(lifts.values()),\n", + " color=\"#3b82f6\",\n", + ")\n", + "axes[1].axhline(1.0, color=\"#ef4444\", linestyle=\"--\", label=\"random (lift=1)\")\n", + 
"axes[1].set_xlabel(\"Top-pct of ranked leads\")\n", + "axes[1].set_ylabel(\"Lift over base rate\")\n", + "axes[1].set_title(\"Lift at top-pct\")\n", + "axes[1].legend()\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "cell_009", + "metadata": {}, + "source": "## 5. Value-aware ranking — `expected_acv` × P(convert)\n\nSales reps don't have infinite capacity, so the right\nobjective is rarely \"maximise conversion count\" — it's\n\"maximise revenue captured per outreach slot.\" The bundle\nships an `expected_acv` column (opportunity ACV when\navailable, else revenue-band midpoint heuristic) which\nmakes value-aware ranking trivial:\n\n$$ \\text{score}_\\text{value} = P(\\text{convert}) \\times\n\\text{expected\\_acv} $$\n\nWe compare two top-K policies — rank by P(convert) only\nvs rank by score_value — and report\n`expected_acv_capture_at_k = sum(acv * y) over top-K /\nsum(acv * y) over the whole test`. The validation report's\n`per_seed[0].expected_acv_capture_at_k` is the reference." 
+ }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_010", + "metadata": {}, + "outputs": [], + "source": [ + "acv = pd.to_numeric(test[\"expected_acv\"], errors=\"coerce\").fillna(0.0).to_numpy()\n", + "value_score = lr_probs * acv\n", + "\n", + "\n", + "def acv_capture(scores: np.ndarray, k: int) -> float:\n", + " order = np.argsort(-scores, kind=\"stable\")\n", + " captured = float(np.sum(acv[order[:k]] * y_test[order[:k]]))\n", + " total = float(np.sum(acv * y_test))\n", + " return captured / total if total > 0 else float(\"nan\")\n", + "\n", + "\n", + "print(f\"{'top-K':<6s} {'cap by P(conv)':>14s} {'cap by P×ACV':>13s} {'gain':>7s}\")\n", + "value_gains = {}\n", + "for k in (50, 100, 200):\n", + " cap_p = acv_capture(lr_probs, k)\n", + " cap_v = acv_capture(value_score, k)\n", + " value_gains[k] = cap_v - cap_p\n", + " print(f\" top {k:<3d} {cap_p:>14.4f} {cap_v:>13.4f} {cap_v - cap_p:+7.4f}\")\n", + "\n", + "# Plot side-by-side ACV capture for K in 10..300.\n", + "ks = np.arange(10, 301, 10)\n", + "cap_p = [acv_capture(lr_probs, int(k)) for k in ks]\n", + "cap_v = [acv_capture(value_score, int(k)) for k in ks]\n", + "fig, ax = plt.subplots(figsize=(7, 4))\n", + "ax.plot(ks, cap_p, marker=\"o\", color=\"#9ca3af\", label=\"rank by P(convert)\")\n", + "ax.plot(ks, cap_v, marker=\"o\", color=\"#3b82f6\", label=\"rank by P(convert)×ACV\")\n", + "ax.set_xlabel(\"top-K leads contacted\")\n", + "ax.set_ylabel(\"Fraction of converted-ACV captured\")\n", + "ax.set_title(\"Value-aware ranking captures more revenue per outreach slot\")\n", + "ax.legend(loc=\"lower right\")\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "cell_011", + "metadata": {}, + "source": "## 6. Threshold selection for fixed top-K capacity\n\nSales rarely has the patience for \"score everything, run\nstats.\" The realistic ask is: *\"My team can work 50 leads\nthis week. 
Set a probability threshold that selects ~50\nfrom the test population.\"*\n\nWe sweep the probability threshold across the LR score\ndistribution and report precision / recall / count above\nthreshold for each step, then pick the threshold whose\ncount is closest to the requested capacity." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_012", + "metadata": {}, + "outputs": [], + "source": [ + "CAPACITY = 50\n", + "\n", + "sorted_probs = np.sort(lr_probs)[::-1]\n", + "# The K-th highest probability is the largest threshold that\n", + "# admits at least K leads (ties at the cutoff can add more).\n", + "threshold = float(sorted_probs[CAPACITY - 1])\n", + "mask = lr_probs >= threshold\n", + "n_above = int(mask.sum())\n", + "prec = float(y_test[mask].mean()) if n_above > 0 else float(\"nan\")\n", + "recall = float(y_test[mask].sum() / max(int(y_test.sum()), 1))\n", + "print(\n", + " f\"capacity={CAPACITY} threshold={threshold:.3f} \"\n", + " f\"actually_above={n_above} precision={prec:.3f} recall={recall:.3f}\"\n", + ")\n", + "\n", + "# Threshold sweep — show what happens around the operating\n", + "# point so the threshold choice is informed, not magic.\n", + "thresholds = np.linspace(float(np.quantile(lr_probs, 0.5)), float(np.max(lr_probs)), 30)\n", + "counts = [int((lr_probs >= t).sum()) for t in thresholds]\n", + "precs = [\n", + " float(y_test[lr_probs >= t].mean()) if (lr_probs >= t).sum() > 0 else 0.0 for t in thresholds\n", + "]\n", + "\n", + "fig, axes = plt.subplots(1, 2, figsize=(11, 4))\n", + "axes[0].plot(thresholds, counts, marker=\"o\", color=\"#3b82f6\")\n", + "axes[0].axhline(CAPACITY, color=\"#ef4444\", linestyle=\"--\", label=f\"capacity={CAPACITY}\")\n", + "axes[0].axvline(threshold, color=\"#10b981\", linestyle=\"--\", label=f\"chosen ({threshold:.3f})\")\n", + "axes[0].set_xlabel(\"threshold\")\n", + "axes[0].set_ylabel(\"# leads above threshold\")\n", + "axes[0].set_title(\"Threshold sweep — count above\")\n", + 
"axes[0].legend()\n", + "\n", + "axes[1].plot(thresholds, precs, marker=\"o\", color=\"#3b82f6\")\n", + "axes[1].axhline(base_rate, color=\"#9ca3af\", linestyle=\"--\", label=f\"base rate ({base_rate:.3f})\")\n", + "axes[1].axvline(threshold, color=\"#10b981\", linestyle=\"--\", label=f\"chosen ({threshold:.3f})\")\n", + "axes[1].set_xlabel(\"threshold\")\n", + "axes[1].set_ylabel(\"precision above threshold\")\n", + "axes[1].set_title(\"Threshold sweep — precision above\")\n", + "axes[1].legend()\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "cell_013", + "metadata": {}, + "source": "## 7. Cohort-shift evaluation\n\nThe bundle's train/test split is a uniform random split of\nleads. A more realistic stress test is \"train on the first\n85 % of leads chronologically, score the last 15 %\"—\nbecause in production you always have to predict the\n*future*, never a held-out random sample of the past.\n\nWe mirror the validator's cohort-shift logic\n(`leadforge.validation.release_quality.measure_cohort_shift_from_bundle`):\npool train + test, sort by `lead_created_at` with `lead_id`\nas a stable tiebreak, train HistGBM on the first 85 % and\nscore the last 15 % (the validator's `COHORT_TRAIN_FRAC`).\nBoth random and cohort splits use the full feature panel\n**including** the trap, matching the report's posture so\nthe numbers compare directly. The HistGBM uses\n`random_state=0` here (the validator's\n`DEFAULT_MODEL_RANDOM_STATE`) instead of the notebook's\ndefault `SEED=42` — that matters for the cohort-shift\nreproduction down to the third decimal.\n\nThe expected behaviour for the v1 intermediate tier is\n*no* degradation — the report shows the cohort split AUC\nrunning ~0.015 *higher* than the random split. That's a\nsurprise worth surfacing: the v1 simulator's intermediate\nworld doesn't drift over its 90-day horizon, so cohort\norder isn't a stressor here. 
The intro and advanced\ntiers show small positive degradations (intro +0.016,\nadvanced +0.010) — see\n`release/validation/validation_report.json` ⇒\n`cohort_shift`." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_014", + "metadata": {}, + "outputs": [], + "source": [ + "# Constants mirror leadforge.validation.release_quality so\n", + "# the numbers reproduce the report's cohort-shift block.\n", + "COHORT_TRAIN_FRAC = 0.85\n", + "COHORT_MODEL_SEED = 0\n", + "\n", + "# Cohort-shift uses the validator's full panel (trap kept).\n", + "EXCLUDE_FULL = set(ID_COLS + [TASK])\n", + "full_cols = [c for c in train.columns if c not in EXCLUDE_FULL]\n", + "cat_full = [\n", + " c\n", + " for c in full_cols\n", + " if not (pd.api.types.is_bool_dtype(train[c]) or pd.api.types.is_numeric_dtype(train[c]))\n", + "]\n", + "num_full = [c for c in full_cols if c not in cat_full]\n", + "\n", + "\n", + "def _gbm_pipeline_for_cohort() -> Pipeline:\n", + " # Local builder so the validator's ``model_random_state=0``\n", + " # is used here, while the headline panel above keeps\n", + " # ``random_state=SEED`` for the section-2 LR/GBM models.\n", + " pre = ColumnTransformer(\n", + " [\n", + " (\n", + " \"num\",\n", + " Pipeline(\n", + " [\n", + " (\"imputer\", SimpleImputer(strategy=\"median\")),\n", + " (\"scaler\", StandardScaler()),\n", + " ]\n", + " ),\n", + " num_full,\n", + " ),\n", + " (\n", + " \"cat\",\n", + " Pipeline(\n", + " [\n", + " (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n", + " (\n", + " \"encoder\",\n", + " OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False),\n", + " ),\n", + " ]\n", + " ),\n", + " cat_full,\n", + " ),\n", + " ],\n", + " remainder=\"drop\",\n", + " )\n", + " clf = HistGradientBoostingClassifier(random_state=COHORT_MODEL_SEED)\n", + " return Pipeline([(\"preprocessor\", pre), (\"classifier\", clf)])\n", + "\n", + "\n", + "# Random split AUC = HistGBM on the bundle's existing split.\n", + "rand_pipe = 
_gbm_pipeline_for_cohort().fit(_sanitize(train[full_cols], cat_full), y_train)\n", + "random_split_auc = float(\n", + " roc_auc_score(\n", + " y_test,\n", + " rand_pipe.predict_proba(_sanitize(test[full_cols], cat_full))[:, 1],\n", + " )\n", + ")\n", + "\n", + "# Chronological resplit: pool, sort by lead_created_at +\n", + "# lead_id (stable tiebreak), take first 85 % as train, last\n", + "# 15 % as test. Mirrors ``measure_cohort_shift_from_bundle``.\n", + "pooled = pd.concat([train, test], ignore_index=True)\n", + "ts = pd.to_datetime(pooled[\"lead_created_at\"], errors=\"coerce\")\n", + "assert not ts.isna().any(), \"expected every lead to have a parseable lead_created_at\"\n", + "sort_frame = pd.DataFrame({\"_ts\": ts.values, \"_lid\": pooled[\"lead_id\"].astype(str).values})\n", + "order = sort_frame.sort_values([\"_ts\", \"_lid\"], kind=\"stable\").index.to_numpy()\n", + "cutoff = int(round(len(pooled) * COHORT_TRAIN_FRAC))\n", + "early = pooled.iloc[order[:cutoff]]\n", + "late = pooled.iloc[order[cutoff:]]\n", + "y_early = early[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", + "y_late = late[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", + "\n", + "cohort_pipe = _gbm_pipeline_for_cohort().fit(_sanitize(early[full_cols], cat_full), y_early)\n", + "cohort_split_auc = float(\n", + " roc_auc_score(\n", + " y_late,\n", + " cohort_pipe.predict_proba(_sanitize(late[full_cols], cat_full))[:, 1],\n", + " )\n", + ")\n", + "auc_degradation = random_split_auc - cohort_split_auc\n", + "print(f\"random_split_auc: {random_split_auc:.4f}\")\n", + "print(f\"cohort_split_auc: {cohort_split_auc:.4f}\")\n", + "print(f\"auc_degradation: {auc_degradation:+.4f} (positive = cohort is harder)\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_015", + "metadata": {}, + "source": "## 8. 
Bootstrap robustness — within-bundle metric variance\n\nCross-seed metric variance (the validation report's\n`tiers.intermediate.spreads.gbm_auc = 0.027`) is the\ncleanest answer to \"how confident is this AUC?\", but it\nrequires regenerating the bundle from N seeds — something\na public-bundle consumer (Kaggle / HF) can't easily do.\n\nThe within-bundle proxy is **non-parametric bootstrap of\nthe test set**. We resample the 750 test rows with\nreplacement, re-rank using the model probabilities we\nalready have, and recompute AUC / AP. 200 resamples is\nenough to read a confidence band off the distribution.\n\nThe bootstrap variance is **smaller** than the cross-seed\nvariance — it captures sampling noise on a single\ngenerated world, not generation-process noise across\nseeds — but it's the right number for the question\n\"given *this* test set, how stable is the AUC?\"" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_016", + "metadata": {}, + "outputs": [], + "source": [ + "N_BOOT = 200\n", + "rng = np.random.default_rng(SEED)\n", + "\n", + "boot_lr_auc = np.empty(N_BOOT)\n", + "boot_gbm_auc = np.empty(N_BOOT)\n", + "boot_lr_ap = np.empty(N_BOOT)\n", + "n_test = len(y_test)\n", + "for i in range(N_BOOT):\n", + " idx = rng.integers(0, n_test, n_test)\n", + " if y_test[idx].sum() == 0 or y_test[idx].sum() == n_test:\n", + " # Degenerate resample (one class only): record NaN and skip.\n", + " boot_lr_auc[i] = np.nan\n", + " boot_gbm_auc[i] = np.nan\n", + " boot_lr_ap[i] = np.nan\n", + " continue\n", + " boot_lr_auc[i] = roc_auc_score(y_test[idx], lr_probs[idx])\n", + " boot_gbm_auc[i] = roc_auc_score(y_test[idx], gbm_probs[idx])\n", + " boot_lr_ap[i] = average_precision_score(y_test[idx], lr_probs[idx])\n", + "\n", + "\n", + "def _summary(arr: np.ndarray, name: str) -> None:\n", + " arr = arr[~np.isnan(arr)]\n", + " lo, med, hi = np.quantile(arr, [0.025, 0.5, 0.975])\n", + " print(\n", + " f\" {name:<14s} median={med:.4f} \"\n", + " f\"95% CI=[{lo:.4f}, {hi:.4f}] 
IQR={(np.quantile(arr, 0.75) - np.quantile(arr, 0.25)):.4f}\"\n", + " )\n", + "\n", + "\n", + "print(f\"bootstrap on test set, n_iters={N_BOOT}, seed={SEED}:\")\n", + "_summary(boot_lr_auc, \"LR AUC\")\n", + "_summary(boot_gbm_auc, \"GBM AUC\")\n", + "_summary(boot_lr_ap, \"LR AP\")\n", + "\n", + "fig, ax = plt.subplots(figsize=(7, 4))\n", + "ax.hist(boot_lr_auc, bins=30, color=\"#3b82f6\", alpha=0.7, label=\"LR AUC\")\n", + "ax.hist(boot_gbm_auc, bins=30, color=\"#9ca3af\", alpha=0.7, label=\"GBM AUC\")\n", + "ax.axvline(roc_auc_score(y_test, lr_probs), color=\"#1d4ed8\", linestyle=\"--\", label=\"LR (point)\")\n", + "ax.axvline(roc_auc_score(y_test, gbm_probs), color=\"#374151\", linestyle=\"--\", label=\"GBM (point)\")\n", + "ax.set_xlabel(\"AUC\")\n", + "ax.set_ylabel(\"# bootstrap draws\")\n", + "ax.set_title(f\"Bootstrap AUC distribution (n={N_BOOT})\")\n", + "ax.legend(loc=\"upper left\", fontsize=8)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "cell_017", + "metadata": {}, + "source": "## 9. Tolerance gate (G13.2)\n\nThree groups of pinned values:\n\n* **Cohort-shift block** — pinned to\n `release/notebooks/_release_targets.json`'s\n `cohort_shift.intermediate`, which is itself audit-synced\n against `validation_report.json`'s `cohort_shift.intermediate`\n by `tests/release/notebooks/test_release_targets_match_report.py`.\n That audit-sync is what makes the \"this notebook\n reproduces the report\" claim meaningful.\n* **Calibration / lift / value-capture** — pinned inline\n against the seed-42 single-run values from the\n validation report's `per_seed[0]` block. 
Tolerances\n widen for small-K metrics (P@K, value capture) because\n their seed-to-seed variance is larger.\n* **Bootstrap medians** — pinned inline against the\n seed-42 point estimates (the bootstrap median converges\n to the data-specific value, not to the cross-seed\n median).\n\nThe headline lift sign-check (`gbm_auc > lr_auc - eps`) is\n*not* asserted — the v1 dataset documents the surprising\nfinding that LR ≥ GBM on intermediate; see\n`release/validation/validation_report.md` gate G7.4.4)." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_018", + "metadata": {}, + "outputs": [], + "source": [ + "with (Path.cwd() / \"_release_targets.json\").open() as fh:\n", + " release_targets = json.load(fh)\n", + "cohort_targets = release_targets[\"cohort_shift\"][\"intermediate\"]\n", + "\n", + "cohort_observed = {\n", + " \"random_split_auc\": random_split_auc,\n", + " \"cohort_split_auc\": cohort_split_auc,\n", + " \"auc_degradation\": auc_degradation,\n", + "}\n", + "assert_within_tolerance(\n", + " observed=cohort_observed,\n", + " target=cohort_targets,\n", + " tolerances={\n", + " # ±0.02 on AUCs — well outside numerical jitter,\n", + " # well inside the band that would let the\n", + " # cohort-shift sign flip silently.\n", + " \"random_split_auc\": 0.02,\n", + " \"cohort_split_auc\": 0.02,\n", + " # Wider on the difference because both AUCs are\n", + " # within tolerance, so the difference can drift up\n", + " # to ±0.04 in the worst case.\n", + " \"auc_degradation\": 0.04,\n", + " },\n", + " label=\"notebook 04 cohort-shift vs validation_report (intermediate)\",\n", + ")\n", + "\n", + "# Inline pins for the seed-42 single-run values *of the\n", + "# without-trap headline panel*. These are not the report's\n", + "# published numbers (the report keeps the trap) — the\n", + "# report-level pin lives in section 9's cohort-shift block,\n", + "# which is the only metric this notebook reproduces against\n", + "# the report. 
Notebook 02 trains the same trap-dropped LR\n", + "# and reports the same AUCs, so these values are also\n", + "# cross-checked there.\n", + "NB04_TARGETS = {\n", + " \"lr_auc\": 0.8737,\n", + " \"gbm_auc\": 0.8432,\n", + " \"lr_max_bin_err\": 0.1344,\n", + " \"lift_at_5pct\": 2.4819,\n", + " \"lift_at_10pct\": 2.7536,\n", + " \"acv_cap_50\": 0.1615,\n", + " \"acv_cap_100\": 0.3702,\n", + " # Bootstrap medians converge to the seed-42 point\n", + " # estimates within sampling noise.\n", + " \"boot_lr_auc_median\": 0.8757,\n", + " \"boot_gbm_auc_median\": 0.8440,\n", + "}\n", + "NB04_TOLERANCES = {\n", + " \"lr_auc\": 0.02,\n", + " \"gbm_auc\": 0.02,\n", + " \"lr_max_bin_err\": 0.05,\n", + " \"lift_at_5pct\": 0.30,\n", + " \"lift_at_10pct\": 0.30,\n", + " \"acv_cap_50\": 0.05,\n", + " \"acv_cap_100\": 0.05,\n", + " \"boot_lr_auc_median\": 0.03,\n", + " \"boot_gbm_auc_median\": 0.03,\n", + "}\n", + "observed = {\n", + " \"lr_auc\": float(roc_auc_score(y_test, lr_probs)),\n", + " \"gbm_auc\": float(roc_auc_score(y_test, gbm_probs)),\n", + " \"lr_max_bin_err\": float(max_bin_err),\n", + " \"lift_at_5pct\": lifts[5.0],\n", + " \"lift_at_10pct\": lifts[10.0],\n", + " \"acv_cap_50\": acv_capture(lr_probs, 50),\n", + " \"acv_cap_100\": acv_capture(lr_probs, 100),\n", + " \"boot_lr_auc_median\": float(np.nanmedian(boot_lr_auc)),\n", + " \"boot_gbm_auc_median\": float(np.nanmedian(boot_gbm_auc)),\n", + "}\n", + "assert_within_tolerance(\n", + " observed=observed,\n", + " target=NB04_TARGETS,\n", + " tolerances=NB04_TOLERANCES,\n", + " label=\"notebook 04 metric panel (seed 42, intermediate)\",\n", + ")\n", + "\n", + "# Sign-aware: value-aware ranking should not be worse than\n", + "# P-only ranking on aggregate. 
The headline finding stays\n", + "# in the narrative regardless of the exact numbers.\n", + "for k, gain in value_gains.items():\n", + " assert gain >= -0.01, (\n", + " f\"value-aware ranking lost ground at top-{k} ({gain:+.4f}); \"\n", + " \"the P×ACV story is no longer load-bearing\"\n", + " )\n", + "print(\"OK — cohort-shift, calibration, lift, value-capture, and bootstrap all pinned.\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_019", + "metadata": {}, + "source": "## 10. Summary\n\n* The LR baseline is well-calibrated (max bin error ≈ 0.13\n on this seed) and lifts the top decile to ~2.6× the base\n rate.\n* Value-aware ranking (P × ACV) captures more revenue per\n top-K slot than P-only ranking — the gap depends on K\n but is positive across all sizes we tested.\n* Cohort shift is **negative** on the intermediate tier\n (the late cohort is *easier*, not harder); the report\n documents this, and the notebook reproduces it. The\n intro and advanced tiers show small positive\n degradations.\n* Bootstrap on the existing test split gives a within-\n bundle confidence band that's tighter than the cross-seed\n spread the validation report computes — useful for \"how\n confident is this single AUC\" questions, not for \"how\n much does the bundle move across seeds.\"\n\n## Where to go next\n\n1. Try cohort-shifted training in production: refit weekly\n on the trailing 60-day window, score the next 7 days.\n2. If you have real ACV data, swap the `expected_acv`\n heuristic for it and recompute section 5 — the revenue\n capture story should sharpen.\n3. The break-me playbook in `docs/release/break_me_guide.md`\n (coming in PR 6.3) catalogues additional stress tests\n (target-encoding leakage, train-test contamination,\n cohort-by-segment) and how to detect each from a\n single bundle." 
+ } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/release/notebooks/_release_targets.json b/release/notebooks/_release_targets.json index 7ed4cdb..9d36f7c 100644 --- a/release/notebooks/_release_targets.json +++ b/release/notebooks/_release_targets.json @@ -1,5 +1,13 @@ { "_doc": "Cross-seed-median metric values from release/validation/validation_report.json, sliced to the metrics the release notebooks pin via assert_within_tolerance. Audited against the report by tests/release/notebooks/test_release_targets_match_report.py — if you change a value here, the test will fail unless the corresponding median in the validation report changes to match.", + "cohort_shift": { + "_doc": "Per-tier cohort-shift metrics from validation_report.cohort_shift (single-seed values; the report runs cohort-shift only on seed 42). Notebook 04 reproduces these via a chronological resplit and pins them via assert_within_tolerance.", + "intermediate": { + "auc_degradation": -0.015458147938307687, + "cohort_split_auc": 0.8908394607843138, + "random_split_auc": 0.8753813128460061 + } + }, "intermediate": { "brier_score": 0.10963449613199748, "gbm_auc": 0.875461913160326, diff --git a/scripts/build_release_notebook_03.py b/scripts/build_release_notebook_03.py new file mode 100644 index 0000000..3df45bb --- /dev/null +++ b/scripts/build_release_notebook_03.py @@ -0,0 +1,542 @@ +"""One-shot builder for ``release/notebooks/03_leakage_and_time_windows.ipynb``. + +Run from the repository root:: + + python scripts/build_release_notebook_03.py + +Cells are assigned deterministic IDs by ``_release_notebook_common`` so +re-running yields a byte-identical file — same audit-artifact-sync +pattern PR 4.1 / 5.1 / 5.2 use for ``release/`` artifacts. 
+""" + +from __future__ import annotations + +import sys +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).resolve().parent)) + +import nbformat as nbf # noqa: E402 — must follow sys.path insert +from _release_notebook_common import ( # noqa: E402 — must follow sys.path insert + assemble_notebook, + builder_arg_parser, + code, + md, + write_notebook, +) + +DEFAULT_OUT = ( + Path(__file__).resolve().parents[1] + / "release" + / "notebooks" + / "03_leakage_and_time_windows.ipynb" +) + + +def cells() -> list[nbf.NotebookNode]: + return [ + md( + """ + # Notebook 03 — Leakage and Time Windows + + **Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier. + + The bundle ships with one **deliberate leakage trap**: + `total_touches_all`. The feature dictionary marks it + `leakage_risk = True`; the dataset card calls it out; + notebook 01 keeps it (matching the validation report's + panel) while notebook 02 drops it. This notebook turns the + trap into a teaching moment. + + We do four things: + + 1. **Read the receipts.** The trap is documented in the + feature dictionary. We surface that label. + 2. **Time-window proof.** Quantify how much + `total_touches_all` differs from its snapshot-safe + sibling `touch_count` — the difference is post-snapshot + information by construction, regardless of how + predictive that information turns out to be. + 3. **The lesson.** Run a single-column standalone-AUC probe + on the trap (it looks innocuous, ~0.53). Then run the + full-panel ± trap comparison: HistGBM extracts a + substantial AUC lift (+0.03) from a column whose + standalone AUC is barely above chance. Standalone + probes undersell tree-friendly leakage. + 4. **Pin the deltas.** Sign-aware tolerance gates so a + future regeneration that neutralises the trap (or + accidentally amplifies it) breaks CI. + + **Public path discipline (G13.3).** This notebook reads only + from `release/intermediate/` (the public student bundle). 
The + instructor companion is **not** loaded — leakage detection + has to work from the public artefact alone, since that's all + a downstream consumer ever has. + """ + ), + md("## 1. Setup"), + code( + """ + from __future__ import annotations + + import json + import sys + from pathlib import Path + + import matplotlib.pyplot as plt + import numpy as np + import pandas as pd + from sklearn.compose import ColumnTransformer + from sklearn.ensemble import HistGradientBoostingClassifier + from sklearn.impute import SimpleImputer + from sklearn.linear_model import LogisticRegression + from sklearn.metrics import roc_auc_score + from sklearn.pipeline import Pipeline + from sklearn.preprocessing import OneHotEncoder, StandardScaler + + sys.path.insert(0, str(Path.cwd())) + from _notebook_utils import assert_within_tolerance, precision_at_k + + SEED = 42 + BUNDLE = Path("../intermediate") # public student bundle + TASK = "converted_within_90_days" + TRAP = "total_touches_all" + + with (BUNDLE / "manifest.json").open() as fh: + manifest = json.load(fh) + assert manifest["exposure_mode"] == "student_public" + assert manifest["relational_snapshot_safe"] is True + SNAPSHOT_DAY = int(manifest["snapshot_day"]) + HORIZON_DAYS = int(manifest["horizon_days"]) + print(f"snapshot_day = {SNAPSHOT_DAY} horizon_days = {HORIZON_DAYS}") + """ + ), + md( + """ + ## 2. The trap, as the feature dictionary calls it out + + The release ships a `feature_dictionary.csv` next to the + data. Any column with `leakage_risk = True` is flagged as a + **deliberate teaching trap** — included so users can practise + detecting it, with the trap's nature documented inline. + + Treat the feature dictionary as the first place you look on + any new dataset. A column named `total_touches_all` is not + obviously bad until the dictionary tells you it counts + touches over the full 90-day horizon, well past the + 30-day snapshot anchor that defines the prediction time. 
+ """ + ), + code( + """ + feat_dict = pd.read_csv(BUNDLE / "feature_dictionary.csv") + traps = feat_dict[feat_dict["leakage_risk"].astype(bool)] + print(f"trap columns flagged in feature_dictionary.csv: {len(traps)}") + for _, row in traps.iterrows(): + print(f" {row['name']}: {row['description']}") + assert TRAP in set(traps["name"]), f"{TRAP} expected to be flagged in dictionary" + """ + ), + md( + """ + ## 3. Time-window proof — the trap *by construction* + + The dictionary *says* `total_touches_all` uses post-snapshot + data. We verify that on the same row that carries the trap: + the task table also carries `touch_count`, the + **snapshot-safe** touch aggregate (filtered to + `touch_timestamp <= lead_created_at + snapshot_day`). Their + difference is the **post-snapshot delta** — by construction, + information from days 31–90 that the model should never see + when scoring at day 30. + + The pedagogical point is independent of how predictive that + difference turns out to be. **A column that uses + post-snapshot data is invalid at scoring time even when it + looks unpredictive in isolation.** Section 4 measures that + "looks unpredictive in isolation" claim directly, then + section 5 shows it can be misleading. + + We pool all three task splits so the receipt covers every + lead in the bundle. 
+ """ + ), + code( + """ + train = pd.read_parquet(BUNDLE / "tasks" / TASK / "train.parquet") + valid = pd.read_parquet(BUNDLE / "tasks" / TASK / "valid.parquet") + test = pd.read_parquet(BUNDLE / "tasks" / TASK / "test.parquet") + + all_leads = pd.concat([train, valid, test], ignore_index=True) + assert all_leads["lead_id"].is_unique, ( + "expected one row per lead across train/valid/test" + ) + + window = all_leads[["lead_id", TRAP, "touch_count", TASK]].copy() + window[TRAP] = pd.to_numeric(window[TRAP], errors="coerce") + window["touch_count"] = pd.to_numeric(window["touch_count"], errors="coerce") + window = window.dropna(subset=[TRAP, "touch_count"]).copy() + window["post_snapshot_touches"] = window[TRAP] - window["touch_count"] + window[TASK] = window[TASK].astype("boolean").fillna(False).astype(int) + + print(f"leads used in this section: {len(window):,}") + print( + f" {TRAP:<22s} mean={window[TRAP].mean():6.2f} " + f"max={int(window[TRAP].max()):>4d}" + ) + print( + f" {'touch_count (snapshot-safe)':<22s} " + f"mean={window['touch_count'].mean():6.2f} " + f"max={int(window['touch_count'].max()):>4d}" + ) + mean_delta = float(window["post_snapshot_touches"].mean()) + n_post = int((window["post_snapshot_touches"] > 0).sum()) + print( + f" {'post-snapshot delta':<22s} " + f"mean={mean_delta:6.2f} " + f"max={int(window['post_snapshot_touches'].max()):>4d}" + ) + print( + f" → {n_post:,} of {len(window):,} leads " + f"({n_post / len(window):.1%}) have a positive post-snapshot delta" + ) + assert mean_delta > 0, ( + "expected a positive mean post-snapshot delta — if zero, the trap may " + "have been silently rebuilt as a snapshot-safe aggregate" + ) + """ + ), + md( + """ + ### 3.1 The post-snapshot delta is uncorrelated with the label *on this dataset* + + On the v1 procurement world, the count of touches between + day 30 and day 90 turns out to be roughly the same for + converted and non-converted leads — sales reps keep working + both groups for a while 
before the funnel settles. A + stronger world (more aggressive sales follow-up on hot + leads) would split these apart; this one doesn't. + + The plot below makes that lack-of-split visible. The trap + is *still a trap* — we just can't tell that from the + post-snapshot delta alone, which is why the validation + report's `post_snapshot_aggregates` baseline (a single- + column probe) gives an AUC of only ~0.55. The real damage + shows up when a tree model gets to combine the trap with + other columns; section 5 measures that. + """ + ), + code( + """ + grouped = window.groupby(TASK)["post_snapshot_touches"].agg(["mean", "median", "count"]) + grouped.index = grouped.index.map({0: "non-converted", 1: "converted"}) + print(grouped) + + fig, ax = plt.subplots(figsize=(6, 4)) + data = [ + window.loc[window[TASK] == 0, "post_snapshot_touches"], + window.loc[window[TASK] == 1, "post_snapshot_touches"], + ] + ax.boxplot(data, tick_labels=["non-converted", "converted"], showfliers=False) + ax.set_ylabel("post-snapshot touches (total_touches_all − touch_count)") + ax.set_title("Post-snapshot delta by label\\n(roughly the same — section 5 explains why this is misleading)") + plt.tight_layout() + plt.show() + """ + ), + md( + """ + ## 4. Standalone-AUC probe (the audit that almost lets the trap pass) + + A common leakage audit is to fit a one-feature classifier on + each suspect column and report the standalone AUC. The + validation report does this at scale — its + `post_snapshot_aggregates` baseline trains a model on the + single column `total_touches_all` and reports an AUC around + 0.55. That sounds tame, and on a busy schedule it's tempting + to clear the column on those grounds. + + We re-run the probe here so you've seen the number with your + own eyes: ~0.53. If that's all you measure, the trap looks + barely worth mentioning. Section 5 shows what that audit + misses. 
+ """ + ), + code( + """ + standalone_window = window.dropna(subset=[TRAP, "touch_count"]).copy() + y = standalone_window[TASK].astype(int).to_numpy() + standalone = { + TRAP: float(roc_auc_score(y, standalone_window[TRAP].to_numpy())), + "touch_count (snapshot-safe)": float( + roc_auc_score(y, standalone_window["touch_count"].to_numpy()) + ), + "post-snapshot delta": float( + roc_auc_score(y, standalone_window["post_snapshot_touches"].to_numpy()) + ), + } + print(f"{'feature':<32s} {'standalone AUC':>16s}") + for name, auc in standalone.items(): + print(f" {name:<30s} {auc:>16.4f}") + """ + ), + md( + """ + ## 5. Side-by-side AUC: full panel ± trap + + Train two HistGBM and two Logistic Regression baselines on + the **same train/test split, same model, same seed** — + the only thing that varies is whether `total_touches_all` + is in the column list. + + We use the full as-shipped feature panel (every public + snapshot column except IDs / label) as the baseline. This + mirrors notebook 01 / the validation report's setup, so the + with-trap AUC reproduces the report's published number and + the without-trap AUC is what notebook 02 starts from. + """ + ), + code( + """ + ID_COLS = ["account_id", "contact_id", "lead_id", "lead_created_at"] + EXCLUDE = set(ID_COLS + [TASK]) + + full_cols = [c for c in train.columns if c not in EXCLUDE] + full_cols_no_trap = [c for c in full_cols if c != TRAP] + print(f"full panel: {len(full_cols)} cols (incl. 
{TRAP})") + print(f"full panel no trap: {len(full_cols_no_trap)} cols") + + def _split_cols(df: pd.DataFrame, cols: list[str]) -> tuple[list[str], list[str]]: + cat = [ + c + for c in cols + if not ( + pd.api.types.is_bool_dtype(df[c]) + or pd.api.types.is_numeric_dtype(df[c]) + ) + ] + num = [c for c in cols if c not in cat] + return num, cat + + def _sanitize(df: pd.DataFrame, cat_cols: list[str]) -> pd.DataFrame: + out = df.copy() + for c in cat_cols: + out[c] = out[c].astype(object).where(out[c].notna(), None) + return out + + def _build_pipeline(num_cols: list[str], cat_cols: list[str], *, model: str) -> Pipeline: + num_t = Pipeline( + [ + ("imputer", SimpleImputer(strategy="median")), + ("scaler", StandardScaler()), + ] + ) + cat_t = Pipeline( + [ + ("imputer", SimpleImputer(strategy="most_frequent")), + ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)), + ] + ) + pre = ColumnTransformer( + [("num", num_t, num_cols), ("cat", cat_t, cat_cols)], + remainder="drop", + ) + if model == "lr": + clf = LogisticRegression(max_iter=1000, solver="lbfgs", random_state=SEED) + else: + clf = HistGradientBoostingClassifier(random_state=SEED) + return Pipeline([("preprocessor", pre), ("classifier", clf)]) + + def fit_score(cols: list[str], *, model: str) -> np.ndarray: + num_cols, cat_cols = _split_cols(train, cols) + pipe = _build_pipeline(num_cols, cat_cols, model=model) + pipe.fit(_sanitize(train[cols], cat_cols), y_train) + return pipe.predict_proba(_sanitize(test[cols], cat_cols))[:, 1] + + y_train = train[TASK].astype("boolean").fillna(False).astype(int).to_numpy() + y_test = test[TASK].astype("boolean").fillna(False).astype(int).to_numpy() + base_rate = float(y_test.mean()) + print(f"test base rate: {base_rate:.3f}") + """ + ), + code( + """ + results: dict[str, dict[str, float]] = {} + for model in ("lr", "gbm"): + p_with = fit_score(full_cols, model=model) + p_without = fit_score(full_cols_no_trap, model=model) + results[model] = { + 
"with_trap_auc": float(roc_auc_score(y_test, p_with)), + "without_trap_auc": float(roc_auc_score(y_test, p_without)), + "with_trap_p100": precision_at_k(p_with, y_test, 100), + "without_trap_p100": precision_at_k(p_without, y_test, 100), + } + + print(f"{'model':<5s} {'with trap':>10s} {'without trap':>13s} {'Δ AUC':>8s}") + for m, r in results.items(): + d = r["with_trap_auc"] - r["without_trap_auc"] + print( + f"{m:<5s} {r['with_trap_auc']:>10.4f} " + f"{r['without_trap_auc']:>13.4f} {d:+8.4f}" + ) + """ + ), + md( + """ + ### 5.1 The lesson — standalone AUC underestimates trap impact + + Section 4 says `total_touches_all` is barely above chance + (~0.53 AUC) on its own. Section 5 says HistGBM extracts a + sizeable lift (~+0.03 AUC) from the same column once it can + combine it with the rest of the feature panel. Both + measurements are correct; they just measure different things. + + **Why the gap?** A standalone-AUC probe asks *can this + column rank leads when it's the only signal you have?* A + tree model with the rest of the panel already in scope asks + *can this column refine my existing splits?* The trap's + post-snapshot information correlates with other columns + non-linearly — a few late touches by an outbound rep on an + engaged-but-not-yet-converted lead is a very different + signal from the same touches on a cold lead — and the + tree can carve the join, while a single-feature probe + never sees it. The Logistic Regression gain is much smaller + (~+0.01) for the same reason: it cannot represent that + interaction structure. + + Bar chart below highlights the asymmetry. 
+ """ + ), + code( + """ + labels = ["GBM full", "LR full"] + deltas = [ + results["gbm"]["with_trap_auc"] - results["gbm"]["without_trap_auc"], + results["lr"]["with_trap_auc"] - results["lr"]["without_trap_auc"], + ] + colors = ["#3b82f6", "#9ca3af"] + fig, ax = plt.subplots(figsize=(6, 4)) + ax.bar(range(len(labels)), deltas, color=colors) + ax.axhline(0.0, color="#1f2937", linewidth=0.8) + ax.axhline( + standalone[TRAP] - 0.5, + color="#ef4444", + linestyle="--", + label=f"standalone-AUC excess ({standalone[TRAP] - 0.5:+.3f})", + ) + ax.set_xticks(range(len(labels))) + ax.set_xticklabels(labels) + ax.set_ylabel("ΔAUC = with_trap − without_trap") + ax.set_title("Trap impact — tree models extract more than the probe predicts") + ax.legend(loc="best", fontsize=8) + plt.tight_layout() + plt.show() + """ + ), + md( + """ + ## 6. Tolerance gate (G13.2) + + Single-seed (seed=42) AUCs and trap deltas observed on the + as-shipped intermediate bundle. Tolerances pin each AUC to + within ±0.02 (well outside numerical jitter, well inside the + band that would hide a regression). The sign-aware + assertion below makes the pedagogical claim load-bearing: + if a regeneration ever neutralises the GBM trap-delta, this + fails — even if the absolute AUCs stay inside their bands. 
+ """ + ), + code( + """ + NB03_TARGETS = { + "lr_with_trap_auc": 0.8827, + "lr_without_trap_auc": 0.8737, + "gbm_with_trap_auc": 0.8754, + "gbm_without_trap_auc": 0.8432, + "trap_standalone_auc": 0.5310, + } + NB03_TOLERANCES = dict.fromkeys(NB03_TARGETS, 0.02) + + observed = { + "lr_with_trap_auc": results["lr"]["with_trap_auc"], + "lr_without_trap_auc": results["lr"]["without_trap_auc"], + "gbm_with_trap_auc": results["gbm"]["with_trap_auc"], + "gbm_without_trap_auc": results["gbm"]["without_trap_auc"], + "trap_standalone_auc": standalone[TRAP], + } + assert_within_tolerance( + observed=observed, + target=NB03_TARGETS, + tolerances=NB03_TOLERANCES, + label="notebook 03 trap-panel AUCs (seed 42, intermediate)", + ) + + # Sign-aware: GBM must extract a meaningful lift from the + # trap. Threshold sits well below the seed-42 observation + # (~+0.032) but well above LR's +0.009, so it specifically + # guards the tree-model lift the section-5 narrative claims. + MIN_GBM_LIFT = 0.015 + gbm_lift = ( + results["gbm"]["with_trap_auc"] - results["gbm"]["without_trap_auc"] + ) + assert gbm_lift > MIN_GBM_LIFT, ( + f"GBM trap-lift collapsed: {gbm_lift:+.4f} <= {MIN_GBM_LIFT:.4f} — " + "the trap is no longer carrying the pedagogical lesson in section 5" + ) + print("OK — trap-panel AUCs in tolerance and GBM lift positive.") + """ + ), + md( + """ + ## 7. A detection recipe you can run on any dataset + + The trap was easy to spot here because the dataset + *advertises* it. On a third-party dataset you don't get + that courtesy. The same recipe still works: + + 1. **Read any feature dictionary you have.** Any column + whose description references a window longer than the + prediction horizon is suspicious. Even when no + dictionary ships, an obvious naming smell (`*_total`, + `*_all`, `*_lifetime`) on a 30-day-snapshot dataset is a + flag. + 2. 
**Probe the standalone AUC** *and* **the contribution to + a tree model.** A standalone probe alone undersells + tree-friendly leakage (sections 4 and 5 demonstrate why + on this dataset). Train a model with the column, train + another without, and compare. The ablation captures + interactions the standalone probe can't. + 3. **Inspect the time window.** Cross-check the suspect + column against any time-stamped event tables. If the + column's value can only be explained by events past the + snapshot anchor, you've found a trap. Section 3 makes + this concrete here — the same technique generalises + anywhere there's an event table to corroborate. + + A walkthrough of additional detection patterns + (column-name heuristics, isolation-via-residuals, + target-encoding leakage on test) lives in + `docs/release/break_me_guide.md` (coming in PR 6.3) — pair + it with this notebook for a more complete playbook. + + ## Next + + - **Notebook 04** — value-aware ranking + (`expected_acv` × P(convert)), calibration plots, + threshold selection for top-K capacity, and a + cohort-shift / bootstrap robustness harness. + """ + ), + ] + + +def main() -> None: + args = builder_arg_parser( + default_out=DEFAULT_OUT, + description="Build release/notebooks/03_leakage_and_time_windows.ipynb", + ).parse_args() + write_notebook(args.out, assemble_notebook(cells())) + + +if __name__ == "__main__": + main() diff --git a/scripts/build_release_notebook_04.py b/scripts/build_release_notebook_04.py new file mode 100644 index 0000000..0c6dbb0 --- /dev/null +++ b/scripts/build_release_notebook_04.py @@ -0,0 +1,822 @@ +"""One-shot builder for ``release/notebooks/04_lift_calibration_value_ranking.ipynb``. + +Run from the repository root:: + + python scripts/build_release_notebook_04.py + +Cells are assigned deterministic IDs by ``_release_notebook_common`` so +re-running yields a byte-identical file — same audit-artifact-sync +pattern PR 4.1 / 5.1 / 5.2 use for ``release/`` artifacts. 
+""" + +from __future__ import annotations + +import sys +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).resolve().parent)) + +import nbformat as nbf # noqa: E402 — must follow sys.path insert +from _release_notebook_common import ( # noqa: E402 — must follow sys.path insert + assemble_notebook, + builder_arg_parser, + code, + md, + write_notebook, +) + +DEFAULT_OUT = ( + Path(__file__).resolve().parents[1] + / "release" + / "notebooks" + / "04_lift_calibration_value_ranking.ipynb" +) + + +def cells() -> list[nbf.NotebookNode]: + return [ + md( + """ + # Notebook 04 — Lift, Calibration, Value-Aware Ranking + + **Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier. + + AUC ranks well; that doesn't mean it ranks *for the right + thing*. Sales teams care about three additional concerns + AUC alone never tells you about: + + 1. **Calibration.** Are predicted probabilities trustworthy + as point estimates, or just as a ranking? + 2. **Value-aware ranking.** A 30 %-likely lead worth $200K + is more valuable than a 60 %-likely one worth $20K. + Ranking by P(convert) wastes ACV; ranking by + P(convert) × `expected_acv` doesn't. + 3. **Robustness.** Does the model still work next quarter + (cohort shift)? How tight is the metric on the test set + you have (bootstrap)? + + We answer all three on the public bundle, plus a threshold- + selection walkthrough that maps a fixed sales-capacity + constraint to an operating point. The notebook closes with + a tolerance gate that pins the cohort-shift result to the + published validation report — if a regeneration ever + silently changes the cohort-degradation behaviour, CI + catches it. + + **Public path discipline (G13.3).** Loads only + `release/intermediate/` (the public student bundle). + Instructor-only artefacts (the latent registry, full-horizon + event tables, hidden DAG) are never read. 
+ + **Trap discipline.** The headline LR / GBM panel drops + `total_touches_all` (per notebook 02's leakage discipline) + so the metrics it reports are honest production numbers. + The cohort-shift section deliberately *keeps* the trap to + reproduce the validation report's cohort-shift block — the + report's panel is the as-shipped one, and we want a + comparable number, not a cleaner one. + """ + ), + md("## 1. Setup"), + code( + """ + from __future__ import annotations + + import json + import sys + from pathlib import Path + + import matplotlib.pyplot as plt + import numpy as np + import pandas as pd + from sklearn.compose import ColumnTransformer + from sklearn.ensemble import HistGradientBoostingClassifier + from sklearn.impute import SimpleImputer + from sklearn.linear_model import LogisticRegression + from sklearn.metrics import ( + average_precision_score, + brier_score_loss, + roc_auc_score, + ) + from sklearn.pipeline import Pipeline + from sklearn.preprocessing import OneHotEncoder, StandardScaler + + sys.path.insert(0, str(Path.cwd())) + from _notebook_utils import assert_within_tolerance + + SEED = 42 + BUNDLE = Path("../intermediate") # public student bundle + TASK = "converted_within_90_days" + TRAP = "total_touches_all" + + with (BUNDLE / "manifest.json").open() as fh: + manifest = json.load(fh) + assert manifest["exposure_mode"] == "student_public" + assert manifest["relational_snapshot_safe"] is True + + train = pd.read_parquet(BUNDLE / "tasks" / TASK / "train.parquet") + test = pd.read_parquet(BUNDLE / "tasks" / TASK / "test.parquet") + print(f"train rows: {len(train):,}") + print(f"test rows: {len(test):,}") + """ + ), + md( + """ + ## 2. Train the headline LR + GBM panel + + Same preprocessing as notebooks 01 / 02 (mirrors + `leadforge.validation.release_quality._build_pipeline`). + We drop the documented leakage trap `total_touches_all` + here so the calibration / lift / value plots in sections + 3–6 reflect an honest production model. 
The cohort-shift + section in section 7 uses the validator's full-panel + posture (trap kept) so its number is comparable to the + published validation report. + """ + ), + code( + """ + ID_COLS = ["account_id", "contact_id", "lead_id", "lead_created_at"] + EXCLUDE_HEADLINE = set(ID_COLS + [TASK, TRAP]) + headline_cols = [c for c in train.columns if c not in EXCLUDE_HEADLINE] + cat_cols = [ + c + for c in headline_cols + if not ( + pd.api.types.is_bool_dtype(train[c]) + or pd.api.types.is_numeric_dtype(train[c]) + ) + ] + num_cols = [c for c in headline_cols if c not in cat_cols] + print(f"headline panel: {len(headline_cols)} cols (trap dropped)") + + def _sanitize(df: pd.DataFrame, cats: list[str]) -> pd.DataFrame: + out = df.copy() + for c in cats: + out[c] = out[c].astype(object).where(out[c].notna(), None) + return out + + def build_pipeline(num: list[str], cat: list[str], *, model: str) -> Pipeline: + pre = ColumnTransformer( + [ + ( + "num", + Pipeline( + [ + ("imputer", SimpleImputer(strategy="median")), + ("scaler", StandardScaler()), + ] + ), + num, + ), + ( + "cat", + Pipeline( + [ + ("imputer", SimpleImputer(strategy="most_frequent")), + ( + "encoder", + OneHotEncoder( + handle_unknown="ignore", sparse_output=False + ), + ), + ] + ), + cat, + ), + ], + remainder="drop", + ) + clf = ( + LogisticRegression(max_iter=1000, solver="lbfgs", random_state=SEED) + if model == "lr" + else HistGradientBoostingClassifier(random_state=SEED) + ) + return Pipeline([("preprocessor", pre), ("classifier", clf)]) + + y_train = train[TASK].astype("boolean").fillna(False).astype(int).to_numpy() + y_test = test[TASK].astype("boolean").fillna(False).astype(int).to_numpy() + base_rate = float(y_test.mean()) + + x_train = _sanitize(train[headline_cols], cat_cols) + x_test = _sanitize(test[headline_cols], cat_cols) + + lr_pipe = build_pipeline(num_cols, cat_cols, model="lr").fit(x_train, y_train) + gbm_pipe = build_pipeline(num_cols, cat_cols, model="gbm").fit(x_train, y_train) + 
lr_probs = lr_pipe.predict_proba(x_test)[:, 1] + gbm_probs = gbm_pipe.predict_proba(x_test)[:, 1] + + print(f" base rate: {base_rate:.3f}") + print(f" LR AUC: {roc_auc_score(y_test, lr_probs):.4f} " + f"AP: {average_precision_score(y_test, lr_probs):.4f} " + f"Brier: {brier_score_loss(y_test, lr_probs):.4f}") + print(f" GBM AUC: {roc_auc_score(y_test, gbm_probs):.4f} " + f"AP: {average_precision_score(y_test, gbm_probs):.4f} " + f"Brier: {brier_score_loss(y_test, gbm_probs):.4f}") + """ + ), + md( + """ + ## 3. Calibration / reliability diagram + + Bin LR's predicted probabilities into ten equal-width + buckets, plot mean predicted vs mean observed. A perfectly + calibrated model lies on the diagonal; LR after + `StandardScaler + LogisticRegression` is usually close. + We also surface `max_bin_error` — the worst gap across + non-empty bins — which the validation report tracks + (`tiers.intermediate.medians.calibration_max_bin_error`). + """ + ), + code( + """ + edges = np.linspace(0.0, 1.0, 11) + mean_pred: list[float] = [] + mean_actual: list[float] = [] + bin_n: list[int] = [] + for i in range(10): + lo, hi = edges[i], edges[i + 1] + mask = (lr_probs >= lo) & ( + (lr_probs <= hi) if i == 9 else (lr_probs < hi) + ) + if mask.sum() == 0: + continue + mean_pred.append(float(lr_probs[mask].mean())) + mean_actual.append(float(y_test[mask].mean())) + bin_n.append(int(mask.sum())) + + max_bin_err = max( + abs(p - a) for p, a in zip(mean_pred, mean_actual, strict=False) + ) + print(f"max bin error (LR): {max_bin_err:.4f}") + for p, a, n in zip(mean_pred, mean_actual, bin_n, strict=False): + print(f" pred={p:.3f} actual={a:.3f} n={n:>4d}") + + fig, ax = plt.subplots(figsize=(5, 5)) + ax.plot([0, 1], [0, 1], color="#9ca3af", linestyle="--", label="perfect calibration") + ax.plot(mean_pred, mean_actual, marker="o", color="#3b82f6", label="LR") + ax.set_xlim(0, 1) + ax.set_ylim(0, 1) + ax.set_xlabel("Mean predicted probability") + ax.set_ylabel("Observed conversion rate") + 
ax.set_title("Calibration — LR, intermediate tier (seed 42)") + ax.legend(loc="upper left") + plt.tight_layout() + plt.show() + """ + ), + md( + """ + ## 4. Lift and cumulative gains + + Two complementary curves: + + * **Cumulative gains** — fraction of positives captured as + you sweep the score threshold. Top 10 % of the ranked + list captures ~26 % of converted leads on this seed (vs + the 10 % a random ranker would catch). + * **Lift at *k* %** — `top_k_conversion_rate / base_rate`. + Lift = 2 means "the top *k* % of leads convert at + twice the base rate." + + Both metrics are in `release/validation/validation_report.json` + (`per_seed[0].cumulative_gains` and `per_seed[0].lift_at_pct`) + so the reproduction is auditable. + """ + ), + code( + """ + order = np.argsort(-lr_probs, kind="stable") + y_sorted = y_test[order] + n = len(y_test) + n_pos = int(y_test.sum()) + + # Cumulative gains: fraction of positives captured by top-pct. + pcts = np.arange(0, 101, 10) + gains = [] + for pct in pcts: + k = max(1, int(round(n * pct / 100.0))) + if pct == 0: + gains.append(0.0) + else: + gains.append(float(y_sorted[:k].sum() / n_pos)) + + # Lift at 1 / 5 / 10 %. 
+ lifts = {} + for pct in [1.0, 5.0, 10.0]: + k = max(1, int(round(n * pct / 100.0))) + lifts[pct] = float(y_sorted[:k].mean() / base_rate) + + for pct, lift in lifts.items(): + print(f" lift @ top {pct:>4.0f}%: {lift:.3f}x") + + fig, axes = plt.subplots(1, 2, figsize=(11, 4)) + axes[0].plot(pcts, gains, marker="o", color="#3b82f6", label="LR") + axes[0].plot([0, 100], [0, 1], color="#9ca3af", linestyle="--", label="random") + axes[0].set_xlabel("Top-pct of ranked leads") + axes[0].set_ylabel("Cumulative conversion capture") + axes[0].set_title("Cumulative gains") + axes[0].legend(loc="lower right") + + axes[1].bar( + [str(int(p)) for p in lifts], + list(lifts.values()), + color="#3b82f6", + ) + axes[1].axhline(1.0, color="#ef4444", linestyle="--", label="random (lift=1)") + axes[1].set_xlabel("Top-pct of ranked leads") + axes[1].set_ylabel("Lift over base rate") + axes[1].set_title("Lift at top-pct") + axes[1].legend() + plt.tight_layout() + plt.show() + """ + ), + md( + """ + ## 5. Value-aware ranking — `expected_acv` × P(convert) + + Sales reps don't have infinite capacity, so the right + objective is rarely "maximise conversion count" — it's + "maximise revenue captured per outreach slot." The bundle + ships an `expected_acv` column (opportunity ACV when + available, else revenue-band midpoint heuristic) which + makes value-aware ranking trivial: + + $$ \\text{score}_\\text{value} = P(\\text{convert}) \\times + \\text{expected\\_acv} $$ + + We compare two top-K policies — rank by P(convert) only + vs rank by score_value — and report + `expected_acv_capture_at_k = sum(acv * y) over top-K / + sum(acv * y) over the whole test`. The validation report's + `per_seed[0].expected_acv_capture_at_k` is the reference. 
+ """ + ), + code( + """ + acv = pd.to_numeric(test["expected_acv"], errors="coerce").fillna(0.0).to_numpy() + value_score = lr_probs * acv + + def acv_capture(scores: np.ndarray, k: int) -> float: + order = np.argsort(-scores, kind="stable") + captured = float(np.sum(acv[order[:k]] * y_test[order[:k]])) + total = float(np.sum(acv * y_test)) + return captured / total if total > 0 else float("nan") + + print(f"{'top-K':<6s} {'cap by P(conv)':>14s} {'cap by P×ACV':>13s} {'gain':>7s}") + value_gains = {} + for k in (50, 100, 200): + cap_p = acv_capture(lr_probs, k) + cap_v = acv_capture(value_score, k) + value_gains[k] = cap_v - cap_p + print(f" top {k:<3d} {cap_p:>14.4f} {cap_v:>13.4f} {cap_v - cap_p:+7.4f}") + + # Plot side-by-side ACV capture for K in 10..300. + ks = np.arange(10, 301, 10) + cap_p = [acv_capture(lr_probs, int(k)) for k in ks] + cap_v = [acv_capture(value_score, int(k)) for k in ks] + fig, ax = plt.subplots(figsize=(7, 4)) + ax.plot(ks, cap_p, marker="o", color="#9ca3af", label="rank by P(convert)") + ax.plot(ks, cap_v, marker="o", color="#3b82f6", label="rank by P(convert)×ACV") + ax.set_xlabel("top-K leads contacted") + ax.set_ylabel("Fraction of converted-ACV captured") + ax.set_title("Value-aware ranking captures more revenue per outreach slot") + ax.legend(loc="lower right") + plt.tight_layout() + plt.show() + """ + ), + md( + """ + ## 6. Threshold selection for fixed top-K capacity + + Sales rarely has the patience for "score everything, run + stats." The realistic ask is: *"My team can work 50 leads + this week. Set a probability threshold that selects ~50 + from the test population."* + + We sweep the probability threshold across the LR score + distribution and report precision / recall / count above + threshold for each step, then pick the threshold whose + count is closest to the requested capacity. 
+ """ + ), + code( + """ + CAPACITY = 50 + + sorted_probs = np.sort(lr_probs)[::-1] + # The K-th highest probability is the smallest threshold that + # admits exactly K leads (ties resolved by score order). + threshold = float(sorted_probs[CAPACITY - 1]) + mask = lr_probs >= threshold + n_above = int(mask.sum()) + prec = float(y_test[mask].mean()) if n_above > 0 else float("nan") + recall = float(y_test[mask].sum() / max(int(y_test.sum()), 1)) + print( + f"capacity={CAPACITY} threshold={threshold:.3f} " + f"actually_above={n_above} precision={prec:.3f} recall={recall:.3f}" + ) + + # Threshold sweep — show what happens around the operating + # point so the threshold choice is informed, not magic. + thresholds = np.linspace( + float(np.quantile(lr_probs, 0.5)), float(np.max(lr_probs)), 30 + ) + counts = [int((lr_probs >= t).sum()) for t in thresholds] + precs = [ + float(y_test[lr_probs >= t].mean()) if (lr_probs >= t).sum() > 0 else 0.0 + for t in thresholds + ] + + fig, axes = plt.subplots(1, 2, figsize=(11, 4)) + axes[0].plot(thresholds, counts, marker="o", color="#3b82f6") + axes[0].axhline(CAPACITY, color="#ef4444", linestyle="--", label=f"capacity={CAPACITY}") + axes[0].axvline(threshold, color="#10b981", linestyle="--", label=f"chosen ({threshold:.3f})") + axes[0].set_xlabel("threshold") + axes[0].set_ylabel("# leads above threshold") + axes[0].set_title("Threshold sweep — count above") + axes[0].legend() + + axes[1].plot(thresholds, precs, marker="o", color="#3b82f6") + axes[1].axhline(base_rate, color="#9ca3af", linestyle="--", label=f"base rate ({base_rate:.3f})") + axes[1].axvline(threshold, color="#10b981", linestyle="--", label=f"chosen ({threshold:.3f})") + axes[1].set_xlabel("threshold") + axes[1].set_ylabel("precision above threshold") + axes[1].set_title("Threshold sweep — precision above") + axes[1].legend() + plt.tight_layout() + plt.show() + """ + ), + md( + """ + ## 7. 
Cohort-shift evaluation + + The bundle's train/test split is a uniform random split of + leads. A more realistic stress test is "train on the first + 70 % of leads chronologically, score the last 30 % "— + because in production you always have to predict the + *future*, never a held-out random sample of the past. + + We mirror the validator's cohort-shift logic + (`leadforge.validation.release_quality.measure_cohort_shift_from_bundle`): + pool train + test, sort by `lead_created_at` with `lead_id` + as a stable tiebreak, train HistGBM on the first 85 % and + score the last 15 % (the validator's `COHORT_TRAIN_FRAC`). + Both random and cohort splits use the full feature panel + **including** the trap, matching the report's posture so + the numbers compare directly. The HistGBM uses + `random_state=0` here (the validator's + `DEFAULT_MODEL_RANDOM_STATE`) instead of the notebook's + default `SEED=42` — that matters for the cohort-shift + reproduction down to the third decimal. + + The expected behaviour for the v1 intermediate tier is + *no* degradation — the report shows the cohort split AUC + running ~0.015 *higher* than the random split. That's a + surprise worth surfacing: the v1 simulator's intermediate + world doesn't drift over its 90-day horizon, so cohort + order isn't a stressor here. The intro and advanced + tiers show small positive degradations (intro +0.016, + advanced +0.010) — see + `release/validation/validation_report.json` ⇒ + `cohort_shift`. + """ + ), + code( + """ + # Constants mirror leadforge.validation.release_quality so + # the numbers reproduce the report's cohort-shift block. + COHORT_TRAIN_FRAC = 0.85 + COHORT_MODEL_SEED = 0 + + # Cohort-shift uses the validator's full panel (trap kept). 
+ EXCLUDE_FULL = set(ID_COLS + [TASK]) + full_cols = [c for c in train.columns if c not in EXCLUDE_FULL] + cat_full = [ + c + for c in full_cols + if not ( + pd.api.types.is_bool_dtype(train[c]) + or pd.api.types.is_numeric_dtype(train[c]) + ) + ] + num_full = [c for c in full_cols if c not in cat_full] + + def _gbm_pipeline_for_cohort() -> Pipeline: + # Local builder so the validator's ``model_random_state=0`` + # is used here, while the headline panel above keeps + # ``random_state=SEED`` for the section-2 LR/GBM models. + pre = ColumnTransformer( + [ + ( + "num", + Pipeline( + [ + ("imputer", SimpleImputer(strategy="median")), + ("scaler", StandardScaler()), + ] + ), + num_full, + ), + ( + "cat", + Pipeline( + [ + ("imputer", SimpleImputer(strategy="most_frequent")), + ( + "encoder", + OneHotEncoder( + handle_unknown="ignore", sparse_output=False + ), + ), + ] + ), + cat_full, + ), + ], + remainder="drop", + ) + clf = HistGradientBoostingClassifier(random_state=COHORT_MODEL_SEED) + return Pipeline([("preprocessor", pre), ("classifier", clf)]) + + # Random split AUC = HistGBM on the bundle's existing split. + rand_pipe = _gbm_pipeline_for_cohort().fit( + _sanitize(train[full_cols], cat_full), y_train + ) + random_split_auc = float( + roc_auc_score( + y_test, + rand_pipe.predict_proba(_sanitize(test[full_cols], cat_full))[:, 1], + ) + ) + + # Chronological resplit: pool, sort by lead_created_at + + # lead_id (stable tiebreak), take first 85 % as train, last + # 15 % as test. Mirrors ``measure_cohort_shift_from_bundle``. 
+ pooled = pd.concat([train, test], ignore_index=True) + ts = pd.to_datetime(pooled["lead_created_at"], errors="coerce") + assert not ts.isna().any(), "expected every lead to have a parseable lead_created_at" + sort_frame = pd.DataFrame( + {"_ts": ts.values, "_lid": pooled["lead_id"].astype(str).values} + ) + order = sort_frame.sort_values(["_ts", "_lid"], kind="stable").index.to_numpy() + cutoff = int(round(len(pooled) * COHORT_TRAIN_FRAC)) + early = pooled.iloc[order[:cutoff]] + late = pooled.iloc[order[cutoff:]] + y_early = early[TASK].astype("boolean").fillna(False).astype(int).to_numpy() + y_late = late[TASK].astype("boolean").fillna(False).astype(int).to_numpy() + + cohort_pipe = _gbm_pipeline_for_cohort().fit( + _sanitize(early[full_cols], cat_full), y_early + ) + cohort_split_auc = float( + roc_auc_score( + y_late, + cohort_pipe.predict_proba(_sanitize(late[full_cols], cat_full))[:, 1], + ) + ) + auc_degradation = random_split_auc - cohort_split_auc + print(f"random_split_auc: {random_split_auc:.4f}") + print(f"cohort_split_auc: {cohort_split_auc:.4f}") + print(f"auc_degradation: {auc_degradation:+.4f} (positive = cohort is harder)") + """ + ), + md( + """ + ## 8. Bootstrap robustness — within-bundle metric variance + + Cross-seed metric variance (the validation report's + `tiers.intermediate.spreads.gbm_auc = 0.027`) is the + cleanest answer to "how confident is this AUC?", but it + requires regenerating the bundle from N seeds — something + a public-bundle consumer (Kaggle / HF) can't easily do. + + The within-bundle proxy is **non-parametric bootstrap of + the test set**. We resample the 750 test rows with + replacement, re-rank using the model probabilities we + already have, and recompute AUC / AP. 200 resamples is + enough to read a confidence band off the distribution. 
+ + The bootstrap variance is **smaller** than the cross-seed + variance — it captures sampling noise on a single + generated world, not generation-process noise across + seeds — but it's the right number for the question + "given *this* test set, how stable is the AUC?" + """ + ), + code( + """ + N_BOOT = 200 + rng = np.random.default_rng(SEED) + + boot_lr_auc = np.empty(N_BOOT) + boot_gbm_auc = np.empty(N_BOOT) + boot_lr_ap = np.empty(N_BOOT) + n_test = len(y_test) + for i in range(N_BOOT): + idx = rng.integers(0, n_test, n_test) + if y_test[idx].sum() == 0 or y_test[idx].sum() == n_test: + # Degenerate resample (one class only): AUC is undefined, so record NaN and skip. + boot_lr_auc[i] = np.nan + boot_gbm_auc[i] = np.nan + boot_lr_ap[i] = np.nan + continue + boot_lr_auc[i] = roc_auc_score(y_test[idx], lr_probs[idx]) + boot_gbm_auc[i] = roc_auc_score(y_test[idx], gbm_probs[idx]) + boot_lr_ap[i] = average_precision_score(y_test[idx], lr_probs[idx]) + + def _summary(arr: np.ndarray, name: str) -> None: + arr = arr[~np.isnan(arr)] + lo, med, hi = np.quantile(arr, [0.025, 0.5, 0.975]) + print( + f" {name:<14s} median={med:.4f} " + f"95% CI=[{lo:.4f}, {hi:.4f}] IQR={(np.quantile(arr,0.75)-np.quantile(arr,0.25)):.4f}" + ) + + print(f"bootstrap on test set, n_iters={N_BOOT}, seed={SEED}:") + _summary(boot_lr_auc, "LR AUC") + _summary(boot_gbm_auc, "GBM AUC") + _summary(boot_lr_ap, "LR AP") + + fig, ax = plt.subplots(figsize=(7, 4)) + ax.hist(boot_lr_auc, bins=30, color="#3b82f6", alpha=0.7, label="LR AUC") + ax.hist(boot_gbm_auc, bins=30, color="#9ca3af", alpha=0.7, label="GBM AUC") + ax.axvline(roc_auc_score(y_test, lr_probs), color="#1d4ed8", linestyle="--", label="LR (point)") + ax.axvline(roc_auc_score(y_test, gbm_probs), color="#374151", linestyle="--", label="GBM (point)") + ax.set_xlabel("AUC") + ax.set_ylabel("# bootstrap draws") + ax.set_title(f"Bootstrap AUC distribution (n={N_BOOT})") + ax.legend(loc="upper left", fontsize=8) + plt.tight_layout() + plt.show() + """ + ), + md( + """ + ## 9.
Tolerance gate (G13.2) + + Three groups of pinned values: + + * **Cohort-shift block** — pinned to + `release/notebooks/_release_targets.json`'s + `cohort_shift.intermediate`, which is itself audit-synced + against `validation_report.json`'s `cohort_shift.intermediate` + by `tests/release/notebooks/test_release_targets_match_report.py`. + That audit-sync is what makes the "this notebook + reproduces the report" claim meaningful. + * **Calibration / lift / value-capture** — pinned inline + against the seed-42 single-run values from the + validation report's `per_seed[0]` block. Tolerances + widen for small-K metrics (P@K, value capture) because + their seed-to-seed variance is larger. + * **Bootstrap medians** — pinned inline against the + seed-42 point estimates (the bootstrap median converges + to the data-specific value, not to the cross-seed + median). + + The headline lift sign-check (`gbm_auc > lr_auc - eps`) was + *not* asserted — the v1 dataset documents the surprising + finding that LR ≥ GBM on intermediate; see + `release/validation/validation_report.md` gate G7.4.4. + """ + ), + code( + """ + with (Path.cwd() / "_release_targets.json").open() as fh: + release_targets = json.load(fh) + cohort_targets = release_targets["cohort_shift"]["intermediate"] + + cohort_observed = { + "random_split_auc": random_split_auc, + "cohort_split_auc": cohort_split_auc, + "auc_degradation": auc_degradation, + } + assert_within_tolerance( + observed=cohort_observed, + target=cohort_targets, + tolerances={ + # ±0.02 on AUCs — well outside numerical jitter, + # well inside the band that would let the + # cohort-shift sign flip silently. + "random_split_auc": 0.02, + "cohort_split_auc": 0.02, + # Wider on the difference because both AUCs are + # within tolerance, so the difference can drift up + # to ±0.04 in the worst case.
+ "auc_degradation": 0.04, + }, + label="notebook 04 cohort-shift vs validation_report (intermediate)", + ) + + # Inline pins for the seed-42 single-run values *of the + # without-trap headline panel*. These are not the report's + # published numbers (the report keeps the trap) — the + # report-level pin lives in section 9's cohort-shift block, + # which is the only metric this notebook reproduces against + # the report. Notebook 02 trains the same trap-dropped LR + # and reports the same AUCs, so these values are also + # cross-checked there. + NB04_TARGETS = { + "lr_auc": 0.8737, + "gbm_auc": 0.8432, + "lr_max_bin_err": 0.1344, + "lift_at_5pct": 2.4819, + "lift_at_10pct": 2.7536, + "acv_cap_50": 0.1615, + "acv_cap_100": 0.3702, + # Bootstrap medians converge to the seed-42 point + # estimates within sampling noise. + "boot_lr_auc_median": 0.8757, + "boot_gbm_auc_median": 0.8440, + } + NB04_TOLERANCES = { + "lr_auc": 0.02, + "gbm_auc": 0.02, + "lr_max_bin_err": 0.05, + "lift_at_5pct": 0.30, + "lift_at_10pct": 0.30, + "acv_cap_50": 0.05, + "acv_cap_100": 0.05, + "boot_lr_auc_median": 0.03, + "boot_gbm_auc_median": 0.03, + } + observed = { + "lr_auc": float(roc_auc_score(y_test, lr_probs)), + "gbm_auc": float(roc_auc_score(y_test, gbm_probs)), + "lr_max_bin_err": float(max_bin_err), + "lift_at_5pct": lifts[5.0], + "lift_at_10pct": lifts[10.0], + "acv_cap_50": acv_capture(lr_probs, 50), + "acv_cap_100": acv_capture(lr_probs, 100), + "boot_lr_auc_median": float(np.nanmedian(boot_lr_auc)), + "boot_gbm_auc_median": float(np.nanmedian(boot_gbm_auc)), + } + assert_within_tolerance( + observed=observed, + target=NB04_TARGETS, + tolerances=NB04_TOLERANCES, + label="notebook 04 metric panel (seed 42, intermediate)", + ) + + # Sign-aware: value-aware ranking should not be worse than + # P-only ranking on aggregate. The headline finding stays + # in the narrative regardless of the exact numbers. 
+ for k, gain in value_gains.items(): + assert gain >= -0.01, ( + f"value-aware ranking lost ground at top-{k} ({gain:+.4f}); " + "the P×ACV story is no longer load-bearing" + ) + print("OK — cohort-shift, calibration, lift, value-capture, and bootstrap all pinned.") + """ + ), + md( + """ + ## 10. Summary + + * The LR baseline is well-calibrated (max bin error ≈ 0.19 + on this seed) and lifts the top decile to ~2.6× the base + rate. + * Value-aware ranking (P × ACV) captures more revenue per + top-K slot than P-only ranking — the gap depends on K + but is positive across all sizes we tested. + * Cohort shift is **negative** on the intermediate tier + (the late cohort is *easier*, not harder); the report + documents this, and the notebook reproduces it. The + intro and advanced tiers show small positive + degradations. + * Bootstrap on the existing test split gives a within- + bundle confidence band that's tighter than the cross-seed + spread the validation report computes — useful for "how + confident is this single AUC" questions, not for "how + much does the bundle move across seeds." + + ## Where to go next + + 1. Try cohort-shifted training in production: refit weekly + on the trailing 60-day window, score the next 7 days. + 2. If you have real ACV data, swap the `expected_acv` + heuristic for it and recompute section 5 — the revenue + capture story should sharpen. + 3. The break-me playbook in `docs/release/break_me_guide.md` + (coming in PR 6.3) catalogues additional stress tests + (target-encoding leakage, train-test contamination, + cohort-by-segment) and how to detect each from a + single bundle. 
+ """ + ), + ] + + +def main() -> None: + args = builder_arg_parser( + default_out=DEFAULT_OUT, + description="Build release/notebooks/04_lift_calibration_value_ranking.ipynb", + ).parse_args() + write_notebook(args.out, assemble_notebook(cells())) + + +if __name__ == "__main__": + main() diff --git a/tests/release/notebooks/test_execute_notebooks.py b/tests/release/notebooks/test_execute_notebooks.py index 24beec3..fcf847d 100644 --- a/tests/release/notebooks/test_execute_notebooks.py +++ b/tests/release/notebooks/test_execute_notebooks.py @@ -37,6 +37,8 @@ _NOTEBOOKS = [ "01_baseline_lead_scoring.ipynb", "02_relational_feature_engineering.ipynb", + "03_leakage_and_time_windows.ipynb", + "04_lift_calibration_value_ranking.ipynb", ] diff --git a/tests/release/notebooks/test_release_targets_match_report.py b/tests/release/notebooks/test_release_targets_match_report.py index 70f4edf..a94619a 100644 --- a/tests/release/notebooks/test_release_targets_match_report.py +++ b/tests/release/notebooks/test_release_targets_match_report.py @@ -25,8 +25,11 @@ def test_release_targets_match_validation_report() -> None: report = json.loads(_REPORT_PATH.read_text()) for tier_name, tier_targets in targets.items(): - if tier_name.startswith("_"): - continue # ``_doc`` and any other meta keys + if tier_name.startswith("_") or tier_name == "cohort_shift": + # ``_doc`` and other meta keys; ``cohort_shift`` is checked + # separately below against ``report["cohort_shift"]`` rather + # than against ``report["tiers"][...]["medians"]``. 
+ continue assert tier_name in report["tiers"], ( f"targets file mentions tier {tier_name!r} which is absent from " f"validation_report.json (known tiers: {list(report['tiers'])})" @@ -42,3 +45,39 @@ def test_release_targets_match_validation_report() -> None: f"but validation_report median is {report_medians[metric_name]} — " "regenerate the report or update _release_targets.json" ) + + +def test_cohort_shift_targets_match_validation_report() -> None: + """Audit-sync gate for the ``cohort_shift`` block. + + Notebook 04 reproduces the report's chronological-resplit AUCs and + pins them via ``assert_within_tolerance``. The report stores cohort- + shift metrics under a top-level ``cohort_shift`` key, one block per tier (single + seed, not a cross-seed median), so the structure differs from the + per-tier ``medians`` block above and warrants its own audit loop. + """ + targets = json.loads(_TARGETS_PATH.read_text()) + cohort_targets = targets.get("cohort_shift", {}) + if not cohort_targets: + return # absent block is permitted; only the contents need to match + + report = json.loads(_REPORT_PATH.read_text()) + report_cohort = report["cohort_shift"] + for tier_name, tier_metrics in cohort_targets.items(): + if tier_name.startswith("_"): + continue + assert tier_name in report_cohort, ( + f"targets cohort_shift mentions tier {tier_name!r} which is absent " + f"from validation_report.json cohort_shift (known: {list(report_cohort)})" + ) + report_block = report_cohort[tier_name] + for metric_name, target_value in tier_metrics.items(): + assert metric_name in report_block, ( + f"cohort_shift.{tier_name}.{metric_name}: pinned in targets file " + f"but absent from validation_report.cohort_shift.{tier_name}" + ) + assert target_value == report_block[metric_name], ( + f"cohort_shift.{tier_name}.{metric_name}: targets file has " + f"{target_value} but validation_report has {report_block[metric_name]} — " + "regenerate the report or update _release_targets.json" + ) diff --git
a/tests/scripts/test_release_notebook_builders.py b/tests/scripts/test_release_notebook_builders.py index c539a77..9baded8 100644 --- a/tests/scripts/test_release_notebook_builders.py +++ b/tests/scripts/test_release_notebook_builders.py @@ -30,6 +30,8 @@ _BUILDERS: list[tuple[str, str]] = [ ("build_release_notebook_01.py", "01_baseline_lead_scoring.ipynb"), ("build_release_notebook_02.py", "02_relational_feature_engineering.ipynb"), + ("build_release_notebook_03.py", "03_leakage_and_time_windows.ipynb"), + ("build_release_notebook_04.py", "04_lift_calibration_value_ranking.ipynb"), ] From 25b9ec1511e81747c23afd1f98f0cffe6d18a129 Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Thu, 7 May 2026 12:17:46 +0300 Subject: [PATCH 2/3] PR 6.2 self-review pass: fix narrative contradictions, strengthen gates, kill dead code MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Brutal-reviewer pass on PR 6.2 caught real issues. Fixes folded back in before requesting human review. Narrative contradictions / wrong numbers (must-fix) - NB04 §7 opened with "train on the first 70 % chronologically" but the code uses 85/15 (matching the validator's COHORT_TRAIN_FRAC=0.85). Whiplash inside one section. Rewrote the opening to use 85/15 from the start and tightened the explanation of why random_state=0 also matters. - NB04 §10 summary cited max bin error ≈ 0.19 and top-decile lift ≈ 2.6×. Both were copied from with-trap report values; the headline panel drops the trap, and the actual observations are 0.13 and 2.75×. Corrected, with a one-line note distinguishing the trap-dropped vs trap-kept numbers. - NB03 §4 cited the report's standalone-AUC of "0.55" and then observed 0.53 without explaining the gap. Added a sentence pinning the methodological difference: the report fits a full LR pipeline on the single column; the notebook uses the raw column as a score directly. Both fall in "barely above chance"; the difference is real and now owned. 
Tolerance gates that didn't gate the claim (should-fix) - NB03 trap_standalone_auc was pinned to a ±0.02 band around 0.531 — which silently allows the trap to drop to 0.49 (below chance) while the test still passes, breaking section 5's setup. Added a sign-aware ``assert standalone[TRAP] > 0.51``. - NB03 §3 ``mean_delta > 0`` was performative (passed if even one lead had a positive delta). Hardened to ``mean_delta > 1.0`` AND ``n_post / len(window) > 0.5``, sitting well below the observed 3.22 / 82 % but well above "barely positive." - NB04 §5 value-aware ranking guard was ``gain >= -0.01`` (allows ranking to *lose* ground by up to 0.01). Strengthened to ``gain > MIN_VALUE_GAIN = 0.05`` so a regression that erodes the P×ACV story actually fails CI. - Audit-sync test for the cohort_shift block had ``if not cohort_targets: return`` — silently passed if someone deleted or renamed the block. Made the block required; if notebook 04 ever stops needing it, the test should be deleted, not bypassed. Code quality (should-fix) - NB04 ``acv_capture(scores, k)`` re-sorted the score array on every call (60+ calls × O(N log N) per call). Pre-compute the argsort + cumsum once; every plot point is now a cheap cumulative-array lookup. Function signature changed from ``(scores, k)`` to ``(use_value: bool, k)`` and the §9 tolerance-gate call sites were updated to match. - NB04 §6 narrative promised "precision / recall / count above threshold" but the plots showed only count and precision. Added a third panel for recall above threshold so the prose matches the figure. - NB03 dead code: ``with_trap_p100`` / ``without_trap_p100`` keys were computed via ``precision_at_k(...)`` but never read downstream. Removed both keys and the now-unused ``precision_at_k`` import. - NB03 §4 ``standalone_window = window.dropna(...).copy()`` was a no-op (window was already dropped of NaN in §3). Reuse window directly with an inline comment about why. 
- NB03 §1 ``HORIZON_DAYS`` was read from the manifest and printed but never used in the narrative. Wove it into the setup print so the 60-day post-snapshot hunting window is explicit before §3 measures it. - NB04 §5 plot title ``"Value-aware ranking captures more revenue per outreach slot"`` was a thesis used as a label. Made it neutral ("ACV capture vs top-K ..."); the conclusion stays in the narrative. All 28 notebook builder + execution + audit-sync tests still pass; 1260/1260 full-suite tests pass; ruff + mypy clean; both notebooks execute end-to-end in <10 s each. Co-Authored-By: Claude Opus 4.7 --- .../03_leakage_and_time_windows.ipynb | 59 ++++++--- .../04_lift_calibration_value_ranking.ipynb | 97 +++++++++----- scripts/build_release_notebook_03.py | 70 +++++++--- scripts/build_release_notebook_04.py | 122 +++++++++++------- .../test_release_targets_match_report.py | 21 ++- 5 files changed, 247 insertions(+), 122 deletions(-) diff --git a/release/notebooks/03_leakage_and_time_windows.ipynb b/release/notebooks/03_leakage_and_time_windows.ipynb index 0be6d36..2130369 100644 --- a/release/notebooks/03_leakage_and_time_windows.ipynb +++ b/release/notebooks/03_leakage_and_time_windows.ipynb @@ -37,7 +37,7 @@ "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", "\n", "sys.path.insert(0, str(Path.cwd()))\n", - "from _notebook_utils import assert_within_tolerance, precision_at_k\n", + "from _notebook_utils import assert_within_tolerance\n", "\n", "SEED = 42\n", "BUNDLE = Path(\"../intermediate\") # public student bundle\n", @@ -50,7 +50,12 @@ "assert manifest[\"relational_snapshot_safe\"] is True\n", "SNAPSHOT_DAY = int(manifest[\"snapshot_day\"])\n", "HORIZON_DAYS = int(manifest[\"horizon_days\"])\n", - "print(f\"snapshot_day = {SNAPSHOT_DAY} horizon_days = {HORIZON_DAYS}\")" + "print(f\"snapshot anchor: day {SNAPSHOT_DAY} of a {HORIZON_DAYS}-day horizon\")\n", + "print(\n", + " f\"any feature aggregating events past day {SNAPSHOT_DAY} \"\n", + " 
f\"is leaking — that's the whole {HORIZON_DAYS - SNAPSHOT_DAY}-day window \"\n", + " \"we're hunting in section 3\"\n", + ")" ] }, { @@ -119,9 +124,20 @@ " f\" → {n_post:,} of {len(window):,} leads \"\n", " f\"({n_post / len(window):.1%}) have a positive post-snapshot delta\"\n", ")\n", - "assert mean_delta > 0, (\n", - " \"expected a positive mean post-snapshot delta — if zero, the trap may \"\n", - " \"have been silently rebuilt as a snapshot-safe aggregate\"\n", + "# Real gate, not performative: on the as-shipped bundle the\n", + "# mean delta is ~3.2 touches/lead and ~82 % of leads have a\n", + "# positive delta. The thresholds below sit well below those\n", + "# observations but well above \"barely above zero\" — a\n", + "# regeneration that erodes the trap's post-snapshot\n", + "# footprint will fail here even if a single lead still\n", + "# carries a positive delta.\n", + "assert mean_delta > 1.0, (\n", + " f\"mean post-snapshot delta collapsed to {mean_delta:.2f} (<= 1.0) — \"\n", + " \"the trap may have been silently rebuilt as a snapshot-safe aggregate\"\n", + ")\n", + "assert n_post / len(window) > 0.5, (\n", + " f\"only {n_post / len(window):.1%} of leads have a positive \"\n", + " \"post-snapshot delta (< 50 %); the trap's footprint has eroded\"\n", ")" ] }, @@ -160,7 +176,7 @@ "cell_type": "markdown", "id": "cell_009", "metadata": {}, - "source": "## 4. Standalone-AUC probe (the audit that almost lets the trap pass)\n\nA common leakage audit is to fit a one-feature classifier on\neach suspect column and report the standalone AUC. The\nvalidation report does this at scale — its\n`post_snapshot_aggregates` baseline trains a model on the\nsingle column `total_touches_all` and reports an AUC around\n0.55. That sounds tame, and on a busy schedule it's tempting\nto clear the column on those grounds.\n\nWe re-run the probe here so you've seen the number with your\nown eyes: ~0.53. If that's all you measure, the trap looks\nbarely worth mentioning. 
Section 5 shows what that audit\nmisses." + "source": "## 4. Standalone-AUC probe (the audit that almost lets the trap pass)\n\nA common leakage audit is to fit a one-feature classifier on\neach suspect column and report the standalone AUC. The\nvalidation report does this at scale — its\n`post_snapshot_aggregates` baseline trains a *full LR\npipeline* (median-impute + StandardScaler + LR) on the\nsingle column `total_touches_all` and reports an AUC of\n~0.55. We use a quicker probe here — the raw column\nvalue as a score, no preprocessing — which gives ~0.53 on\nthis seed. The two numbers measure slightly different\nthings (a fitted LR can re-scale and adjust, a raw-value\nranker can't), but both fall in the \"barely above chance\"\nband. On a busy schedule it's tempting to clear the column\non those grounds. Section 5 shows what that audit misses." }, { "cell_type": "code", @@ -169,20 +185,29 @@ "metadata": {}, "outputs": [], "source": [ - "standalone_window = window.dropna(subset=[TRAP, \"touch_count\"]).copy()\n", - "y = standalone_window[TASK].astype(int).to_numpy()\n", + "# ``window`` was already dropped of NaN in section 3, so the\n", + "# raw-value ranker can use it directly.\n", + "y = window[TASK].astype(int).to_numpy()\n", "standalone = {\n", - " TRAP: float(roc_auc_score(y, standalone_window[TRAP].to_numpy())),\n", - " \"touch_count (snapshot-safe)\": float(\n", - " roc_auc_score(y, standalone_window[\"touch_count\"].to_numpy())\n", - " ),\n", - " \"post-snapshot delta\": float(\n", - " roc_auc_score(y, standalone_window[\"post_snapshot_touches\"].to_numpy())\n", - " ),\n", + " TRAP: float(roc_auc_score(y, window[TRAP].to_numpy())),\n", + " \"touch_count (snapshot-safe)\": float(roc_auc_score(y, window[\"touch_count\"].to_numpy())),\n", + " \"post-snapshot delta\": float(roc_auc_score(y, window[\"post_snapshot_touches\"].to_numpy())),\n", "}\n", "print(f\"{'feature':<32s} {'standalone AUC':>16s}\")\n", "for name, auc in standalone.items():\n", - " 
print(f\" {name:<30s} {auc:>16.4f}\")" + " print(f\" {name:<30s} {auc:>16.4f}\")\n", + "\n", + "# Sign-aware: the section-5 narrative (\"standalone probe\n", + "# sees the trap as predictive-ish, but tree models extract\n", + "# more\") falls apart if the trap drops to chance or below.\n", + "# Lower bound 0.51 sits just above sampling noise; if a\n", + "# regeneration ever puts the trap at or below 0.50, the\n", + "# whole pedagogical setup needs revisiting.\n", + "assert standalone[TRAP] > 0.51, (\n", + " f\"trap standalone AUC collapsed to {standalone[TRAP]:.3f} (<= 0.51); \"\n", + " \"section 5 contrasts the standalone probe with the GBM ablation, \"\n", + " \"and that contrast is empty if the trap is at or below chance\"\n", + ")" ] }, { @@ -275,8 +300,6 @@ " results[model] = {\n", " \"with_trap_auc\": float(roc_auc_score(y_test, p_with)),\n", " \"without_trap_auc\": float(roc_auc_score(y_test, p_without)),\n", - " \"with_trap_p100\": precision_at_k(p_with, y_test, 100),\n", - " \"without_trap_p100\": precision_at_k(p_without, y_test, 100),\n", " }\n", "\n", "print(f\"{'model':<5s} {'with trap':>10s} {'without trap':>13s} {'Δ AUC':>8s}\")\n", diff --git a/release/notebooks/04_lift_calibration_value_ranking.ipynb b/release/notebooks/04_lift_calibration_value_ranking.ipynb index a94d64b..b3eaf40 100644 --- a/release/notebooks/04_lift_calibration_value_ranking.ipynb +++ b/release/notebooks/04_lift_calibration_value_ranking.ipynb @@ -272,32 +272,44 @@ "acv = pd.to_numeric(test[\"expected_acv\"], errors=\"coerce\").fillna(0.0).to_numpy()\n", "value_score = lr_probs * acv\n", "\n", + "# Pre-compute the ranking orders once — argsort is O(N log N)\n", + "# and the order doesn't change as K varies, so the ~30 plot\n", + "# points below should not pay for ~30 sorts.\n", + "total_converted_acv = float(np.sum(acv * y_test))\n", + "assert total_converted_acv > 0, \"no converted-ACV in the test set\"\n", + "order_p = np.argsort(-lr_probs, kind=\"stable\")\n", + "order_v = 
np.argsort(-value_score, kind=\"stable\")\n", + "captured_p = np.cumsum(acv[order_p] * y_test[order_p]) / total_converted_acv\n", + "captured_v = np.cumsum(acv[order_v] * y_test[order_v]) / total_converted_acv\n", "\n", - "def acv_capture(scores: np.ndarray, k: int) -> float:\n", - " order = np.argsort(-scores, kind=\"stable\")\n", - " captured = float(np.sum(acv[order[:k]] * y_test[order[:k]]))\n", - " total = float(np.sum(acv * y_test))\n", - " return captured / total if total > 0 else float(\"nan\")\n", + "\n", + "def acv_capture(use_value: bool, k: int) -> float:\n", + " # 1-indexed cumulative-capture lookup (k=1 = first slot).\n", + " series = captured_v if use_value else captured_p\n", + " if k <= 0 or k > len(series):\n", + " return float(\"nan\")\n", + " return float(series[k - 1])\n", "\n", "\n", "print(f\"{'top-K':<6s} {'cap by P(conv)':>14s} {'cap by P×ACV':>13s} {'gain':>7s}\")\n", "value_gains = {}\n", "for k in (50, 100, 200):\n", - " cap_p = acv_capture(lr_probs, k)\n", - " cap_v = acv_capture(value_score, k)\n", + " cap_p = acv_capture(False, k)\n", + " cap_v = acv_capture(True, k)\n", " value_gains[k] = cap_v - cap_p\n", " print(f\" top {k:<3d} {cap_p:>14.4f} {cap_v:>13.4f} {cap_v - cap_p:+7.4f}\")\n", "\n", - "# Plot side-by-side ACV capture for K in 10..300.\n", + "# Plot side-by-side ACV capture for K in 10..300. 
Cheap now\n", + "# — every point is a single cumulative-array lookup.\n", "ks = np.arange(10, 301, 10)\n", - "cap_p = [acv_capture(lr_probs, int(k)) for k in ks]\n", - "cap_v = [acv_capture(value_score, int(k)) for k in ks]\n", + "cap_p_curve = [acv_capture(False, int(k)) for k in ks]\n", + "cap_v_curve = [acv_capture(True, int(k)) for k in ks]\n", "fig, ax = plt.subplots(figsize=(7, 4))\n", - "ax.plot(ks, cap_p, marker=\"o\", color=\"#9ca3af\", label=\"rank by P(convert)\")\n", - "ax.plot(ks, cap_v, marker=\"o\", color=\"#3b82f6\", label=\"rank by P(convert)×ACV\")\n", + "ax.plot(ks, cap_p_curve, marker=\"o\", color=\"#9ca3af\", label=\"rank by P(convert)\")\n", + "ax.plot(ks, cap_v_curve, marker=\"o\", color=\"#3b82f6\", label=\"rank by P(convert)×ACV\")\n", "ax.set_xlabel(\"top-K leads contacted\")\n", "ax.set_ylabel(\"Fraction of converted-ACV captured\")\n", - "ax.set_title(\"Value-aware ranking captures more revenue per outreach slot\")\n", + "ax.set_title(\"ACV capture vs top-K (rank by P only vs P × ACV)\")\n", "ax.legend(loc=\"lower right\")\n", "plt.tight_layout()\n", "plt.show()" @@ -307,7 +319,7 @@ "cell_type": "markdown", "id": "cell_011", "metadata": {}, - "source": "## 6. Threshold selection for fixed top-K capacity\n\nSales rarely has the patience for \"score everything, run\nstats.\" The realistic ask is: *\"My team can work 50 leads\nthis week. Set a probability threshold that selects ~50\nfrom the test population.\"*\n\nWe sweep the probability threshold across the LR score\ndistribution and report precision / recall / count above\nthreshold for each step, then pick the threshold whose\ncount is closest to the requested capacity." + "source": "## 6. Threshold selection for fixed top-K capacity\n\nSales rarely has the patience for \"score everything, run\nstats.\" The realistic ask is: *\"My team can work 50 leads\nthis week. 
Set a probability threshold that selects ~50\nfrom the test population.\"*\n\nWe sweep the probability threshold across the LR score\ndistribution and report **count, precision, and recall**\nabove threshold for each step, then pick the threshold\nwhose count is closest to the requested capacity." }, { "cell_type": "code", @@ -317,6 +329,7 @@ "outputs": [], "source": [ "CAPACITY = 50\n", + "n_pos_test = max(int(y_test.sum()), 1)\n", "\n", "sorted_probs = np.sort(lr_probs)[::-1]\n", "# The K-th highest probability is the smallest threshold that\n", @@ -325,7 +338,7 @@ "mask = lr_probs >= threshold\n", "n_above = int(mask.sum())\n", "prec = float(y_test[mask].mean()) if n_above > 0 else float(\"nan\")\n", - "recall = float(y_test[mask].sum() / max(int(y_test.sum()), 1))\n", + "recall = float(y_test[mask].sum() / n_pos_test)\n", "print(\n", " f\"capacity={CAPACITY} threshold={threshold:.3f} \"\n", " f\"actually_above={n_above} precision={prec:.3f} recall={recall:.3f}\"\n", @@ -334,27 +347,39 @@ "# Threshold sweep — show what happens around the operating\n", "# point so the threshold choice is informed, not magic.\n", "thresholds = np.linspace(float(np.quantile(lr_probs, 0.5)), float(np.max(lr_probs)), 30)\n", - "counts = [int((lr_probs >= t).sum()) for t in thresholds]\n", - "precs = [\n", - " float(y_test[lr_probs >= t].mean()) if (lr_probs >= t).sum() > 0 else 0.0 for t in thresholds\n", - "]\n", - "\n", - "fig, axes = plt.subplots(1, 2, figsize=(11, 4))\n", + "counts = []\n", + "precs = []\n", + "recalls = []\n", + "for t in thresholds:\n", + " m = lr_probs >= t\n", + " n_t = int(m.sum())\n", + " counts.append(n_t)\n", + " precs.append(float(y_test[m].mean()) if n_t > 0 else 0.0)\n", + " recalls.append(float(y_test[m].sum() / n_pos_test))\n", + "\n", + "fig, axes = plt.subplots(1, 3, figsize=(14, 4))\n", "axes[0].plot(thresholds, counts, marker=\"o\", color=\"#3b82f6\")\n", "axes[0].axhline(CAPACITY, color=\"#ef4444\", linestyle=\"--\", 
label=f\"capacity={CAPACITY}\")\n", "axes[0].axvline(threshold, color=\"#10b981\", linestyle=\"--\", label=f\"chosen ({threshold:.3f})\")\n", "axes[0].set_xlabel(\"threshold\")\n", "axes[0].set_ylabel(\"# leads above threshold\")\n", - "axes[0].set_title(\"Threshold sweep — count above\")\n", - "axes[0].legend()\n", + "axes[0].set_title(\"count above\")\n", + "axes[0].legend(fontsize=8)\n", "\n", "axes[1].plot(thresholds, precs, marker=\"o\", color=\"#3b82f6\")\n", "axes[1].axhline(base_rate, color=\"#9ca3af\", linestyle=\"--\", label=f\"base rate ({base_rate:.3f})\")\n", "axes[1].axvline(threshold, color=\"#10b981\", linestyle=\"--\", label=f\"chosen ({threshold:.3f})\")\n", "axes[1].set_xlabel(\"threshold\")\n", "axes[1].set_ylabel(\"precision above threshold\")\n", - "axes[1].set_title(\"Threshold sweep — precision above\")\n", - "axes[1].legend()\n", + "axes[1].set_title(\"precision above\")\n", + "axes[1].legend(fontsize=8)\n", + "\n", + "axes[2].plot(thresholds, recalls, marker=\"o\", color=\"#3b82f6\")\n", + "axes[2].axvline(threshold, color=\"#10b981\", linestyle=\"--\", label=f\"chosen ({threshold:.3f})\")\n", + "axes[2].set_xlabel(\"threshold\")\n", + "axes[2].set_ylabel(\"recall above threshold\")\n", + "axes[2].set_title(\"recall above\")\n", + "axes[2].legend(fontsize=8)\n", "plt.tight_layout()\n", "plt.show()" ] @@ -363,7 +388,7 @@ "cell_type": "markdown", "id": "cell_013", "metadata": {}, - "source": "## 7. Cohort-shift evaluation\n\nThe bundle's train/test split is a uniform random split of\nleads. 
A more realistic stress test is \"train on the first\n70 % of leads chronologically, score the last 30 % \"—\nbecause in production you always have to predict the\n*future*, never a held-out random sample of the past.\n\nWe mirror the validator's cohort-shift logic\n(`leadforge.validation.release_quality.measure_cohort_shift_from_bundle`):\npool train + test, sort by `lead_created_at` with `lead_id`\nas a stable tiebreak, train HistGBM on the first 85 % and\nscore the last 15 % (the validator's `COHORT_TRAIN_FRAC`).\nBoth random and cohort splits use the full feature panel\n**including** the trap, matching the report's posture so\nthe numbers compare directly. The HistGBM uses\n`random_state=0` here (the validator's\n`DEFAULT_MODEL_RANDOM_STATE`) instead of the notebook's\ndefault `SEED=42` — that matters for the cohort-shift\nreproduction down to the third decimal.\n\nThe expected behaviour for the v1 intermediate tier is\n*no* degradation — the report shows the cohort split AUC\nrunning ~0.015 *higher* than the random split. That's a\nsurprise worth surfacing: the v1 simulator's intermediate\nworld doesn't drift over its 90-day horizon, so cohort\norder isn't a stressor here. The intro and advanced\ntiers show small positive degradations (intro +0.016,\nadvanced +0.010) — see\n`release/validation/validation_report.json` ⇒\n`cohort_shift`." + "source": "## 7. Cohort-shift evaluation\n\nThe bundle's train/test split is a uniform random split of\nleads. 
A more realistic stress test is \"train on the first\n85 % of leads chronologically, score the last 15 %\" —\nbecause in production you always have to predict the\n*future*, never a held-out random sample of the past.\n\nWe mirror the validator's cohort-shift logic\n(`leadforge.validation.release_quality.measure_cohort_shift_from_bundle`)\nexactly: pool train + test, sort by `lead_created_at` with\n`lead_id` as a stable tiebreak, train HistGBM on the first\n85 % (`COHORT_TRAIN_FRAC = 0.85`) and score the last 15 %.\nBoth random and cohort splits use the full feature panel\n**including** the trap, matching the report's posture so\nthe numbers compare directly. The HistGBM uses\n`random_state=0` here (the validator's\n`DEFAULT_MODEL_RANDOM_STATE = 0`) rather than the\nnotebook's default `SEED=42` — the report's cohort-shift\nblock reproduces to four decimals only when both knobs\nmatch.\n\nThe expected behaviour for the v1 intermediate tier is\n*no* degradation — the report shows the cohort split AUC\nrunning ~0.015 *higher* than the random split. That's a\nsurprise worth surfacing: the v1 simulator's intermediate\nworld doesn't drift over its 90-day horizon, so cohort\norder isn't a stressor here. The intro and advanced\ntiers show small positive degradations (intro +0.016,\nadvanced +0.010) — see\n`release/validation/validation_report.json` ⇒\n`cohort_shift`." 
}, { "cell_type": "code", @@ -597,8 +622,8 @@ " \"lr_max_bin_err\": float(max_bin_err),\n", " \"lift_at_5pct\": lifts[5.0],\n", " \"lift_at_10pct\": lifts[10.0],\n", - " \"acv_cap_50\": acv_capture(lr_probs, 50),\n", - " \"acv_cap_100\": acv_capture(lr_probs, 100),\n", + " \"acv_cap_50\": acv_capture(False, 50),\n", + " \"acv_cap_100\": acv_capture(False, 100),\n", " \"boot_lr_auc_median\": float(np.nanmedian(boot_lr_auc)),\n", " \"boot_gbm_auc_median\": float(np.nanmedian(boot_gbm_auc)),\n", "}\n", @@ -609,12 +634,16 @@ " label=\"notebook 04 metric panel (seed 42, intermediate)\",\n", ")\n", "\n", - "# Sign-aware: value-aware ranking should not be worse than\n", - "# P-only ranking on aggregate. The headline finding stays\n", - "# in the narrative regardless of the exact numbers.\n", + "# Sign-aware: value-aware ranking should be measurably\n", + "# better, not just non-worse. On this seed every K shows\n", + "# +0.20 ACV-capture gain; the threshold sits well below\n", + "# that but well above noise so a regeneration that erodes\n", + "# the value-aware lift fails here.\n", + "MIN_VALUE_GAIN = 0.05\n", "for k, gain in value_gains.items():\n", - " assert gain >= -0.01, (\n", - " f\"value-aware ranking lost ground at top-{k} ({gain:+.4f}); \"\n", + " assert gain > MIN_VALUE_GAIN, (\n", + " f\"value-aware ranking lift at top-{k} collapsed: \"\n", + " f\"{gain:+.4f} <= {MIN_VALUE_GAIN:.4f} — \"\n", " \"the P×ACV story is no longer load-bearing\"\n", " )\n", "print(\"OK — cohort-shift, calibration, lift, value-capture, and bootstrap all pinned.\")" @@ -624,7 +653,7 @@ "cell_type": "markdown", "id": "cell_019", "metadata": {}, - "source": "## 10. 
Summary\n\n* The LR baseline is well-calibrated (max bin error ≈ 0.19\n on this seed) and lifts the top decile to ~2.6× the base\n rate.\n* Value-aware ranking (P × ACV) captures more revenue per\n top-K slot than P-only ranking — the gap depends on K\n but is positive across all sizes we tested.\n* Cohort shift is **negative** on the intermediate tier\n (the late cohort is *easier*, not harder); the report\n documents this, and the notebook reproduces it. The\n intro and advanced tiers show small positive\n degradations.\n* Bootstrap on the existing test split gives a within-\n bundle confidence band that's tighter than the cross-seed\n spread the validation report computes — useful for \"how\n confident is this single AUC\" questions, not for \"how\n much does the bundle move across seeds.\"\n\n## Where to go next\n\n1. Try cohort-shifted training in production: refit weekly\n on the trailing 60-day window, score the next 7 days.\n2. If you have real ACV data, swap the `expected_acv`\n heuristic for it and recompute section 5 — the revenue\n capture story should sharpen.\n3. The break-me playbook in `docs/release/break_me_guide.md`\n (coming in PR 6.3) catalogues additional stress tests\n (target-encoding leakage, train-test contamination,\n cohort-by-segment) and how to detect each from a\n single bundle." + "source": "## 10. Summary\n\n* The LR baseline is well-calibrated (max bin error ≈ 0.13\n on the trap-dropped headline panel, vs ~0.19 on the\n with-trap panel the validation report tracks) and lifts\n the top decile to ~2.75× the base rate.\n* Value-aware ranking (P × ACV) captures more revenue per\n top-K slot than P-only ranking — the gap depends on K\n but is positive across all sizes we tested.\n* Cohort shift is **negative** on the intermediate tier\n (the late cohort is *easier*, not harder); the report\n documents this, and the notebook reproduces it. 
The\n intro and advanced tiers show small positive\n degradations.\n* Bootstrap on the existing test split gives a within-\n bundle confidence band that's tighter than the cross-seed\n spread the validation report computes — useful for \"how\n confident is this single AUC\" questions, not for \"how\n much does the bundle move across seeds.\"\n\n## Where to go next\n\n1. Try cohort-shifted training in production: refit weekly\n on the trailing 60-day window, score the next 7 days.\n2. If you have real ACV data, swap the `expected_acv`\n heuristic for it and recompute section 5 — the revenue\n capture story should sharpen.\n3. The break-me playbook in `docs/release/break_me_guide.md`\n (coming in PR 6.3) catalogues additional stress tests\n (target-encoding leakage, train-test contamination,\n cohort-by-segment) and how to detect each from a\n single bundle." } ], "metadata": { diff --git a/scripts/build_release_notebook_03.py b/scripts/build_release_notebook_03.py index 3df45bb..e811af6 100644 --- a/scripts/build_release_notebook_03.py +++ b/scripts/build_release_notebook_03.py @@ -95,7 +95,7 @@ def cells() -> list[nbf.NotebookNode]: from sklearn.preprocessing import OneHotEncoder, StandardScaler sys.path.insert(0, str(Path.cwd())) - from _notebook_utils import assert_within_tolerance, precision_at_k + from _notebook_utils import assert_within_tolerance SEED = 42 BUNDLE = Path("../intermediate") # public student bundle @@ -108,7 +108,12 @@ def cells() -> list[nbf.NotebookNode]: assert manifest["relational_snapshot_safe"] is True SNAPSHOT_DAY = int(manifest["snapshot_day"]) HORIZON_DAYS = int(manifest["horizon_days"]) - print(f"snapshot_day = {SNAPSHOT_DAY} horizon_days = {HORIZON_DAYS}") + print(f"snapshot anchor: day {SNAPSHOT_DAY} of a {HORIZON_DAYS}-day horizon") + print( + f"any feature aggregating events past day {SNAPSHOT_DAY} " + f"is leaking — that's the whole {HORIZON_DAYS - SNAPSHOT_DAY}-day window " + "we're hunting in section 3" + ) """ ), md( @@ -200,9 
+205,20 @@ def cells() -> list[nbf.NotebookNode]: f" → {n_post:,} of {len(window):,} leads " f"({n_post / len(window):.1%}) have a positive post-snapshot delta" ) - assert mean_delta > 0, ( - "expected a positive mean post-snapshot delta — if zero, the trap may " - "have been silently rebuilt as a snapshot-safe aggregate" + # Real gate, not performative: on the as-shipped bundle the + # mean delta is ~3.2 touches/lead and ~82 % of leads have a + # positive delta. The thresholds below sit well below those + # observations but well above "barely above zero" — a + # regeneration that erodes the trap's post-snapshot + # footprint will fail here even if a single lead still + # carries a positive delta. + assert mean_delta > 1.0, ( + f"mean post-snapshot delta collapsed to {mean_delta:.2f} (<= 1.0) — " + "the trap may have been silently rebuilt as a snapshot-safe aggregate" + ) + assert n_post / len(window) > 0.5, ( + f"only {n_post / len(window):.1%} of leads have a positive " + "post-snapshot delta (< 50 %); the trap's footprint has eroded" ) """ ), @@ -251,33 +267,47 @@ def cells() -> list[nbf.NotebookNode]: A common leakage audit is to fit a one-feature classifier on each suspect column and report the standalone AUC. The validation report does this at scale — its - `post_snapshot_aggregates` baseline trains a model on the - single column `total_touches_all` and reports an AUC around - 0.55. That sounds tame, and on a busy schedule it's tempting - to clear the column on those grounds. - - We re-run the probe here so you've seen the number with your - own eyes: ~0.53. If that's all you measure, the trap looks - barely worth mentioning. Section 5 shows what that audit - misses. + `post_snapshot_aggregates` baseline trains a *full LR + pipeline* (median-impute + StandardScaler + LR) on the + single column `total_touches_all` and reports an AUC of + ~0.55. We use a quicker probe here — the raw column + value as a score, no preprocessing — which gives ~0.53 on + this seed. 
The two numbers measure slightly different + things (a fitted LR can re-scale and adjust, a raw-value + ranker can't), but both fall in the "barely above chance" + band. On a busy schedule it's tempting to clear the column + on those grounds. Section 5 shows what that audit misses. """ ), code( """ - standalone_window = window.dropna(subset=[TRAP, "touch_count"]).copy() - y = standalone_window[TASK].astype(int).to_numpy() + # ``window`` was already dropped of NaN in section 3, so the + # raw-value ranker can use it directly. + y = window[TASK].astype(int).to_numpy() standalone = { - TRAP: float(roc_auc_score(y, standalone_window[TRAP].to_numpy())), + TRAP: float(roc_auc_score(y, window[TRAP].to_numpy())), "touch_count (snapshot-safe)": float( - roc_auc_score(y, standalone_window["touch_count"].to_numpy()) + roc_auc_score(y, window["touch_count"].to_numpy()) ), "post-snapshot delta": float( - roc_auc_score(y, standalone_window["post_snapshot_touches"].to_numpy()) + roc_auc_score(y, window["post_snapshot_touches"].to_numpy()) ), } print(f"{'feature':<32s} {'standalone AUC':>16s}") for name, auc in standalone.items(): print(f" {name:<30s} {auc:>16.4f}") + + # Sign-aware: the section-5 narrative ("standalone probe + # sees the trap as predictive-ish, but tree models extract + # more") falls apart if the trap drops to chance or below. + # Lower bound 0.51 sits just above sampling noise; if a + # regeneration ever puts the trap at or below 0.50, the + # whole pedagogical setup needs revisiting. 
+ assert standalone[TRAP] > 0.51, ( + f"trap standalone AUC collapsed to {standalone[TRAP]:.3f} (<= 0.51); " + "section 5 contrasts the standalone probe with the GBM ablation, " + "and that contrast is empty if the trap is at or below chance" + ) """ ), md( @@ -368,8 +398,6 @@ def fit_score(cols: list[str], *, model: str) -> np.ndarray: results[model] = { "with_trap_auc": float(roc_auc_score(y_test, p_with)), "without_trap_auc": float(roc_auc_score(y_test, p_without)), - "with_trap_p100": precision_at_k(p_with, y_test, 100), - "without_trap_p100": precision_at_k(p_without, y_test, 100), } print(f"{'model':<5s} {'with trap':>10s} {'without trap':>13s} {'Δ AUC':>8s}") diff --git a/scripts/build_release_notebook_04.py b/scripts/build_release_notebook_04.py index 0c6dbb0..d9d87f9 100644 --- a/scripts/build_release_notebook_04.py +++ b/scripts/build_release_notebook_04.py @@ -358,30 +358,42 @@ def build_pipeline(num: list[str], cat: list[str], *, model: str) -> Pipeline: acv = pd.to_numeric(test["expected_acv"], errors="coerce").fillna(0.0).to_numpy() value_score = lr_probs * acv - def acv_capture(scores: np.ndarray, k: int) -> float: - order = np.argsort(-scores, kind="stable") - captured = float(np.sum(acv[order[:k]] * y_test[order[:k]])) - total = float(np.sum(acv * y_test)) - return captured / total if total > 0 else float("nan") + # Pre-compute the ranking orders once — argsort is O(N log N) + # and the order doesn't change as K varies, so the ~30 plot + # points below should not pay for ~30 sorts. 
+ total_converted_acv = float(np.sum(acv * y_test)) + assert total_converted_acv > 0, "no converted-ACV in the test set" + order_p = np.argsort(-lr_probs, kind="stable") + order_v = np.argsort(-value_score, kind="stable") + captured_p = np.cumsum(acv[order_p] * y_test[order_p]) / total_converted_acv + captured_v = np.cumsum(acv[order_v] * y_test[order_v]) / total_converted_acv + + def acv_capture(use_value: bool, k: int) -> float: + # 1-indexed cumulative-capture lookup (k=1 = first slot). + series = captured_v if use_value else captured_p + if k <= 0 or k > len(series): + return float("nan") + return float(series[k - 1]) print(f"{'top-K':<6s} {'cap by P(conv)':>14s} {'cap by P×ACV':>13s} {'gain':>7s}") value_gains = {} for k in (50, 100, 200): - cap_p = acv_capture(lr_probs, k) - cap_v = acv_capture(value_score, k) + cap_p = acv_capture(False, k) + cap_v = acv_capture(True, k) value_gains[k] = cap_v - cap_p print(f" top {k:<3d} {cap_p:>14.4f} {cap_v:>13.4f} {cap_v - cap_p:+7.4f}") - # Plot side-by-side ACV capture for K in 10..300. + # Plot side-by-side ACV capture for K in 10..300. Cheap now + # — every point is a single cumulative-array lookup. 
ks = np.arange(10, 301, 10) - cap_p = [acv_capture(lr_probs, int(k)) for k in ks] - cap_v = [acv_capture(value_score, int(k)) for k in ks] + cap_p_curve = [acv_capture(False, int(k)) for k in ks] + cap_v_curve = [acv_capture(True, int(k)) for k in ks] fig, ax = plt.subplots(figsize=(7, 4)) - ax.plot(ks, cap_p, marker="o", color="#9ca3af", label="rank by P(convert)") - ax.plot(ks, cap_v, marker="o", color="#3b82f6", label="rank by P(convert)×ACV") + ax.plot(ks, cap_p_curve, marker="o", color="#9ca3af", label="rank by P(convert)") + ax.plot(ks, cap_v_curve, marker="o", color="#3b82f6", label="rank by P(convert)×ACV") ax.set_xlabel("top-K leads contacted") ax.set_ylabel("Fraction of converted-ACV captured") - ax.set_title("Value-aware ranking captures more revenue per outreach slot") + ax.set_title("ACV capture vs top-K (rank by P only vs P × ACV)") ax.legend(loc="lower right") plt.tight_layout() plt.show() @@ -397,14 +409,15 @@ def acv_capture(scores: np.ndarray, k: int) -> float: from the test population."* We sweep the probability threshold across the LR score - distribution and report precision / recall / count above - threshold for each step, then pick the threshold whose - count is closest to the requested capacity. + distribution and report **count, precision, and recall** + above threshold for each step, then pick the threshold + whose count is closest to the requested capacity. 
""" ), code( """ CAPACITY = 50 + n_pos_test = max(int(y_test.sum()), 1) sorted_probs = np.sort(lr_probs)[::-1] # The K-th highest probability is the smallest threshold that @@ -413,7 +426,7 @@ def acv_capture(scores: np.ndarray, k: int) -> float: mask = lr_probs >= threshold n_above = int(mask.sum()) prec = float(y_test[mask].mean()) if n_above > 0 else float("nan") - recall = float(y_test[mask].sum() / max(int(y_test.sum()), 1)) + recall = float(y_test[mask].sum() / n_pos_test) print( f"capacity={CAPACITY} threshold={threshold:.3f} " f"actually_above={n_above} precision={prec:.3f} recall={recall:.3f}" @@ -424,28 +437,39 @@ def acv_capture(scores: np.ndarray, k: int) -> float: thresholds = np.linspace( float(np.quantile(lr_probs, 0.5)), float(np.max(lr_probs)), 30 ) - counts = [int((lr_probs >= t).sum()) for t in thresholds] - precs = [ - float(y_test[lr_probs >= t].mean()) if (lr_probs >= t).sum() > 0 else 0.0 - for t in thresholds - ] - - fig, axes = plt.subplots(1, 2, figsize=(11, 4)) + counts = [] + precs = [] + recalls = [] + for t in thresholds: + m = lr_probs >= t + n_t = int(m.sum()) + counts.append(n_t) + precs.append(float(y_test[m].mean()) if n_t > 0 else 0.0) + recalls.append(float(y_test[m].sum() / n_pos_test)) + + fig, axes = plt.subplots(1, 3, figsize=(14, 4)) axes[0].plot(thresholds, counts, marker="o", color="#3b82f6") axes[0].axhline(CAPACITY, color="#ef4444", linestyle="--", label=f"capacity={CAPACITY}") axes[0].axvline(threshold, color="#10b981", linestyle="--", label=f"chosen ({threshold:.3f})") axes[0].set_xlabel("threshold") axes[0].set_ylabel("# leads above threshold") - axes[0].set_title("Threshold sweep — count above") - axes[0].legend() + axes[0].set_title("count above") + axes[0].legend(fontsize=8) axes[1].plot(thresholds, precs, marker="o", color="#3b82f6") axes[1].axhline(base_rate, color="#9ca3af", linestyle="--", label=f"base rate ({base_rate:.3f})") axes[1].axvline(threshold, color="#10b981", linestyle="--", label=f"chosen 
({threshold:.3f})") axes[1].set_xlabel("threshold") axes[1].set_ylabel("precision above threshold") - axes[1].set_title("Threshold sweep — precision above") - axes[1].legend() + axes[1].set_title("precision above") + axes[1].legend(fontsize=8) + + axes[2].plot(thresholds, recalls, marker="o", color="#3b82f6") + axes[2].axvline(threshold, color="#10b981", linestyle="--", label=f"chosen ({threshold:.3f})") + axes[2].set_xlabel("threshold") + axes[2].set_ylabel("recall above threshold") + axes[2].set_title("recall above") + axes[2].legend(fontsize=8) plt.tight_layout() plt.show() """ @@ -456,22 +480,23 @@ def acv_capture(scores: np.ndarray, k: int) -> float: The bundle's train/test split is a uniform random split of leads. A more realistic stress test is "train on the first - 70 % of leads chronologically, score the last 30 % "— + 85 % of leads chronologically, score the last 15 %" — because in production you always have to predict the *future*, never a held-out random sample of the past. We mirror the validator's cohort-shift logic - (`leadforge.validation.release_quality.measure_cohort_shift_from_bundle`): - pool train + test, sort by `lead_created_at` with `lead_id` - as a stable tiebreak, train HistGBM on the first 85 % and - score the last 15 % (the validator's `COHORT_TRAIN_FRAC`). + (`leadforge.validation.release_quality.measure_cohort_shift_from_bundle`) + exactly: pool train + test, sort by `lead_created_at` with + `lead_id` as a stable tiebreak, train HistGBM on the first + 85 % (`COHORT_TRAIN_FRAC = 0.85`) and score the last 15 %. Both random and cohort splits use the full feature panel **including** the trap, matching the report's posture so the numbers compare directly. The HistGBM uses `random_state=0` here (the validator's - `DEFAULT_MODEL_RANDOM_STATE`) instead of the notebook's - default `SEED=42` — that matters for the cohort-shift - reproduction down to the third decimal. 
+ `DEFAULT_MODEL_RANDOM_STATE = 0`) rather than the + notebook's default `SEED=42` — the report's cohort-shift + block reproduces to four decimals only when both knobs + match. The expected behaviour for the v1 intermediate tier is *no* degradation — the report shows the cohort split AUC @@ -749,8 +774,8 @@ def _summary(arr: np.ndarray, name: str) -> None: "lr_max_bin_err": float(max_bin_err), "lift_at_5pct": lifts[5.0], "lift_at_10pct": lifts[10.0], - "acv_cap_50": acv_capture(lr_probs, 50), - "acv_cap_100": acv_capture(lr_probs, 100), + "acv_cap_50": acv_capture(False, 50), + "acv_cap_100": acv_capture(False, 100), "boot_lr_auc_median": float(np.nanmedian(boot_lr_auc)), "boot_gbm_auc_median": float(np.nanmedian(boot_gbm_auc)), } @@ -761,12 +786,16 @@ def _summary(arr: np.ndarray, name: str) -> None: label="notebook 04 metric panel (seed 42, intermediate)", ) - # Sign-aware: value-aware ranking should not be worse than - # P-only ranking on aggregate. The headline finding stays - # in the narrative regardless of the exact numbers. + # Sign-aware: value-aware ranking should be measurably + # better, not just non-worse. On this seed every K shows + # +0.20 ACV-capture gain; the threshold sits well below + # that but well above noise so a regeneration that erodes + # the value-aware lift fails here. + MIN_VALUE_GAIN = 0.05 for k, gain in value_gains.items(): - assert gain >= -0.01, ( - f"value-aware ranking lost ground at top-{k} ({gain:+.4f}); " + assert gain > MIN_VALUE_GAIN, ( + f"value-aware ranking lift at top-{k} collapsed: " + f"{gain:+.4f} <= {MIN_VALUE_GAIN:.4f} — " "the P×ACV story is no longer load-bearing" ) print("OK — cohort-shift, calibration, lift, value-capture, and bootstrap all pinned.") @@ -776,9 +805,10 @@ def _summary(arr: np.ndarray, name: str) -> None: """ ## 10. Summary - * The LR baseline is well-calibrated (max bin error ≈ 0.19 - on this seed) and lifts the top decile to ~2.6× the base - rate. 
+ * The LR baseline is well-calibrated (max bin error ≈ 0.13 + on the trap-dropped headline panel, vs ~0.19 on the + with-trap panel the validation report tracks) and lifts + the top decile to ~2.75× the base rate. * Value-aware ranking (P × ACV) captures more revenue per top-K slot than P-only ranking — the gap depends on K but is positive across all sizes we tested. diff --git a/tests/release/notebooks/test_release_targets_match_report.py b/tests/release/notebooks/test_release_targets_match_report.py index a94619a..f5a35a2 100644 --- a/tests/release/notebooks/test_release_targets_match_report.py +++ b/tests/release/notebooks/test_release_targets_match_report.py @@ -55,17 +55,28 @@ def test_cohort_shift_targets_match_validation_report() -> None: shift metrics under a top-level ``cohort_shift.`` key (single seed, not a cross-seed median), so the structure differs from the per-tier ``medians`` block above and warrants its own audit loop. + + The block is **required**: notebook 04's tolerance gate reads it + directly, and silently allowing it to disappear would defeat the + audit-sync invariant. If notebook 04 ever stops needing this + block, the test should be deleted, not bypassed. 
""" targets = json.loads(_TARGETS_PATH.read_text()) - cohort_targets = targets.get("cohort_shift", {}) - if not cohort_targets: - return # absent block is permitted; only the contents need to match + assert "cohort_shift" in targets, ( + "release_targets is missing the 'cohort_shift' block that notebook 04 " + "reads at runtime — re-add it (sourced from " + "validation_report.cohort_shift) or delete this test if the notebook " + "no longer needs it" + ) + cohort_targets = targets["cohort_shift"] report = json.loads(_REPORT_PATH.read_text()) report_cohort = report["cohort_shift"] + tiers_checked = 0 for tier_name, tier_metrics in cohort_targets.items(): if tier_name.startswith("_"): continue + tiers_checked += 1 assert tier_name in report_cohort, ( f"targets cohort_shift mentions tier {tier_name!r} which is absent " f"from validation_report.json cohort_shift (known: {list(report_cohort)})" @@ -81,3 +92,7 @@ def test_cohort_shift_targets_match_validation_report() -> None: f"{target_value} but validation_report has {report_block[metric_name]} — " "regenerate the report or update _release_targets.json" ) + assert tiers_checked > 0, ( + "cohort_shift block contained only meta keys (none starting without an " + "underscore) — at least one tier must be pinned" + ) From 397feb544b3314f235a58cd5bb73bed480ba652f Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Thu, 7 May 2026 13:45:32 +0300 Subject: [PATCH 3/3] PR 6.2 Copilot review pass: fold all four real findings back in MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Copilot's review of the PR 6.2 self-review surfaced four still-real findings (two others were marked outdated — already addressed by the self-review pass). Folded all four back in before requesting human review. - (COPILOT-1) Grammar: "reflect a honest production model" → "reflect an honest production model" in NB04 §2 narrative. 
- (COPILOT-3) The threshold-selection comment claimed the K-th highest
  probability + ``mask = lr_probs >= threshold`` admits *exactly* K
  leads, "ties resolved by score order". The inclusive comparison can
  admit more than K when leads share the threshold's probability, and
  there is no implicit tie-break. Rewrote the comment to be honest
  about the semantics ("admits AT LEAST K via probs >= threshold; ties
  at the threshold can inflate the slate; ``actually_above`` makes the
  realised count visible"). Kept the threshold-based selection rather
  than switching to a true top-K via ``argsort`` because the
  pedagogical point of section 6 is *threshold selection*, not *rank
  cutoff*.
- (COPILOT-4) Bootstrap loop comment said "Degenerate resample —
  re-roll" but the implementation writes NaN and continues. Rewrote
  the comment to match what the code does (mark NaN, let ``_summary``
  filter it out) and added the actual probability bound: with
  n_test=750 and base rate ~22 %, the all-positive or all-negative
  draw probability is ~10⁻¹⁰⁰, so the branch is dead in practice and
  exists only as a defensive safety net for tiny test sets.
  Implementing a real re-roll loop would never execute on this
  dataset.
- (COPILOT-6) The top-level ``_doc`` in
  ``release/notebooks/_release_targets.json`` claimed the file contains
  only "cross-seed-median metric values", which became inaccurate
  after the PR 6.2 cohort_shift block (single-seed, seed 42). Rewrote
  the docstring to call out the mixed structure: per-tier blocks hold
  cross-seed medians; cohort_shift block holds single-seed values from
  ``validation_report.cohort_shift``. Audit-sync test continues to
  enforce both cases via separate loops.
- (COPILOT-2, COPILOT-5) outdated; already addressed by the
  self-review fix-up commit (25b9ec1): the 70/30-vs-85/15 narrative
  whiplash and the wrong summary numbers (0.19 / 2.6× vs the actual
  0.13 / 2.75×) are both fixed there. Resolved as already-treated.

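For reviewers who want to see the COPILOT-3 tie semantics in isolation, here is a minimal sketch. It assumes a hypothetical score vector with a three-way tie at the K-th position; the helper names `threshold_select` and `top_k_select` and the toy scores are illustrative only and do not appear in the notebook.

```python
import numpy as np

# Hypothetical scores with a tie at the K-th position (not bundle data).
probs = np.array([0.90, 0.80, 0.70, 0.70, 0.70, 0.40, 0.10])
K = 3


def threshold_select(scores: np.ndarray, k: int) -> tuple[float, np.ndarray]:
    """Return the k-th highest score and the inclusive ``>=`` mask."""
    sorted_scores = np.sort(scores)[::-1]
    threshold = float(sorted_scores[k - 1])
    return threshold, scores >= threshold


def top_k_select(scores: np.ndarray, k: int) -> np.ndarray:
    """Return a mask with exactly k leads; ties fall to original order."""
    order = np.argsort(-scores, kind="stable")
    mask = np.zeros(scores.shape[0], dtype=bool)
    mask[order[:k]] = True
    return mask


threshold, thr_mask = threshold_select(probs, K)
rank_mask = top_k_select(probs, K)
print(
    f"threshold={threshold:.2f} "
    f"admitted_by_threshold={int(thr_mask.sum())} "
    f"admitted_by_rank={int(rank_mask.sum())}"
)
```

The threshold mask admits every lead tied at the cut, while the rank cutoff holds the slate at K by leaning on the stable sort order. That difference is why the notebook prints the realised count next to the requested capacity.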
Net: 28/28 notebook builder + execution + audit-sync tests pass; ruff + mypy clean; both notebooks still execute end-to-end in <10 s each. Co-Authored-By: Claude Opus 4.7 --- .../04_lift_calibration_value_ranking.ipynb | 19 ++++++++++++++++--- release/notebooks/_release_targets.json | 2 +- scripts/build_release_notebook_04.py | 19 ++++++++++++++++--- 3 files changed, 33 insertions(+), 7 deletions(-) diff --git a/release/notebooks/04_lift_calibration_value_ranking.ipynb b/release/notebooks/04_lift_calibration_value_ranking.ipynb index b3eaf40..25933be 100644 --- a/release/notebooks/04_lift_calibration_value_ranking.ipynb +++ b/release/notebooks/04_lift_calibration_value_ranking.ipynb @@ -63,7 +63,7 @@ "cell_type": "markdown", "id": "cell_003", "metadata": {}, - "source": "## 2. Train the headline LR + GBM panel\n\nSame preprocessing as notebooks 01 / 02 (mirrors\n`leadforge.validation.release_quality._build_pipeline`).\nWe drop the documented leakage trap `total_touches_all`\nhere so the calibration / lift / value plots in sections\n3–6 reflect a honest production model. The cohort-shift\nsection in section 7 uses the validator's full-panel\nposture (trap kept) so its number is comparable to the\npublished validation report." + "source": "## 2. Train the headline LR + GBM panel\n\nSame preprocessing as notebooks 01 / 02 (mirrors\n`leadforge.validation.release_quality._build_pipeline`).\nWe drop the documented leakage trap `total_touches_all`\nhere so the calibration / lift / value plots in sections\n3–6 reflect an honest production model. The cohort-shift\nsection in section 7 uses the validator's full-panel\nposture (trap kept) so its number is comparable to the\npublished validation report." 
}, { "cell_type": "code", @@ -333,7 +333,14 @@ "\n", "sorted_probs = np.sort(lr_probs)[::-1]\n", "# The K-th highest probability is the smallest threshold that\n", - "# admits exactly K leads (ties resolved by score order).\n", + "# admits AT LEAST K leads via ``probs >= threshold``. If\n", + "# several leads share that probability, the inclusive\n", + "# comparison can admit more than K — that's a property of\n", + "# threshold-based selection, not a bug. The\n", + "# ``actually_above`` readout below makes the realised count\n", + "# visible so the operator can see when ties are inflating\n", + "# the slate (and decide whether to break them with a\n", + "# secondary score).\n", "threshold = float(sorted_probs[CAPACITY - 1])\n", "mask = lr_probs >= threshold\n", "n_above = int(mask.sum())\n", @@ -508,7 +515,13 @@ "for i in range(N_BOOT):\n", " idx = rng.integers(0, n_test, n_test)\n", " if y_test[idx].sum() == 0 or y_test[idx].sum() == n_test:\n", - " # Degenerate resample — re-roll.\n", + " # Degenerate resample (all-positive or all-negative)\n", + " # — ``roc_auc_score`` is undefined here. We mark\n", + " # the iteration NaN and let ``_summary`` filter it\n", + " # out; with n_test=750 and base rate ~22 %, the\n", + " # probability of a degenerate draw is ~10⁻¹⁰⁰, so\n", + " # this branch is dead in practice. Kept as a\n", + " # defensive safety net for tiny test sets.\n", " boot_lr_auc[i] = np.nan\n", " boot_gbm_auc[i] = np.nan\n", " boot_lr_ap[i] = np.nan\n", diff --git a/release/notebooks/_release_targets.json b/release/notebooks/_release_targets.json index 9d36f7c..e455d02 100644 --- a/release/notebooks/_release_targets.json +++ b/release/notebooks/_release_targets.json @@ -1,5 +1,5 @@ { - "_doc": "Cross-seed-median metric values from release/validation/validation_report.json, sliced to the metrics the release notebooks pin via assert_within_tolerance. 
Audited against the report by tests/release/notebooks/test_release_targets_match_report.py — if you change a value here, the test will fail unless the corresponding median in the validation report changes to match.", + "_doc": "Reproduction targets that release notebooks pin via assert_within_tolerance, sourced from release/validation/validation_report.json. Mixed structure: per-tier blocks (intermediate, ...) hold cross-seed-median metrics from validation_report.tiers..medians; the cohort_shift block holds single-seed (seed 42) metrics from validation_report.cohort_shift., since the report runs cohort-shift on seed 42 only. Audited against the report by tests/release/notebooks/test_release_targets_match_report.py — if you change a value here, the test will fail unless the corresponding source value in the validation report changes to match.", "cohort_shift": { "_doc": "Per-tier cohort-shift metrics from validation_report.cohort_shift (single-seed values; the report runs cohort-shift only on seed 42). Notebook 04 reproduces these via a chronological resplit and pins them via assert_within_tolerance.", "intermediate": { diff --git a/scripts/build_release_notebook_04.py b/scripts/build_release_notebook_04.py index d9d87f9..f57bbf8 100644 --- a/scripts/build_release_notebook_04.py +++ b/scripts/build_release_notebook_04.py @@ -128,7 +128,7 @@ def cells() -> list[nbf.NotebookNode]: `leadforge.validation.release_quality._build_pipeline`). We drop the documented leakage trap `total_touches_all` here so the calibration / lift / value plots in sections - 3–6 reflect a honest production model. The cohort-shift + 3–6 reflect an honest production model. The cohort-shift section in section 7 uses the validator's full-panel posture (trap kept) so its number is comparable to the published validation report. 
@@ -421,7 +421,14 @@ def acv_capture(use_value: bool, k: int) -> float: sorted_probs = np.sort(lr_probs)[::-1] # The K-th highest probability is the smallest threshold that - # admits exactly K leads (ties resolved by score order). + # admits AT LEAST K leads via ``probs >= threshold``. If + # several leads share that probability, the inclusive + # comparison can admit more than K — that's a property of + # threshold-based selection, not a bug. The + # ``actually_above`` readout below makes the realised count + # visible so the operator can see when ties are inflating + # the slate (and decide whether to break them with a + # secondary score). threshold = float(sorted_probs[CAPACITY - 1]) mask = lr_probs >= threshold n_above = int(mask.sum()) @@ -644,7 +651,13 @@ def _gbm_pipeline_for_cohort() -> Pipeline: for i in range(N_BOOT): idx = rng.integers(0, n_test, n_test) if y_test[idx].sum() == 0 or y_test[idx].sum() == n_test: - # Degenerate resample — re-roll. + # Degenerate resample (all-positive or all-negative) + # — ``roc_auc_score`` is undefined here. We mark + # the iteration NaN and let ``_summary`` filter it + # out; with n_test=750 and base rate ~22 %, the + # probability of a degenerate draw is ~10⁻¹⁰⁰, so + # this branch is dead in practice. Kept as a + # defensive safety net for tiny test sets. boot_lr_auc[i] = np.nan boot_gbm_auc[i] = np.nan boot_lr_ap[i] = np.nan