From 88c1c8f0bdaabbf2e9b52d897af0741c498f6c4e Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Thu, 7 May 2026 12:07:31 +0300
Subject: [PATCH 1/3] PR 6.2: notebooks 03 (leakage + time windows) + 04 (lift
 / calibration / value / cohort) + bootstrap robustness harness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds the second pair of release notebooks: a leakage walkthrough that
turns the documented `total_touches_all` trap into a teachable
contrast, and a value-aware ranking notebook that covers calibration,
lift, expected_acv-aware top-K policies, threshold selection, cohort
shift, and a within-bundle bootstrap.

Notebook 03 — leakage and time windows
  - Reads the trap label off `feature_dictionary.csv`; proves the
    trap by construction via a same-table comparison of
    `total_touches_all` (full-horizon) vs `touch_count`
    (snapshot-safe). Mean post-snapshot delta is 3.2 touches/lead;
    82 % of leads have a positive delta.
  - Standalone-AUC probe on the trap (~0.53, looks innocuous),
    then a full-panel ± trap ablation showing HistGBM extracts
    +0.032 AUC where LR squeezes out only +0.009. Pedagogical
    headline: *standalone AUC probes undersell tree-friendly
    leakage*.
  - Sign-aware tolerance gate pins each AUC ±0.02 and asserts
    `gbm_lift > 0.015` so a future regeneration that erases or
    amplifies the trap breaks CI.
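
A minimal sketch of the gate described above, for readers skimming the
log. It is an illustration, not the notebook's code: `check_gate` is a
hypothetical stand-in for `assert_within_tolerance` plus the lift
assertion, and it returns failures instead of raising. The target AUCs,
the ±0.02 band, and the 0.015 floor are the values pinned in notebook 03.

```python
import math

# Targets, band, and floor as pinned in notebook 03 (seed 42, intermediate).
TARGETS = {
    "lr_with_trap_auc": 0.8827,
    "lr_without_trap_auc": 0.8737,
    "gbm_with_trap_auc": 0.8754,
    "gbm_without_trap_auc": 0.8432,
}
TOLERANCE = 0.02
MIN_GBM_LIFT = 0.015


def check_gate(observed: dict) -> list:
    """Hypothetical helper: return gate failures (an empty list means pass)."""
    failures = []
    for name, target in TARGETS.items():
        value = observed.get(name, math.nan)
        # A missing or non-finite metric must fail rather than pass silently.
        if not math.isfinite(value) or abs(value - target) > TOLERANCE:
            failures.append(f"{name}: {value!r} outside {target} ± {TOLERANCE}")
    lift = observed.get("gbm_with_trap_auc", math.nan) - observed.get(
        "gbm_without_trap_auc", math.nan
    )
    # Sign-aware part: the with-trap minus without-trap lift must clear the
    # floor. `not lift > floor` also rejects NaN.
    if not lift > MIN_GBM_LIFT:
        failures.append(f"gbm_lift: {lift:+.4f} <= {MIN_GBM_LIFT}")
    return failures


# Trap erased: both GBM AUCs drift toward each other but stay in band.
erased = dict(TARGETS, gbm_with_trap_auc=0.8562, gbm_without_trap_auc=0.8532)
# Trap amplified: the with-trap AUC leaves its ±0.02 band.
amplified = dict(TARGETS, gbm_with_trap_auc=0.9000)

print(check_gate(dict(TARGETS)))
print(check_gate(erased))
print(check_gate(amplified))
```

The `erased` case is why the lift assertion exists: every AUC is still
inside its band, and only the sign-aware check fails.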

Notebook 04 — lift, calibration, value, cohort, bootstrap
  - Calibration / reliability diagram (LR max bin error ≈ 0.13).
  - Lift @ 1/5/10 % + cumulative gains curve.
  - `expected_acv × P(convert)` value-aware ranking: top-50
    ACV-capture jumps from 0.16 to 0.40 vs P-only ranking.
  - Threshold selection for fixed top-K capacity (e.g. 50/week).
  - Cohort-shift evaluation that **reproduces the validation
    report's `cohort_shift.intermediate` block exactly** (0.8754 /
    0.8908 / −0.0155). Mirrors
    `release_quality.measure_cohort_shift_from_bundle` down to
    `COHORT_TRAIN_FRAC=0.85` and `model_random_state=0`.
  - 200-iter bootstrap of the test set as the within-bundle
    confidence band, explicitly framed as a proxy for a true
    cross-seed sweep, since public-bundle consumers can't rebuild
    bundles without `leadforge` installed.
  - Headline LR/GBM panel drops the trap (matches notebook 02);
    cohort-shift section keeps the trap to reproduce the report.
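
To make the value-aware ranking bullet concrete, here is a toy sketch.
It rests on assumptions: the capture metric is taken to be "converted
ACV inside the top K divided by converted ACV overall", which may differ
from the notebook's exact definition; `acv_capture_at_k` is a
hypothetical helper; and the five leads are made-up numbers.

```python
def acv_capture_at_k(scores, expected_acv, converted, k):
    """Share of all converted ACV that lands in the top-k leads by score.

    Assumed definition for illustration; the notebook's metric may differ.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    captured = sum(expected_acv[i] for i in top if converted[i])
    total = sum(a for a, c in zip(expected_acv, converted) if c)
    return captured / total if total else 0.0


# Toy leads (invented): P(convert), expected ACV, observed outcome.
p_convert = [0.90, 0.80, 0.30, 0.25, 0.10]
expected_acv = [1_000, 2_000, 50_000, 40_000, 500]
converted = [1, 1, 1, 0, 0]

# Value-aware score: expected_acv × P(convert).
value_score = [p * a for p, a in zip(p_convert, expected_acv)]

p_only = acv_capture_at_k(p_convert, expected_acv, converted, k=2)
value_aware = acv_capture_at_k(value_score, expected_acv, converted, k=2)
print(f"P-only capture@2:      {p_only:.3f}")       # 0.057
print(f"value-aware capture@2: {value_aware:.3f}")  # 0.943
```

The toy numbers exaggerate the gap on purpose. On the real test split
the notebook reports 0.16 vs 0.40 at K=50.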

Audit-sync wiring
  - `release/notebooks/_release_targets.json` gains a
    `cohort_shift.intermediate` block sourced from
    `validation_report.cohort_shift.intermediate`.
  - `tests/release/notebooks/test_release_targets_match_report.py`
    adds `test_cohort_shift_targets_match_validation_report` to
    audit-sync the new block. Existing test now skips the
    `cohort_shift` key during the per-tier-medians loop.

Builders + tests
  - `scripts/build_release_notebook_{03,04}.py` inherit the
    deterministic-cell-ID + `--out` byte-stability pattern from
    PR 6.1.
  - Added to `_BUILDERS` / `_NOTEBOOKS` in
    `tests/scripts/test_release_notebook_builders.py` and
    `tests/release/notebooks/test_execute_notebooks.py`.
  - Both notebooks execute end-to-end in <10s each (well under
    G13.1's 3-min budget), assert
    `manifest.exposure_mode == "student_public"` (G13.3), and
    load only from `release/intermediate/`.

Forward-pointer to `docs/release/break_me_guide.md` left as plain
backtick-wrapped text; file lands in PR 6.3.

`.agent-plan.md` Phase 6 entry updated to mark PR 6.2 complete.

Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff +
mypy clean; leakage probes 0/3 on every tier; hash determinism
PASS 67/67; `validate_release_candidate --no-rebuild` exits 0;
`BUNDLE_SCHEMA_VERSION` unchanged at 5.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .agent-plan.md                                |   2 +-
 .../03_leakage_and_time_windows.ipynb         | 394 +++++++++
 .../04_lift_calibration_value_ranking.ipynb   | 643 ++++++++++++++
 release/notebooks/_release_targets.json       |   8 +
 scripts/build_release_notebook_03.py          | 542 ++++++++++++
 scripts/build_release_notebook_04.py          | 822 ++++++++++++++++++
 .../notebooks/test_execute_notebooks.py       |   2 +
 .../test_release_targets_match_report.py      |  43 +-
 .../scripts/test_release_notebook_builders.py |   2 +
 9 files changed, 2455 insertions(+), 3 deletions(-)
 create mode 100644 release/notebooks/03_leakage_and_time_windows.ipynb
 create mode 100644 release/notebooks/04_lift_calibration_value_ranking.ipynb
 create mode 100644 scripts/build_release_notebook_03.py
 create mode 100644 scripts/build_release_notebook_04.py

diff --git a/.agent-plan.md b/.agent-plan.md
index f715f35..99821f3 100644
--- a/.agent-plan.md
+++ b/.agent-plan.md
@@ -59,7 +59,7 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
 
 ### Phase 6 — Notebook sequence + adversarial framing
 - [x] PR 6.1: `release/notebooks/01_baseline_lead_scoring.ipynb` refreshed and `release/notebooks/02_relational_feature_engineering.ipynb` added.  Notebook 01 trains LR + HistGBM on the public `intermediate` bundle using the **same feature set as the validation report** (drops only IDs and the label, mirrors `release_quality._partition_columns`), so the G13.2 reproduction gate compares apples to apples.  This means notebook 01 **keeps** `total_touches_all` (the documented leakage trap) — narrative cell calls it out explicitly and forward-points to notebook 03 (PR 6.2) which dissects what dropping the trap does to performance.  Notebook 02 by contrast **drops** the trap from the flat baseline so the relational lift attribution stays clean (its goal is teaching feature engineering, not reproducing the report).  Targets are loaded at runtime from `release/notebooks/_release_targets.json` (audit-synced against `release/validation/validation_report.json` by `tests/release/notebooks/test_release_targets_match_report.py`); per-metric tolerances replace the original flat ±0.05 (AUC/Brier ±0.02, AP / top-decile ±0.05).  Notebook 02 loads the seven snapshot-safe public tables, asserts every event-table `timestamp <= lead_created_at + snapshot_day` inline (with real min-headroom-under-cutoff readings, not a hardcoded literal), demonstrates four legal joins (touch-channel breakdown, account-level density fit on **train leads only**, sales-activity recency, train-only industry target encoding), trains LR + GBM on flat-baseline-only and flat+relational features, prints a 4-row metric panel + delta panel, and pins the four model AUCs and the headline `GBM(eng) − GBM(flat)` lift via `assert_within_tolerance` (sign-aware `assert lift > 0` on top of the absolute tolerance).  Honest takeaway cell frames the +0.0147 AUC lift as suggestive, not conclusive (the cross-seed `gbm_auc` spread on this bundle is ~0.027); seed-sweep harness lands in PR 6.2's notebook 04.  
Both notebooks ship inside the public release bundle alongside the parquet tables (Kaggle/HF consumers download them together) so they import a sibling `release/notebooks/_notebook_utils.py` rather than rely on the `leadforge` package — `precision_at_k` and `top_decile_rate` mirror `release_quality._precision_at_k` / `_top_decile_rate` (locked in by mirror tests), and `assert_within_tolerance` is hardened against silent passes on non-finite metrics or incomplete per-metric tolerance maps.  G13.1 acceptance gate wired: new `[notebooks]` extra (`nbclient`, `nbformat`, `scikit-learn`, `matplotlib`) and a dedicated `notebooks` CI job that regenerates the intermediate bundle via `python scripts/build_public_release.py release --tier intermediate` (only tier the notebooks need) then nbclient-executes both notebooks end-to-end (`tests/release/notebooks/test_execute_notebooks.py`, parametrised, gated on bundles-present).  G13.3 path discipline enforced inline: notebook 01 hard-codes `BUNDLE = Path("../intermediate")` and asserts `manifest.exposure_mode == "student_public"`; notebook 02 explicitly excludes `customers`/`subscriptions` per `BANNED_TABLES`.  Builders (`scripts/build_release_notebook_{01,02}.py`, sharing `scripts/_release_notebook_common.py`) emit deterministic byte-for-byte notebook JSON via explicit `cell_NNN` IDs (audit-artifact-sync pattern from PR 4.1 / 5.1 / 5.2, locked in by `tests/scripts/test_release_notebook_builders.py` which builds twice into `tmp_path` via the new `--out PATH` flag and diffs against the committed file without ever touching the working tree) and shell out to `ruff format` on the emitted file so builder output and pre-commit hook agree.  Net: 1250/1250 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
-- [ ] `release/notebooks/{03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb`
+- [x] PR 6.2: `release/notebooks/03_leakage_and_time_windows.ipynb` and `release/notebooks/04_lift_calibration_value_ranking.ipynb` added.  Notebook 03 turns the documented `total_touches_all` trap into a teaching moment: reads the trap label off `feature_dictionary.csv`, proves the trap by construction via a same-table comparison of `total_touches_all` (full-horizon) vs `touch_count` (snapshot-safe), where the post-snapshot delta averages ~3.2 touches/lead and 82 % of leads have a positive delta, and then runs a standalone-AUC probe on the trap (~0.53 AUC, looks innocuous) followed by a side-by-side full-panel ± trap ablation that shows HistGBM extracts ~+0.032 AUC from the same column that yields LR only ~+0.009.  The reframed pedagogy (vs the prompt's original "trap dominates a thin firmographic set" framing) is empirically driven: firmographic-only is at chance AUC even with the trap, but the GBM-vs-LR asymmetry on the strong panel is a real and useful finding, namely that *standalone AUC probes undersell tree-friendly leakage*.  Sign-aware tolerance gate pins each AUC ±0.02 and asserts `gbm_lift > 0.015` so a future regeneration that erases the trap or accidentally amplifies it breaks CI.  
Notebook 04 covers the four extra ranking lenses AUC alone misses: calibration / reliability diagram (max bin error ≈ 0.13), lift + cumulative gains (top-decile lift 2.75×), value-aware ranking via `expected_acv × P(convert)` (top-50 ACV-capture jumps from 0.16 to 0.40), threshold selection for fixed top-K capacity, cohort-shift evaluation (HistGBM on the first 85 % chronologically → score the last 15 %, mirrors `release_quality.measure_cohort_shift_from_bundle` down to `COHORT_TRAIN_FRAC=0.85` and `model_random_state=0`, **reproduces the report's `cohort_shift.intermediate` block exactly**: 0.8754 / 0.8908 / −0.0155), and a 200-iter bootstrap of the test-set AUC/AP as the within-bundle confidence band that public-bundle consumers (Kaggle / HF) can run without `leadforge` installed (the prompt's "seed-sweep harness" with bootstrap honestly acknowledged as the proxy for true cross-seed sweep, since rebuilding bundles isn't an option for downstream users).  Cohort-shift values are pinned via a new `cohort_shift.intermediate` block in `release/notebooks/_release_targets.json` (audit-synced against `validation_report.cohort_shift.intermediate` by a new `test_cohort_shift_targets_match_validation_report` extension to the existing audit-sync test).  Headline LR/GBM panel **drops** `total_touches_all` (matches notebook 02's posture, gives honest production numbers); cohort-shift section deliberately **keeps** the trap to reproduce the report's published cohort-shift numbers exactly — divergent posture explained inline.  Both new builders (`scripts/build_release_notebook_{03,04}.py`) inherit the deterministic-cell-ID + `--out` byte-stability pattern from PR 6.1 and are added to `_BUILDERS` / `_NOTEBOOKS` in `tests/scripts/test_release_notebook_builders.py` and `tests/release/notebooks/test_execute_notebooks.py`.  
Both notebooks execute end-to-end in <10s each (well under G13.1's 3-min budget), assert `manifest.exposure_mode == "student_public"` (G13.3), and load only from `release/intermediate/`.  Forward-pointer to `docs/release/break_me_guide.md` left as plain backtick-wrapped text — file lands in PR 6.3, no dead Markdown link.  Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
 - [ ] `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml`
 - [ ] `docs/release/{break_me_guide,v2_decision_log}.md`
 
diff --git a/release/notebooks/03_leakage_and_time_windows.ipynb b/release/notebooks/03_leakage_and_time_windows.ipynb
new file mode 100644
index 0000000..0be6d36
--- /dev/null
+++ b/release/notebooks/03_leakage_and_time_windows.ipynb
@@ -0,0 +1,394 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "cell_000",
+   "metadata": {},
+   "source": "# Notebook 03 — Leakage and Time Windows\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier.\n\nThe bundle ships with one **deliberate leakage trap**:\n`total_touches_all`. The feature dictionary marks it\n`leakage_risk = True`; the dataset card calls it out;\nnotebook 01 keeps it (matching the validation report's\npanel) while notebook 02 drops it. This notebook turns the\ntrap into a teaching moment.\n\nWe do four things:\n\n1. **Read the receipts.** The trap is documented in the\n   feature dictionary. We surface that label.\n2. **Time-window proof.** Quantify how much\n   `total_touches_all` differs from its snapshot-safe\n   sibling `touch_count` — the difference is post-snapshot\n   information by construction, regardless of how\n   predictive that information turns out to be.\n3. **The lesson.** Run a single-column standalone-AUC probe\n   on the trap (it looks innocuous, ~0.53). Then run the\n   full-panel ± trap comparison: HistGBM extracts a\n   substantial AUC lift (+0.03) from a column whose\n   standalone AUC is barely above chance. Standalone\n   probes undersell tree-friendly leakage.\n4. **Pin the deltas.** Sign-aware tolerance gates so a\n   future regeneration that neutralises the trap (or\n   accidentally amplifies it) breaks CI.\n\n**Public path discipline (G13.3).** This notebook reads only\nfrom `release/intermediate/` (the public student bundle). The\ninstructor companion is **not** loaded — leakage detection\nhas to work from the public artefact alone, since that's all\na downstream consumer ever has."
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_001",
+   "metadata": {},
+   "source": "## 1. Setup"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_002",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from __future__ import annotations\n",
+    "\n",
+    "import json\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "import matplotlib.pyplot as plt\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "from sklearn.compose import ColumnTransformer\n",
+    "from sklearn.ensemble import HistGradientBoostingClassifier\n",
+    "from sklearn.impute import SimpleImputer\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.metrics import roc_auc_score\n",
+    "from sklearn.pipeline import Pipeline\n",
+    "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
+    "\n",
+    "sys.path.insert(0, str(Path.cwd()))\n",
+    "from _notebook_utils import assert_within_tolerance, precision_at_k\n",
+    "\n",
+    "SEED = 42\n",
+    "BUNDLE = Path(\"../intermediate\")  # public student bundle\n",
+    "TASK = \"converted_within_90_days\"\n",
+    "TRAP = \"total_touches_all\"\n",
+    "\n",
+    "with (BUNDLE / \"manifest.json\").open() as fh:\n",
+    "    manifest = json.load(fh)\n",
+    "assert manifest[\"exposure_mode\"] == \"student_public\"\n",
+    "assert manifest[\"relational_snapshot_safe\"] is True\n",
+    "SNAPSHOT_DAY = int(manifest[\"snapshot_day\"])\n",
+    "HORIZON_DAYS = int(manifest[\"horizon_days\"])\n",
+    "print(f\"snapshot_day = {SNAPSHOT_DAY}   horizon_days = {HORIZON_DAYS}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_003",
+   "metadata": {},
+   "source": "## 2. The trap, as the feature dictionary calls it out\n\nThe release ships a `feature_dictionary.csv` next to the\ndata. Any column with `leakage_risk = True` is flagged as a\n**deliberate teaching trap** — included so users can practise\ndetecting it, with the trap's nature documented inline.\n\nTreat the feature dictionary as the first place you look on\nany new dataset. A column named `total_touches_all` is not\nobviously bad until the dictionary tells you it counts\ntouches over the full 90-day horizon, well past the\n30-day snapshot anchor that defines the prediction time."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_004",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "feat_dict = pd.read_csv(BUNDLE / \"feature_dictionary.csv\")\n",
+    "traps = feat_dict[feat_dict[\"leakage_risk\"].astype(bool)]\n",
+    "print(f\"trap columns flagged in feature_dictionary.csv: {len(traps)}\")\n",
+    "for _, row in traps.iterrows():\n",
+    "    print(f\"  {row['name']}: {row['description']}\")\n",
+    "assert TRAP in set(traps[\"name\"]), f\"{TRAP} expected to be flagged in dictionary\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_005",
+   "metadata": {},
+   "source": "## 3. Time-window proof — the trap *by construction*\n\nThe dictionary *says* `total_touches_all` uses post-snapshot\ndata. We verify that on the same row that carries the trap:\nthe task table also carries `touch_count`, the\n**snapshot-safe** touch aggregate (filtered to\n`touch_timestamp <= lead_created_at + snapshot_day`). Their\ndifference is the **post-snapshot delta** — by construction,\ninformation from days 31–90 that the model should never see\nwhen scoring at day 30.\n\nThe pedagogical point is independent of how predictive that\ndifference turns out to be. **A column that uses\npost-snapshot data is invalid at scoring time even when it\nlooks unpredictive in isolation.** Section 4 measures that\n\"looks unpredictive in isolation\" claim directly, then\nsection 5 shows it can be misleading.\n\nWe pool all three task splits so the receipt covers every\nlead in the bundle."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_006",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"train.parquet\")\n",
+    "valid = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"valid.parquet\")\n",
+    "test = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"test.parquet\")\n",
+    "\n",
+    "all_leads = pd.concat([train, valid, test], ignore_index=True)\n",
+    "assert all_leads[\"lead_id\"].is_unique, \"expected one row per lead across train/valid/test\"\n",
+    "\n",
+    "window = all_leads[[\"lead_id\", TRAP, \"touch_count\", TASK]].copy()\n",
+    "window[TRAP] = pd.to_numeric(window[TRAP], errors=\"coerce\")\n",
+    "window[\"touch_count\"] = pd.to_numeric(window[\"touch_count\"], errors=\"coerce\")\n",
+    "window = window.dropna(subset=[TRAP, \"touch_count\"]).copy()\n",
+    "window[\"post_snapshot_touches\"] = window[TRAP] - window[\"touch_count\"]\n",
+    "window[TASK] = window[TASK].astype(\"boolean\").fillna(False).astype(int)\n",
+    "\n",
+    "print(f\"leads used in this section: {len(window):,}\")\n",
+    "print(f\"  {TRAP:<22s}      mean={window[TRAP].mean():6.2f}  max={int(window[TRAP].max()):>4d}\")\n",
+    "print(\n",
+    "    f\"  {'touch_count (snapshot-safe)':<22s}  \"\n",
+    "    f\"mean={window['touch_count'].mean():6.2f}  \"\n",
+    "    f\"max={int(window['touch_count'].max()):>4d}\"\n",
+    ")\n",
+    "mean_delta = float(window[\"post_snapshot_touches\"].mean())\n",
+    "n_post = int((window[\"post_snapshot_touches\"] > 0).sum())\n",
+    "print(\n",
+    "    f\"  {'post-snapshot delta':<22s}      \"\n",
+    "    f\"mean={mean_delta:6.2f}  \"\n",
+    "    f\"max={int(window['post_snapshot_touches'].max()):>4d}\"\n",
+    ")\n",
+    "print(\n",
+    "    f\"  → {n_post:,} of {len(window):,} leads \"\n",
+    "    f\"({n_post / len(window):.1%}) have a positive post-snapshot delta\"\n",
+    ")\n",
+    "assert mean_delta > 0, (\n",
+    "    \"expected a positive mean post-snapshot delta — if zero, the trap may \"\n",
+    "    \"have been silently rebuilt as a snapshot-safe aggregate\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_007",
+   "metadata": {},
+   "source": "### 3.1 The post-snapshot delta is uncorrelated with the label *on this dataset*\n\nOn the v1 procurement world, the count of touches between\nday 30 and day 90 turns out to be roughly the same for\nconverted and non-converted leads — sales reps keep working\nboth groups for a while before the funnel settles. A\nstronger world (more aggressive sales follow-up on hot\nleads) would split these apart; this one doesn't.\n\nThe plot below makes that lack-of-split visible. The trap\nis *still a trap* — we just can't tell that from the\npost-snapshot delta alone, which is why the validation\nreport's `post_snapshot_aggregates` baseline (a single-\ncolumn probe) gives an AUC of only ~0.55. The real damage\nshows up when a tree model gets to combine the trap with\nother columns; section 5 measures that."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_008",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "grouped = window.groupby(TASK)[\"post_snapshot_touches\"].agg([\"mean\", \"median\", \"count\"])\n",
+    "grouped.index = grouped.index.map({0: \"non-converted\", 1: \"converted\"})\n",
+    "print(grouped)\n",
+    "\n",
+    "fig, ax = plt.subplots(figsize=(6, 4))\n",
+    "data = [\n",
+    "    window.loc[window[TASK] == 0, \"post_snapshot_touches\"],\n",
+    "    window.loc[window[TASK] == 1, \"post_snapshot_touches\"],\n",
+    "]\n",
+    "ax.boxplot(data, tick_labels=[\"non-converted\", \"converted\"], showfliers=False)\n",
+    "ax.set_ylabel(\"post-snapshot touches (total_touches_all − touch_count)\")\n",
+    "ax.set_title(\n",
+    "    \"Post-snapshot delta by label\\n(roughly the same — section 5 explains why this is misleading)\"\n",
+    ")\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_009",
+   "metadata": {},
+   "source": "## 4. Standalone-AUC probe (the audit that almost lets the trap pass)\n\nA common leakage audit is to fit a one-feature classifier on\neach suspect column and report the standalone AUC. The\nvalidation report does this at scale — its\n`post_snapshot_aggregates` baseline trains a model on the\nsingle column `total_touches_all` and reports an AUC around\n0.55. That sounds tame, and on a busy schedule it's tempting\nto clear the column on those grounds.\n\nWe re-run the probe here so you've seen the number with your\nown eyes: ~0.53. If that's all you measure, the trap looks\nbarely worth mentioning. Section 5 shows what that audit\nmisses."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_010",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "standalone_window = window.dropna(subset=[TRAP, \"touch_count\"]).copy()\n",
+    "y = standalone_window[TASK].astype(int).to_numpy()\n",
+    "standalone = {\n",
+    "    TRAP: float(roc_auc_score(y, standalone_window[TRAP].to_numpy())),\n",
+    "    \"touch_count (snapshot-safe)\": float(\n",
+    "        roc_auc_score(y, standalone_window[\"touch_count\"].to_numpy())\n",
+    "    ),\n",
+    "    \"post-snapshot delta\": float(\n",
+    "        roc_auc_score(y, standalone_window[\"post_snapshot_touches\"].to_numpy())\n",
+    "    ),\n",
+    "}\n",
+    "print(f\"{'feature':<32s}  {'standalone AUC':>16s}\")\n",
+    "for name, auc in standalone.items():\n",
+    "    print(f\"  {name:<30s}  {auc:>16.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_011",
+   "metadata": {},
+   "source": "## 5. Side-by-side AUC: full panel ± trap\n\nTrain two HistGBM and two Logistic Regression baselines on\nthe **same train/test split, same model, same seed** —\nthe only thing that varies is whether `total_touches_all`\nis in the column list.\n\nWe use the full as-shipped feature panel (every public\nsnapshot column except IDs / label) as the baseline. This\nmirrors notebook 01 / the validation report's setup, so the\nwith-trap AUC reproduces the report's published number and\nthe without-trap AUC is what notebook 02 starts from."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_012",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n",
+    "EXCLUDE = set(ID_COLS + [TASK])\n",
+    "\n",
+    "full_cols = [c for c in train.columns if c not in EXCLUDE]\n",
+    "full_cols_no_trap = [c for c in full_cols if c != TRAP]\n",
+    "print(f\"full panel:         {len(full_cols)} cols (incl. {TRAP})\")\n",
+    "print(f\"full panel no trap: {len(full_cols_no_trap)} cols\")\n",
+    "\n",
+    "\n",
+    "def _split_cols(df: pd.DataFrame, cols: list[str]) -> tuple[list[str], list[str]]:\n",
+    "    cat = [\n",
+    "        c\n",
+    "        for c in cols\n",
+    "        if not (pd.api.types.is_bool_dtype(df[c]) or pd.api.types.is_numeric_dtype(df[c]))\n",
+    "    ]\n",
+    "    num = [c for c in cols if c not in cat]\n",
+    "    return num, cat\n",
+    "\n",
+    "\n",
+    "def _sanitize(df: pd.DataFrame, cat_cols: list[str]) -> pd.DataFrame:\n",
+    "    out = df.copy()\n",
+    "    for c in cat_cols:\n",
+    "        out[c] = out[c].astype(object).where(out[c].notna(), None)\n",
+    "    return out\n",
+    "\n",
+    "\n",
+    "def _build_pipeline(num_cols: list[str], cat_cols: list[str], *, model: str) -> Pipeline:\n",
+    "    num_t = Pipeline(\n",
+    "        [\n",
+    "            (\"imputer\", SimpleImputer(strategy=\"median\")),\n",
+    "            (\"scaler\", StandardScaler()),\n",
+    "        ]\n",
+    "    )\n",
+    "    cat_t = Pipeline(\n",
+    "        [\n",
+    "            (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n",
+    "            (\"encoder\", OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)),\n",
+    "        ]\n",
+    "    )\n",
+    "    pre = ColumnTransformer(\n",
+    "        [(\"num\", num_t, num_cols), (\"cat\", cat_t, cat_cols)],\n",
+    "        remainder=\"drop\",\n",
+    "    )\n",
+    "    if model == \"lr\":\n",
+    "        clf = LogisticRegression(max_iter=1000, solver=\"lbfgs\", random_state=SEED)\n",
+    "    else:\n",
+    "        clf = HistGradientBoostingClassifier(random_state=SEED)\n",
+    "    return Pipeline([(\"preprocessor\", pre), (\"classifier\", clf)])\n",
+    "\n",
+    "\n",
+    "def fit_score(cols: list[str], *, model: str) -> np.ndarray:\n",
+    "    num_cols, cat_cols = _split_cols(train, cols)\n",
+    "    pipe = _build_pipeline(num_cols, cat_cols, model=model)\n",
+    "    pipe.fit(_sanitize(train[cols], cat_cols), y_train)\n",
+    "    return pipe.predict_proba(_sanitize(test[cols], cat_cols))[:, 1]\n",
+    "\n",
+    "\n",
+    "y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n",
+    "y_test = test[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n",
+    "base_rate = float(y_test.mean())\n",
+    "print(f\"test base rate: {base_rate:.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_013",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "results: dict[str, dict[str, float]] = {}\n",
+    "for model in (\"lr\", \"gbm\"):\n",
+    "    p_with = fit_score(full_cols, model=model)\n",
+    "    p_without = fit_score(full_cols_no_trap, model=model)\n",
+    "    results[model] = {\n",
+    "        \"with_trap_auc\": float(roc_auc_score(y_test, p_with)),\n",
+    "        \"without_trap_auc\": float(roc_auc_score(y_test, p_without)),\n",
+    "        \"with_trap_p100\": precision_at_k(p_with, y_test, 100),\n",
+    "        \"without_trap_p100\": precision_at_k(p_without, y_test, 100),\n",
+    "    }\n",
+    "\n",
+    "print(f\"{'model':<5s}  {'with trap':>10s}  {'without trap':>13s}  {'Δ AUC':>8s}\")\n",
+    "for m, r in results.items():\n",
+    "    d = r[\"with_trap_auc\"] - r[\"without_trap_auc\"]\n",
+    "    print(f\"{m:<5s}  {r['with_trap_auc']:>10.4f}  {r['without_trap_auc']:>13.4f}  {d:+8.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_014",
+   "metadata": {},
+   "source": "### 5.1 The lesson — standalone AUC underestimates trap impact\n\nSection 4 says `total_touches_all` is barely above chance\n(~0.53 AUC) on its own. Section 5 says HistGBM extracts a\nsizeable lift (~+0.03 AUC) from the same column once it can\ncombine it with the rest of the feature panel. Both\nmeasurements are correct; they just measure different things.\n\n**Why the gap?** A standalone-AUC probe asks *can this\ncolumn rank leads when it's the only signal you have?* A\ntree model with the rest of the panel already in scope asks\n*can this column refine my existing splits?* The trap's\npost-snapshot information correlates with other columns\nnon-linearly — a few late touches by an outbound rep on an\nengaged-but-not-yet-converted lead is a very different\nsignal from the same touches on a cold lead — and the\ntree can carve the join, while a single-feature probe\nnever sees it. The Logistic Regression gain is much smaller\n(~+0.01) for the same reason: it cannot represent that\ninteraction structure.\n\nBar chart below highlights the asymmetry."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_015",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "labels = [\"GBM full\", \"LR full\"]\n",
+    "deltas = [\n",
+    "    results[\"gbm\"][\"with_trap_auc\"] - results[\"gbm\"][\"without_trap_auc\"],\n",
+    "    results[\"lr\"][\"with_trap_auc\"] - results[\"lr\"][\"without_trap_auc\"],\n",
+    "]\n",
+    "colors = [\"#3b82f6\", \"#9ca3af\"]\n",
+    "fig, ax = plt.subplots(figsize=(6, 4))\n",
+    "ax.bar(range(len(labels)), deltas, color=colors)\n",
+    "ax.axhline(0.0, color=\"#1f2937\", linewidth=0.8)\n",
+    "ax.axhline(\n",
+    "    standalone[TRAP] - 0.5,\n",
+    "    color=\"#ef4444\",\n",
+    "    linestyle=\"--\",\n",
+    "    label=f\"standalone-AUC excess ({standalone[TRAP] - 0.5:+.3f})\",\n",
+    ")\n",
+    "ax.set_xticks(range(len(labels)))\n",
+    "ax.set_xticklabels(labels)\n",
+    "ax.set_ylabel(\"ΔAUC = with_trap − without_trap\")\n",
+    "ax.set_title(\"Trap impact — tree models extract more than the probe predicts\")\n",
+    "ax.legend(loc=\"best\", fontsize=8)\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_016",
+   "metadata": {},
+   "source": "## 6. Tolerance gate (G13.2)\n\nSingle-seed (seed=42) AUCs and trap deltas observed on the\nas-shipped intermediate bundle. Tolerances pin each AUC to\nwithin ±0.02 (well outside numerical jitter, well inside the\nband that would hide a regression). The sign-aware\nassertion below makes the pedagogical claim load-bearing:\nif a regeneration ever neutralises the GBM trap-delta, this\nfails — even if the absolute AUCs stay inside their bands."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_017",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "NB03_TARGETS = {\n",
+    "    \"lr_with_trap_auc\": 0.8827,\n",
+    "    \"lr_without_trap_auc\": 0.8737,\n",
+    "    \"gbm_with_trap_auc\": 0.8754,\n",
+    "    \"gbm_without_trap_auc\": 0.8432,\n",
+    "    \"trap_standalone_auc\": 0.5310,\n",
+    "}\n",
+    "NB03_TOLERANCES = dict.fromkeys(NB03_TARGETS, 0.02)\n",
+    "\n",
+    "observed = {\n",
+    "    \"lr_with_trap_auc\": results[\"lr\"][\"with_trap_auc\"],\n",
+    "    \"lr_without_trap_auc\": results[\"lr\"][\"without_trap_auc\"],\n",
+    "    \"gbm_with_trap_auc\": results[\"gbm\"][\"with_trap_auc\"],\n",
+    "    \"gbm_without_trap_auc\": results[\"gbm\"][\"without_trap_auc\"],\n",
+    "    \"trap_standalone_auc\": standalone[TRAP],\n",
+    "}\n",
+    "assert_within_tolerance(\n",
+    "    observed=observed,\n",
+    "    target=NB03_TARGETS,\n",
+    "    tolerances=NB03_TOLERANCES,\n",
+    "    label=\"notebook 03 trap-panel AUCs (seed 42, intermediate)\",\n",
+    ")\n",
+    "\n",
+    "# Sign-aware: GBM must extract a meaningful lift from the\n",
+    "# trap.  Threshold sits well below the seed-42 observation\n",
+    "# (~+0.032) but well above LR's +0.009, so it specifically\n",
+    "# guards the tree-model lift the section-5 narrative claims.\n",
+    "MIN_GBM_LIFT = 0.015\n",
+    "gbm_lift = results[\"gbm\"][\"with_trap_auc\"] - results[\"gbm\"][\"without_trap_auc\"]\n",
+    "assert gbm_lift > MIN_GBM_LIFT, (\n",
+    "    f\"GBM trap-lift collapsed: {gbm_lift:+.4f} <= {MIN_GBM_LIFT:.4f} — \"\n",
+    "    \"the trap is no longer carrying the pedagogical lesson in section 5\"\n",
+    ")\n",
+    "print(\"OK: trap-panel AUCs in tolerance and GBM lift above MIN_GBM_LIFT.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_018",
+   "metadata": {},
+   "source": "## 7. A detection recipe you can run on any dataset\n\nThe trap was easy to spot here because the dataset\n*advertises* it. On a third-party dataset you don't get\nthat courtesy. The same recipe still works:\n\n1. **Read any feature dictionary you have.** Any column\n   whose description references a window longer than the\n   prediction horizon is suspicious. Even when no\n   dictionary ships, an obvious naming smell (`*_total`,\n   `*_all`, `*_lifetime`) on a 30-day-snapshot dataset is a\n   flag.\n2. **Probe the standalone AUC** *and* **the contribution to\n   a tree model.** A standalone probe alone undersells\n   tree-friendly leakage (sections 4 and 5 demonstrate why\n   on this dataset). Train a model with the column, train\n   another without, and compare. The ablation captures\n   interactions the standalone probe can't.\n3. **Inspect the time window.** Cross-check the suspect\n   column against any time-stamped event tables. If the\n   column's value can only be explained by events past the\n   snapshot anchor, you've found a trap. Section 3 makes\n   this concrete here — the same technique generalises\n   anywhere there's an event table to corroborate.\n\nA walkthrough of additional detection patterns\n(column-name heuristics, isolation-via-residuals,\ntarget-encoding leakage on test) lives in\n`docs/release/break_me_guide.md` (coming in PR 6.3) — pair\nit with this notebook for a more complete playbook.\n\n## Next\n\n- **Notebook 04** — value-aware ranking\n  (`expected_acv` × P(convert)), calibration plots,\n  threshold selection for top-K capacity, and a\n  cohort-shift / bootstrap robustness harness."
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/release/notebooks/04_lift_calibration_value_ranking.ipynb b/release/notebooks/04_lift_calibration_value_ranking.ipynb
new file mode 100644
index 0000000..a94d64b
--- /dev/null
+++ b/release/notebooks/04_lift_calibration_value_ranking.ipynb
@@ -0,0 +1,643 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "cell_000",
+   "metadata": {},
+   "source": "# Notebook 04 — Lift, Calibration, Value-Aware Ranking\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier.\n\nAUC ranks well; that doesn't mean it ranks *for the right\nthing*. Sales teams care about three additional concerns\nAUC alone never tells you about:\n\n1. **Calibration.** Are predicted probabilities trustworthy\n   as point estimates, or just as a ranking?\n2. **Value-aware ranking.** A 30 %-likely lead worth $200K\n   is more valuable than a 60 %-likely one worth $20K.\n   Ranking by P(convert) wastes ACV; ranking by\n   P(convert) × `expected_acv` doesn't.\n3. **Robustness.** Does the model still work next quarter\n   (cohort shift)? How tight is the metric on the test set\n   you have (bootstrap)?\n\nWe answer all three on the public bundle, plus a threshold-\nselection walkthrough that maps a fixed sales-capacity\nconstraint to an operating point. The notebook closes with\na tolerance gate that pins the cohort-shift result to the\npublished validation report — if a regeneration ever\nsilently changes the cohort-degradation behaviour, CI\ncatches it.\n\n**Public path discipline (G13.3).** Loads only\n`release/intermediate/` (the public student bundle).\nInstructor-only artefacts (the latent registry, full-horizon\nevent tables, hidden DAG) are never read.\n\n**Trap discipline.** The headline LR / GBM panel drops\n`total_touches_all` (per notebook 02's leakage discipline)\nso the metrics it reports are honest production numbers.\nThe cohort-shift section deliberately *keeps* the trap to\nreproduce the validation report's cohort-shift block — the\nreport's panel is the as-shipped one, and we want a\ncomparable number, not a cleaner one."
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_001",
+   "metadata": {},
+   "source": "## 1. Setup"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_002",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from __future__ import annotations\n",
+    "\n",
+    "import json\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "import matplotlib.pyplot as plt\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "from sklearn.compose import ColumnTransformer\n",
+    "from sklearn.ensemble import HistGradientBoostingClassifier\n",
+    "from sklearn.impute import SimpleImputer\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.metrics import (\n",
+    "    average_precision_score,\n",
+    "    brier_score_loss,\n",
+    "    roc_auc_score,\n",
+    ")\n",
+    "from sklearn.pipeline import Pipeline\n",
+    "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
+    "\n",
+    "sys.path.insert(0, str(Path.cwd()))\n",
+    "from _notebook_utils import assert_within_tolerance\n",
+    "\n",
+    "SEED = 42\n",
+    "BUNDLE = Path(\"../intermediate\")  # public student bundle\n",
+    "TASK = \"converted_within_90_days\"\n",
+    "TRAP = \"total_touches_all\"\n",
+    "\n",
+    "with (BUNDLE / \"manifest.json\").open() as fh:\n",
+    "    manifest = json.load(fh)\n",
+    "assert manifest[\"exposure_mode\"] == \"student_public\"\n",
+    "assert manifest[\"relational_snapshot_safe\"] is True\n",
+    "\n",
+    "train = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"train.parquet\")\n",
+    "test = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"test.parquet\")\n",
+    "print(f\"train rows: {len(train):,}\")\n",
+    "print(f\"test rows:  {len(test):,}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_003",
+   "metadata": {},
+   "source": "## 2. Train the headline LR + GBM panel\n\nSame preprocessing as notebooks 01 / 02 (mirrors\n`leadforge.validation.release_quality._build_pipeline`).\nWe drop the documented leakage trap `total_touches_all`\nhere so the calibration / lift / value plots in sections\n3–6 reflect an honest production model. The cohort-shift\nevaluation in section 7 uses the validator's full-panel\nposture (trap kept) so its number is comparable to the\npublished validation report."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_004",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n",
+    "EXCLUDE_HEADLINE = set(ID_COLS + [TASK, TRAP])\n",
+    "headline_cols = [c for c in train.columns if c not in EXCLUDE_HEADLINE]\n",
+    "cat_cols = [\n",
+    "    c\n",
+    "    for c in headline_cols\n",
+    "    if not (pd.api.types.is_bool_dtype(train[c]) or pd.api.types.is_numeric_dtype(train[c]))\n",
+    "]\n",
+    "num_cols = [c for c in headline_cols if c not in cat_cols]\n",
+    "print(f\"headline panel: {len(headline_cols)} cols (trap dropped)\")\n",
+    "\n",
+    "\n",
+    "def _sanitize(df: pd.DataFrame, cats: list[str]) -> pd.DataFrame:\n",
+    "    out = df.copy()\n",
+    "    for c in cats:\n",
+    "        out[c] = out[c].astype(object).where(out[c].notna(), None)\n",
+    "    return out\n",
+    "\n",
+    "\n",
+    "def build_pipeline(num: list[str], cat: list[str], *, model: str) -> Pipeline:\n",
+    "    pre = ColumnTransformer(\n",
+    "        [\n",
+    "            (\n",
+    "                \"num\",\n",
+    "                Pipeline(\n",
+    "                    [\n",
+    "                        (\"imputer\", SimpleImputer(strategy=\"median\")),\n",
+    "                        (\"scaler\", StandardScaler()),\n",
+    "                    ]\n",
+    "                ),\n",
+    "                num,\n",
+    "            ),\n",
+    "            (\n",
+    "                \"cat\",\n",
+    "                Pipeline(\n",
+    "                    [\n",
+    "                        (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n",
+    "                        (\n",
+    "                            \"encoder\",\n",
+    "                            OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False),\n",
+    "                        ),\n",
+    "                    ]\n",
+    "                ),\n",
+    "                cat,\n",
+    "            ),\n",
+    "        ],\n",
+    "        remainder=\"drop\",\n",
+    "    )\n",
+    "    clf = (\n",
+    "        LogisticRegression(max_iter=1000, solver=\"lbfgs\", random_state=SEED)\n",
+    "        if model == \"lr\"\n",
+    "        else HistGradientBoostingClassifier(random_state=SEED)\n",
+    "    )\n",
+    "    return Pipeline([(\"preprocessor\", pre), (\"classifier\", clf)])\n",
+    "\n",
+    "\n",
+    "y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n",
+    "y_test = test[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n",
+    "base_rate = float(y_test.mean())\n",
+    "\n",
+    "x_train = _sanitize(train[headline_cols], cat_cols)\n",
+    "x_test = _sanitize(test[headline_cols], cat_cols)\n",
+    "\n",
+    "lr_pipe = build_pipeline(num_cols, cat_cols, model=\"lr\").fit(x_train, y_train)\n",
+    "gbm_pipe = build_pipeline(num_cols, cat_cols, model=\"gbm\").fit(x_train, y_train)\n",
+    "lr_probs = lr_pipe.predict_proba(x_test)[:, 1]\n",
+    "gbm_probs = gbm_pipe.predict_proba(x_test)[:, 1]\n",
+    "\n",
+    "print(f\"  base rate: {base_rate:.3f}\")\n",
+    "print(\n",
+    "    f\"  LR   AUC: {roc_auc_score(y_test, lr_probs):.4f}   \"\n",
+    "    f\"AP: {average_precision_score(y_test, lr_probs):.4f}   \"\n",
+    "    f\"Brier: {brier_score_loss(y_test, lr_probs):.4f}\"\n",
+    ")\n",
+    "print(\n",
+    "    f\"  GBM  AUC: {roc_auc_score(y_test, gbm_probs):.4f}   \"\n",
+    "    f\"AP: {average_precision_score(y_test, gbm_probs):.4f}   \"\n",
+    "    f\"Brier: {brier_score_loss(y_test, gbm_probs):.4f}\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_005",
+   "metadata": {},
+   "source": "## 3. Calibration / reliability diagram\n\nBin LR's predicted probabilities into ten equal-width\nbuckets, plot mean predicted vs mean observed. A perfectly\ncalibrated model lies on the diagonal; LR after\n`StandardScaler + LogisticRegression` is usually close.\nWe also surface `max_bin_error` — the worst gap across\nnon-empty bins — which the validation report tracks\n(`tiers.intermediate.medians.calibration_max_bin_error`)."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_006",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "edges = np.linspace(0.0, 1.0, 11)\n",
+    "mean_pred: list[float] = []\n",
+    "mean_actual: list[float] = []\n",
+    "bin_n: list[int] = []\n",
+    "for i in range(10):\n",
+    "    lo, hi = edges[i], edges[i + 1]\n",
+    "    mask = (lr_probs >= lo) & ((lr_probs <= hi) if i == 9 else (lr_probs < hi))\n",
+    "    if mask.sum() == 0:\n",
+    "        continue\n",
+    "    mean_pred.append(float(lr_probs[mask].mean()))\n",
+    "    mean_actual.append(float(y_test[mask].mean()))\n",
+    "    bin_n.append(int(mask.sum()))\n",
+    "\n",
+    "max_bin_err = max(abs(p - a) for p, a in zip(mean_pred, mean_actual, strict=False))\n",
+    "print(f\"max bin error (LR): {max_bin_err:.4f}\")\n",
+    "for p, a, n in zip(mean_pred, mean_actual, bin_n, strict=False):\n",
+    "    print(f\"  pred={p:.3f}  actual={a:.3f}  n={n:>4d}\")\n",
+    "\n",
+    "fig, ax = plt.subplots(figsize=(5, 5))\n",
+    "ax.plot([0, 1], [0, 1], color=\"#9ca3af\", linestyle=\"--\", label=\"perfect calibration\")\n",
+    "ax.plot(mean_pred, mean_actual, marker=\"o\", color=\"#3b82f6\", label=\"LR\")\n",
+    "ax.set_xlim(0, 1)\n",
+    "ax.set_ylim(0, 1)\n",
+    "ax.set_xlabel(\"Mean predicted probability\")\n",
+    "ax.set_ylabel(\"Observed conversion rate\")\n",
+    "ax.set_title(\"Calibration — LR, intermediate tier (seed 42)\")\n",
+    "ax.legend(loc=\"upper left\")\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_007",
+   "metadata": {},
+   "source": "## 4. Lift and cumulative gains\n\nTwo complementary views:\n\n* **Cumulative gains**: fraction of positives captured as\n  you sweep the score threshold. The top 10 % of the ranked\n  list captures ~26 % of converted leads on this seed (vs\n  the 10 % a random ranker would catch).\n* **Lift at *k* %**: `top_k_conversion_rate / base_rate`.\n  A lift of 2 at *k* = 1 means \"the top 1 % of leads\n  convert at twice the base rate.\"\n\nBoth metrics are in `release/validation/validation_report.json`\n(`per_seed[0].cumulative_gains` and `per_seed[0].lift_at_pct`)\nso the reproduction is auditable."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_008",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "order = np.argsort(-lr_probs, kind=\"stable\")\n",
+    "y_sorted = y_test[order]\n",
+    "n = len(y_test)\n",
+    "n_pos = int(y_test.sum())\n",
+    "\n",
+    "# Cumulative gains: fraction of positives captured by top-pct.\n",
+    "pcts = np.arange(0, 101, 10)\n",
+    "gains = []\n",
+    "for pct in pcts:\n",
+    "    k = max(1, int(round(n * pct / 100.0)))\n",
+    "    if pct == 0:\n",
+    "        gains.append(0.0)\n",
+    "    else:\n",
+    "        gains.append(float(y_sorted[:k].sum() / n_pos))\n",
+    "\n",
+    "# Lift at 1 / 5 / 10 %.\n",
+    "lifts = {}\n",
+    "for pct in [1.0, 5.0, 10.0]:\n",
+    "    k = max(1, int(round(n * pct / 100.0)))\n",
+    "    lifts[pct] = float(y_sorted[:k].mean() / base_rate)\n",
+    "\n",
+    "for pct, lift in lifts.items():\n",
+    "    print(f\"  lift @ top {pct:>4.0f}%: {lift:.3f}x\")\n",
+    "\n",
+    "fig, axes = plt.subplots(1, 2, figsize=(11, 4))\n",
+    "axes[0].plot(pcts, gains, marker=\"o\", color=\"#3b82f6\", label=\"LR\")\n",
+    "axes[0].plot([0, 100], [0, 1], color=\"#9ca3af\", linestyle=\"--\", label=\"random\")\n",
+    "axes[0].set_xlabel(\"Top-pct of ranked leads\")\n",
+    "axes[0].set_ylabel(\"Cumulative conversion capture\")\n",
+    "axes[0].set_title(\"Cumulative gains\")\n",
+    "axes[0].legend(loc=\"lower right\")\n",
+    "\n",
+    "axes[1].bar(\n",
+    "    [str(int(p)) for p in lifts],\n",
+    "    list(lifts.values()),\n",
+    "    color=\"#3b82f6\",\n",
+    ")\n",
+    "axes[1].axhline(1.0, color=\"#ef4444\", linestyle=\"--\", label=\"random (lift=1)\")\n",
+    "axes[1].set_xlabel(\"Top-pct of ranked leads\")\n",
+    "axes[1].set_ylabel(\"Lift over base rate\")\n",
+    "axes[1].set_title(\"Lift at top-pct\")\n",
+    "axes[1].legend()\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_009",
+   "metadata": {},
+   "source": "## 5. Value-aware ranking — `expected_acv` × P(convert)\n\nSales reps don't have infinite capacity, so the right\nobjective is rarely \"maximise conversion count\" — it's\n\"maximise revenue captured per outreach slot.\" The bundle\nships an `expected_acv` column (opportunity ACV when\navailable, else revenue-band midpoint heuristic) which\nmakes value-aware ranking trivial:\n\n$$ \\text{score}_\\text{value} = P(\\text{convert}) \\times\n\\text{expected\\_acv} $$\n\nWe compare two top-K policies — rank by P(convert) only\nvs rank by score_value — and report\n`expected_acv_capture_at_k = sum(acv * y) over top-K /\nsum(acv * y) over the whole test`. The validation report's\n`per_seed[0].expected_acv_capture_at_k` is the reference."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_010",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "acv = pd.to_numeric(test[\"expected_acv\"], errors=\"coerce\").fillna(0.0).to_numpy()\n",
+    "value_score = lr_probs * acv\n",
+    "\n",
+    "\n",
+    "def acv_capture(scores: np.ndarray, k: int) -> float:\n",
+    "    order = np.argsort(-scores, kind=\"stable\")\n",
+    "    captured = float(np.sum(acv[order[:k]] * y_test[order[:k]]))\n",
+    "    total = float(np.sum(acv * y_test))\n",
+    "    return captured / total if total > 0 else float(\"nan\")\n",
+    "\n",
+    "\n",
+    "print(f\"{'top-K':<9s}  {'cap by P(conv)':>14s}  {'cap by P×ACV':>13s}  {'gain':>7s}\")\n",
+    "value_gains = {}\n",
+    "for k in (50, 100, 200):\n",
+    "    cap_p = acv_capture(lr_probs, k)\n",
+    "    cap_v = acv_capture(value_score, k)\n",
+    "    value_gains[k] = cap_v - cap_p\n",
+    "    print(f\"  top {k:<3d}  {cap_p:>14.4f}  {cap_v:>13.4f}  {cap_v - cap_p:+7.4f}\")\n",
+    "\n",
+    "# Plot side-by-side ACV capture for K in 10..300.\n",
+    "ks = np.arange(10, 301, 10)\n",
+    "cap_p = [acv_capture(lr_probs, int(k)) for k in ks]\n",
+    "cap_v = [acv_capture(value_score, int(k)) for k in ks]\n",
+    "fig, ax = plt.subplots(figsize=(7, 4))\n",
+    "ax.plot(ks, cap_p, marker=\"o\", color=\"#9ca3af\", label=\"rank by P(convert)\")\n",
+    "ax.plot(ks, cap_v, marker=\"o\", color=\"#3b82f6\", label=\"rank by P(convert)×ACV\")\n",
+    "ax.set_xlabel(\"top-K leads contacted\")\n",
+    "ax.set_ylabel(\"Fraction of converted-ACV captured\")\n",
+    "ax.set_title(\"Value-aware ranking captures more revenue per outreach slot\")\n",
+    "ax.legend(loc=\"lower right\")\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_011",
+   "metadata": {},
+   "source": "## 6. Threshold selection for fixed top-K capacity\n\nSales rarely has the patience for \"score everything, run\nstats.\" The realistic ask is: *\"My team can work 50 leads\nthis week. Set a probability threshold that selects ~50\nfrom the test population.\"*\n\nWe take the K-th highest LR score as the threshold and\nreport precision / recall / count above it, then sweep\nthresholds across the upper half of the LR score\ndistribution to show how count and precision move around\nthat operating point."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_012",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "CAPACITY = 50\n",
+    "\n",
+    "sorted_probs = np.sort(lr_probs)[::-1]\n",
+    "# The K-th highest probability is the largest threshold that\n",
+    "# admits at least K leads; ties at that score can push the count above K.\n",
+    "threshold = float(sorted_probs[CAPACITY - 1])\n",
+    "mask = lr_probs >= threshold\n",
+    "n_above = int(mask.sum())\n",
+    "prec = float(y_test[mask].mean()) if n_above > 0 else float(\"nan\")\n",
+    "recall = float(y_test[mask].sum() / max(int(y_test.sum()), 1))\n",
+    "print(\n",
+    "    f\"capacity={CAPACITY}  threshold={threshold:.3f}  \"\n",
+    "    f\"actually_above={n_above}  precision={prec:.3f}  recall={recall:.3f}\"\n",
+    ")\n",
+    "\n",
+    "# Threshold sweep — show what happens around the operating\n",
+    "# point so the threshold choice is informed, not magic.\n",
+    "thresholds = np.linspace(float(np.quantile(lr_probs, 0.5)), float(np.max(lr_probs)), 30)\n",
+    "counts = [int((lr_probs >= t).sum()) for t in thresholds]\n",
+    "precs = [\n",
+    "    float(y_test[lr_probs >= t].mean()) if (lr_probs >= t).sum() > 0 else 0.0 for t in thresholds\n",
+    "]\n",
+    "\n",
+    "fig, axes = plt.subplots(1, 2, figsize=(11, 4))\n",
+    "axes[0].plot(thresholds, counts, marker=\"o\", color=\"#3b82f6\")\n",
+    "axes[0].axhline(CAPACITY, color=\"#ef4444\", linestyle=\"--\", label=f\"capacity={CAPACITY}\")\n",
+    "axes[0].axvline(threshold, color=\"#10b981\", linestyle=\"--\", label=f\"chosen ({threshold:.3f})\")\n",
+    "axes[0].set_xlabel(\"threshold\")\n",
+    "axes[0].set_ylabel(\"# leads above threshold\")\n",
+    "axes[0].set_title(\"Threshold sweep — count above\")\n",
+    "axes[0].legend()\n",
+    "\n",
+    "axes[1].plot(thresholds, precs, marker=\"o\", color=\"#3b82f6\")\n",
+    "axes[1].axhline(base_rate, color=\"#9ca3af\", linestyle=\"--\", label=f\"base rate ({base_rate:.3f})\")\n",
+    "axes[1].axvline(threshold, color=\"#10b981\", linestyle=\"--\", label=f\"chosen ({threshold:.3f})\")\n",
+    "axes[1].set_xlabel(\"threshold\")\n",
+    "axes[1].set_ylabel(\"precision above threshold\")\n",
+    "axes[1].set_title(\"Threshold sweep — precision above\")\n",
+    "axes[1].legend()\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_013",
+   "metadata": {},
+   "source": "## 7. Cohort-shift evaluation\n\nThe bundle's train/test split is a uniform random split of\nleads. A more realistic stress test is to train on the\nearliest leads chronologically and score the latest ones,\nbecause in production you always have to predict the\n*future*, never a held-out random sample of the past.\n\nWe mirror the validator's cohort-shift logic\n(`leadforge.validation.release_quality.measure_cohort_shift_from_bundle`):\npool train + test, sort by `lead_created_at` with `lead_id`\nas a stable tiebreak, train HistGBM on the first 85 % and\nscore the last 15 % (the validator's `COHORT_TRAIN_FRAC`).\nBoth random and cohort splits use the full feature panel\n**including** the trap, matching the report's posture so\nthe numbers compare directly. The HistGBM uses\n`random_state=0` here (the validator's\n`DEFAULT_MODEL_RANDOM_STATE`) instead of the notebook's\ndefault `SEED=42`; that matters for reproducing the\ncohort-shift numbers down to the third decimal.\n\nThe expected behaviour for the v1 intermediate tier is\n*no* degradation: the report shows the cohort split AUC\nrunning ~0.015 *higher* than the random split. That's a\nsurprise worth surfacing: the v1 simulator's intermediate\nworld doesn't drift over its 90-day horizon, so cohort\norder isn't a stressor here. The intro and advanced\ntiers show small positive degradations (intro +0.016,\nadvanced +0.010); see\n`release/validation/validation_report.json` ⇒\n`cohort_shift`."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_014",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Constants mirror leadforge.validation.release_quality so\n",
+    "# the numbers reproduce the report's cohort-shift block.\n",
+    "COHORT_TRAIN_FRAC = 0.85\n",
+    "COHORT_MODEL_SEED = 0\n",
+    "\n",
+    "# Cohort-shift uses the validator's full panel (trap kept).\n",
+    "EXCLUDE_FULL = set(ID_COLS + [TASK])\n",
+    "full_cols = [c for c in train.columns if c not in EXCLUDE_FULL]\n",
+    "cat_full = [\n",
+    "    c\n",
+    "    for c in full_cols\n",
+    "    if not (pd.api.types.is_bool_dtype(train[c]) or pd.api.types.is_numeric_dtype(train[c]))\n",
+    "]\n",
+    "num_full = [c for c in full_cols if c not in cat_full]\n",
+    "\n",
+    "\n",
+    "def _gbm_pipeline_for_cohort() -> Pipeline:\n",
+    "    # Local builder so the validator's ``model_random_state=0``\n",
+    "    # is used here, while the headline panel above keeps\n",
+    "    # ``random_state=SEED`` for the section-2 LR/GBM models.\n",
+    "    pre = ColumnTransformer(\n",
+    "        [\n",
+    "            (\n",
+    "                \"num\",\n",
+    "                Pipeline(\n",
+    "                    [\n",
+    "                        (\"imputer\", SimpleImputer(strategy=\"median\")),\n",
+    "                        (\"scaler\", StandardScaler()),\n",
+    "                    ]\n",
+    "                ),\n",
+    "                num_full,\n",
+    "            ),\n",
+    "            (\n",
+    "                \"cat\",\n",
+    "                Pipeline(\n",
+    "                    [\n",
+    "                        (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n",
+    "                        (\n",
+    "                            \"encoder\",\n",
+    "                            OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False),\n",
+    "                        ),\n",
+    "                    ]\n",
+    "                ),\n",
+    "                cat_full,\n",
+    "            ),\n",
+    "        ],\n",
+    "        remainder=\"drop\",\n",
+    "    )\n",
+    "    clf = HistGradientBoostingClassifier(random_state=COHORT_MODEL_SEED)\n",
+    "    return Pipeline([(\"preprocessor\", pre), (\"classifier\", clf)])\n",
+    "\n",
+    "\n",
+    "# Random split AUC = HistGBM on the bundle's existing split.\n",
+    "rand_pipe = _gbm_pipeline_for_cohort().fit(_sanitize(train[full_cols], cat_full), y_train)\n",
+    "random_split_auc = float(\n",
+    "    roc_auc_score(\n",
+    "        y_test,\n",
+    "        rand_pipe.predict_proba(_sanitize(test[full_cols], cat_full))[:, 1],\n",
+    "    )\n",
+    ")\n",
+    "\n",
+    "# Chronological resplit: pool, sort by lead_created_at +\n",
+    "# lead_id (stable tiebreak), take first 85 % as train, last\n",
+    "# 15 % as test.  Mirrors ``measure_cohort_shift_from_bundle``.\n",
+    "pooled = pd.concat([train, test], ignore_index=True)\n",
+    "ts = pd.to_datetime(pooled[\"lead_created_at\"], errors=\"coerce\")\n",
+    "assert not ts.isna().any(), \"expected every lead to have a parseable lead_created_at\"\n",
+    "sort_frame = pd.DataFrame({\"_ts\": ts.values, \"_lid\": pooled[\"lead_id\"].astype(str).values})\n",
+    "order = sort_frame.sort_values([\"_ts\", \"_lid\"], kind=\"stable\").index.to_numpy()\n",
+    "cutoff = int(round(len(pooled) * COHORT_TRAIN_FRAC))\n",
+    "early = pooled.iloc[order[:cutoff]]\n",
+    "late = pooled.iloc[order[cutoff:]]\n",
+    "y_early = early[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n",
+    "y_late = late[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n",
+    "\n",
+    "cohort_pipe = _gbm_pipeline_for_cohort().fit(_sanitize(early[full_cols], cat_full), y_early)\n",
+    "cohort_split_auc = float(\n",
+    "    roc_auc_score(\n",
+    "        y_late,\n",
+    "        cohort_pipe.predict_proba(_sanitize(late[full_cols], cat_full))[:, 1],\n",
+    "    )\n",
+    ")\n",
+    "auc_degradation = random_split_auc - cohort_split_auc\n",
+    "print(f\"random_split_auc:  {random_split_auc:.4f}\")\n",
+    "print(f\"cohort_split_auc:  {cohort_split_auc:.4f}\")\n",
+    "print(f\"auc_degradation:   {auc_degradation:+.4f}  (positive = cohort is harder)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_015",
+   "metadata": {},
+   "source": "## 8. Bootstrap robustness — within-bundle metric variance\n\nCross-seed metric variance (the validation report's\n`tiers.intermediate.spreads.gbm_auc = 0.027`) is the\ncleanest answer to \"how confident is this AUC?\", but it\nrequires regenerating the bundle from N seeds — something\na public-bundle consumer (Kaggle / HF) can't easily do.\n\nThe within-bundle proxy is **non-parametric bootstrap of\nthe test set**. We resample the 750 test rows with\nreplacement, re-rank using the model probabilities we\nalready have, and recompute AUC / AP. 200 resamples is\nenough to read a confidence band off the distribution.\n\nThe bootstrap variance is **smaller** than the cross-seed\nvariance — it captures sampling noise on a single\ngenerated world, not generation-process noise across\nseeds — but it's the right number for the question\n\"given *this* test set, how stable is the AUC?\""
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_016",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "N_BOOT = 200\n",
+    "rng = np.random.default_rng(SEED)\n",
+    "\n",
+    "boot_lr_auc = np.empty(N_BOOT)\n",
+    "boot_gbm_auc = np.empty(N_BOOT)\n",
+    "boot_lr_ap = np.empty(N_BOOT)\n",
+    "n_test = len(y_test)\n",
+    "for i in range(N_BOOT):\n",
+    "    idx = rng.integers(0, n_test, n_test)\n",
+    "    if y_test[idx].sum() == 0 or y_test[idx].sum() == n_test:\n",
+    "        # Degenerate single-class resample: record NaN and skip it.\n",
+    "        boot_lr_auc[i] = np.nan\n",
+    "        boot_gbm_auc[i] = np.nan\n",
+    "        boot_lr_ap[i] = np.nan\n",
+    "        continue\n",
+    "    boot_lr_auc[i] = roc_auc_score(y_test[idx], lr_probs[idx])\n",
+    "    boot_gbm_auc[i] = roc_auc_score(y_test[idx], gbm_probs[idx])\n",
+    "    boot_lr_ap[i] = average_precision_score(y_test[idx], lr_probs[idx])\n",
+    "\n",
+    "\n",
+    "def _summary(arr: np.ndarray, name: str) -> None:\n",
+    "    arr = arr[~np.isnan(arr)]\n",
+    "    lo, med, hi = np.quantile(arr, [0.025, 0.5, 0.975])\n",
+    "    print(\n",
+    "        f\"  {name:<14s}  median={med:.4f}  \"\n",
+    "        f\"95% CI=[{lo:.4f}, {hi:.4f}]  IQR={(np.quantile(arr, 0.75) - np.quantile(arr, 0.25)):.4f}\"\n",
+    "    )\n",
+    "\n",
+    "\n",
+    "print(f\"bootstrap on test set, n_iters={N_BOOT}, seed={SEED}:\")\n",
+    "_summary(boot_lr_auc, \"LR AUC\")\n",
+    "_summary(boot_gbm_auc, \"GBM AUC\")\n",
+    "_summary(boot_lr_ap, \"LR AP\")\n",
+    "\n",
+    "fig, ax = plt.subplots(figsize=(7, 4))\n",
+    "ax.hist(boot_lr_auc, bins=30, color=\"#3b82f6\", alpha=0.7, label=\"LR AUC\")\n",
+    "ax.hist(boot_gbm_auc, bins=30, color=\"#9ca3af\", alpha=0.7, label=\"GBM AUC\")\n",
+    "ax.axvline(roc_auc_score(y_test, lr_probs), color=\"#1d4ed8\", linestyle=\"--\", label=\"LR (point)\")\n",
+    "ax.axvline(roc_auc_score(y_test, gbm_probs), color=\"#374151\", linestyle=\"--\", label=\"GBM (point)\")\n",
+    "ax.set_xlabel(\"AUC\")\n",
+    "ax.set_ylabel(\"# bootstrap draws\")\n",
+    "ax.set_title(f\"Bootstrap AUC distribution (n={N_BOOT})\")\n",
+    "ax.legend(loc=\"upper left\", fontsize=8)\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_017",
+   "metadata": {},
+   "source": "## 9. Tolerance gate (G13.2)\n\nThree groups of pinned values:\n\n* **Cohort-shift block**: pinned to\n  `release/notebooks/_release_targets.json`'s\n  `cohort_shift.intermediate`, which is itself audit-synced\n  against `validation_report.json`'s `cohort_shift.intermediate`\n  by `tests/release/notebooks/test_release_targets_match_report.py`.\n  That audit-sync is what makes the \"this notebook\n  reproduces the report\" claim meaningful.\n* **Calibration / lift / value-capture**: pinned inline\n  against the seed-42 single-run values from the\n  validation report's `per_seed[0]` block. Tolerances\n  widen for small-K metrics (P@K, value capture) because\n  their seed-to-seed variance is larger.\n* **Bootstrap medians**: pinned inline against the\n  seed-42 point estimates (the bootstrap median converges\n  to the data-specific value, not to the cross-seed\n  median).\n\nThe headline lift sign-check (`gbm_auc > lr_auc - eps`) is\n*not* asserted: the v1 dataset documents the surprising\nfinding that LR ≥ GBM on intermediate (see\n`release/validation/validation_report.md`, gate G7.4.4)."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_018",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with (Path.cwd() / \"_release_targets.json\").open() as fh:\n",
+    "    release_targets = json.load(fh)\n",
+    "cohort_targets = release_targets[\"cohort_shift\"][\"intermediate\"]\n",
+    "\n",
+    "cohort_observed = {\n",
+    "    \"random_split_auc\": random_split_auc,\n",
+    "    \"cohort_split_auc\": cohort_split_auc,\n",
+    "    \"auc_degradation\": auc_degradation,\n",
+    "}\n",
+    "assert_within_tolerance(\n",
+    "    observed=cohort_observed,\n",
+    "    target=cohort_targets,\n",
+    "    tolerances={\n",
+    "        # ±0.02 on AUCs — well outside numerical jitter,\n",
+    "        # well inside the band that would let the\n",
+    "        # cohort-shift sign flip silently.\n",
+    "        \"random_split_auc\": 0.02,\n",
+    "        \"cohort_split_auc\": 0.02,\n",
+    "        # Wider on the difference: each AUC can move by\n",
+    "        # ±0.02 independently, so their difference can\n",
+    "        # drift by up to ±0.04 in the worst case.\n",
+    "        \"auc_degradation\": 0.04,\n",
+    "    },\n",
+    "    label=\"notebook 04 cohort-shift vs validation_report (intermediate)\",\n",
+    ")\n",
+    "\n",
+    "# Inline pins for the seed-42 single-run values *of the\n",
+    "# without-trap headline panel*.  These are not the report's\n",
+    "# published numbers (the report keeps the trap) — the\n",
+    "# report-level pin lives in section 9's cohort-shift block,\n",
+    "# which is the only metric this notebook reproduces against\n",
+    "# the report.  Notebook 02 trains the same trap-dropped LR\n",
+    "# and reports the same AUCs, so these values are also\n",
+    "# cross-checked there.\n",
+    "NB04_TARGETS = {\n",
+    "    \"lr_auc\": 0.8737,\n",
+    "    \"gbm_auc\": 0.8432,\n",
+    "    \"lr_max_bin_err\": 0.1344,\n",
+    "    \"lift_at_5pct\": 2.4819,\n",
+    "    \"lift_at_10pct\": 2.7536,\n",
+    "    \"acv_cap_50\": 0.1615,\n",
+    "    \"acv_cap_100\": 0.3702,\n",
+    "    # Bootstrap medians converge to the seed-42 point\n",
+    "    # estimates within sampling noise.\n",
+    "    \"boot_lr_auc_median\": 0.8757,\n",
+    "    \"boot_gbm_auc_median\": 0.8440,\n",
+    "}\n",
+    "NB04_TOLERANCES = {\n",
+    "    \"lr_auc\": 0.02,\n",
+    "    \"gbm_auc\": 0.02,\n",
+    "    \"lr_max_bin_err\": 0.05,\n",
+    "    \"lift_at_5pct\": 0.30,\n",
+    "    \"lift_at_10pct\": 0.30,\n",
+    "    \"acv_cap_50\": 0.05,\n",
+    "    \"acv_cap_100\": 0.05,\n",
+    "    \"boot_lr_auc_median\": 0.03,\n",
+    "    \"boot_gbm_auc_median\": 0.03,\n",
+    "}\n",
+    "observed = {\n",
+    "    \"lr_auc\": float(roc_auc_score(y_test, lr_probs)),\n",
+    "    \"gbm_auc\": float(roc_auc_score(y_test, gbm_probs)),\n",
+    "    \"lr_max_bin_err\": float(max_bin_err),\n",
+    "    \"lift_at_5pct\": lifts[5.0],\n",
+    "    \"lift_at_10pct\": lifts[10.0],\n",
+    "    \"acv_cap_50\": acv_capture(lr_probs, 50),\n",
+    "    \"acv_cap_100\": acv_capture(lr_probs, 100),\n",
+    "    \"boot_lr_auc_median\": float(np.nanmedian(boot_lr_auc)),\n",
+    "    \"boot_gbm_auc_median\": float(np.nanmedian(boot_gbm_auc)),\n",
+    "}\n",
+    "assert_within_tolerance(\n",
+    "    observed=observed,\n",
+    "    target=NB04_TARGETS,\n",
+    "    tolerances=NB04_TOLERANCES,\n",
+    "    label=\"notebook 04 metric panel (seed 42, intermediate)\",\n",
+    ")\n",
+    "\n",
+    "# Sign-aware: value-aware ranking should not be worse than\n",
+    "# P-only ranking on aggregate.  The headline finding stays\n",
+    "# in the narrative regardless of the exact numbers.\n",
+    "for k, gain in value_gains.items():\n",
+    "    assert gain >= -0.01, (\n",
+    "        f\"value-aware ranking lost ground at top-{k} ({gain:+.4f}); \"\n",
+    "        \"the P×ACV story is no longer load-bearing\"\n",
+    "    )\n",
+    "print(\"OK — cohort-shift, calibration, lift, value-capture, and bootstrap all pinned.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_019",
+   "metadata": {},
+   "source": "## 10. Summary\n\n* The LR baseline is well-calibrated (max bin error ≈ 0.13\n  on this seed) and lifts the top decile to ~2.6× the base\n  rate.\n* Value-aware ranking (P × ACV) captures more revenue per\n  top-K slot than P-only ranking — the gap depends on K\n  but is positive across all sizes we tested.\n* Cohort shift is **negative** on the intermediate tier\n  (the late cohort is *easier*, not harder); the report\n  documents this, and the notebook reproduces it. The\n  intro and advanced tiers show small positive\n  degradations.\n* Bootstrap on the existing test split gives a\n  within-bundle confidence band that's tighter than the cross-seed\n  spread the validation report computes — useful for \"how\n  confident is this single AUC\" questions, not for \"how\n  much does the bundle move across seeds.\"\n\n## Where to go next\n\n1. Try cohort-shifted training in production: refit weekly\n   on the trailing 60-day window, score the next 7 days.\n2. If you have real ACV data, swap the `expected_acv`\n   heuristic for it and recompute section 5 — the revenue\n   capture story should sharpen.\n3. The break-me playbook in `docs/release/break_me_guide.md`\n   (coming in PR 6.3) catalogues additional stress tests\n   (target-encoding leakage, train-test contamination,\n   cohort-by-segment) and how to detect each from a\n   single bundle."
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/release/notebooks/_release_targets.json b/release/notebooks/_release_targets.json
index 7ed4cdb..9d36f7c 100644
--- a/release/notebooks/_release_targets.json
+++ b/release/notebooks/_release_targets.json
@@ -1,5 +1,13 @@
 {
  "_doc": "Cross-seed-median metric values from release/validation/validation_report.json, sliced to the metrics the release notebooks pin via assert_within_tolerance. Audited against the report by tests/release/notebooks/test_release_targets_match_report.py — if you change a value here, the test will fail unless the corresponding median in the validation report changes to match.",
+ "cohort_shift": {
+  "_doc": "Per-tier cohort-shift metrics from validation_report.cohort_shift (single-seed values; the report runs cohort-shift only on seed 42). Notebook 04 reproduces these via a chronological resplit and pins them via assert_within_tolerance.",
+  "intermediate": {
+   "auc_degradation": -0.015458147938307687,
+   "cohort_split_auc": 0.8908394607843138,
+   "random_split_auc": 0.8753813128460061
+  }
+ },
  "intermediate": {
   "brier_score": 0.10963449613199748,
   "gbm_auc": 0.875461913160326,
diff --git a/scripts/build_release_notebook_03.py b/scripts/build_release_notebook_03.py
new file mode 100644
index 0000000..3df45bb
--- /dev/null
+++ b/scripts/build_release_notebook_03.py
@@ -0,0 +1,542 @@
+"""One-shot builder for ``release/notebooks/03_leakage_and_time_windows.ipynb``.
+
+Run from the repository root::
+
+    python scripts/build_release_notebook_03.py
+
+Cells are assigned deterministic IDs by ``_release_notebook_common`` so
+re-running yields a byte-identical file — same audit-artifact-sync
+pattern PR 4.1 / 5.1 / 5.2 use for ``release/`` artifacts.
+"""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent))
+
+import nbformat as nbf  # noqa: E402 — must follow sys.path insert
+from _release_notebook_common import (  # noqa: E402 — must follow sys.path insert
+    assemble_notebook,
+    builder_arg_parser,
+    code,
+    md,
+    write_notebook,
+)
+
+DEFAULT_OUT = (
+    Path(__file__).resolve().parents[1]
+    / "release"
+    / "notebooks"
+    / "03_leakage_and_time_windows.ipynb"
+)
+
+
+def cells() -> list[nbf.NotebookNode]:
+    return [
+        md(
+            """
+            # Notebook 03 — Leakage and Time Windows
+
+            **Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier.
+
+            The bundle ships with one **deliberate leakage trap**:
+            `total_touches_all`. The feature dictionary marks it
+            `leakage_risk = True`; the dataset card calls it out;
+            notebook 01 keeps it (matching the validation report's
+            panel) while notebook 02 drops it. This notebook turns the
+            trap into a teaching moment.
+
+            We do four things:
+
+            1. **Read the receipts.** The trap is documented in the
+               feature dictionary. We surface that label.
+            2. **Time-window proof.** Quantify how much
+               `total_touches_all` differs from its snapshot-safe
+               sibling `touch_count` — the difference is post-snapshot
+               information by construction, regardless of how
+               predictive that information turns out to be.
+            3. **The lesson.** Run a single-column standalone-AUC probe
+               on the trap (it looks innocuous, ~0.53). Then run the
+               full-panel ± trap comparison: HistGBM extracts a
+               substantial AUC lift (+0.03) from a column whose
+               standalone AUC is barely above chance. Standalone
+               probes undersell tree-friendly leakage.
+            4. **Pin the deltas.** Sign-aware tolerance gates so a
+               future regeneration that neutralises the trap (or
+               accidentally amplifies it) breaks CI.
+
+            **Public path discipline (G13.3).** This notebook reads only
+            from `release/intermediate/` (the public student bundle). The
+            instructor companion is **not** loaded — leakage detection
+            has to work from the public artefact alone, since that's all
+            a downstream consumer ever has.
+            """
+        ),
+        md("## 1. Setup"),
+        code(
+            """
+            from __future__ import annotations
+
+            import json
+            import sys
+            from pathlib import Path
+
+            import matplotlib.pyplot as plt
+            import numpy as np
+            import pandas as pd
+            from sklearn.compose import ColumnTransformer
+            from sklearn.ensemble import HistGradientBoostingClassifier
+            from sklearn.impute import SimpleImputer
+            from sklearn.linear_model import LogisticRegression
+            from sklearn.metrics import roc_auc_score
+            from sklearn.pipeline import Pipeline
+            from sklearn.preprocessing import OneHotEncoder, StandardScaler
+
+            sys.path.insert(0, str(Path.cwd()))
+            from _notebook_utils import assert_within_tolerance, precision_at_k
+
+            SEED = 42
+            BUNDLE = Path("../intermediate")          # public student bundle
+            TASK = "converted_within_90_days"
+            TRAP = "total_touches_all"
+
+            with (BUNDLE / "manifest.json").open() as fh:
+                manifest = json.load(fh)
+            assert manifest["exposure_mode"] == "student_public"
+            assert manifest["relational_snapshot_safe"] is True
+            SNAPSHOT_DAY = int(manifest["snapshot_day"])
+            HORIZON_DAYS = int(manifest["horizon_days"])
+            print(f"snapshot_day = {SNAPSHOT_DAY}   horizon_days = {HORIZON_DAYS}")
+            """
+        ),
+        md(
+            """
+            ## 2. The trap, as the feature dictionary calls it out
+
+            The release ships a `feature_dictionary.csv` next to the
+            data. Any column with `leakage_risk = True` is flagged as a
+            **deliberate teaching trap** — included so users can practise
+            detecting it, with the trap's nature documented inline.
+
+            Treat the feature dictionary as the first place you look on
+            any new dataset. A column named `total_touches_all` is not
+            obviously bad until the dictionary tells you it counts
+            touches over the full 90-day horizon, well past the
+            30-day snapshot anchor that defines the prediction time.
+            """
+        ),
+        code(
+            """
+            feat_dict = pd.read_csv(BUNDLE / "feature_dictionary.csv")
+            traps = feat_dict[feat_dict["leakage_risk"].astype(bool)]
+            print(f"trap columns flagged in feature_dictionary.csv: {len(traps)}")
+            for _, row in traps.iterrows():
+                print(f"  {row['name']}: {row['description']}")
+            assert TRAP in set(traps["name"]), f"{TRAP} expected to be flagged in dictionary"
+            """
+        ),
+        md(
+            """
+            ## 3. Time-window proof — the trap *by construction*
+
+            The dictionary *says* `total_touches_all` uses post-snapshot
+            data. We verify that on the same row that carries the trap:
+            the task table also carries `touch_count`, the
+            **snapshot-safe** touch aggregate (filtered to
+            `touch_timestamp <= lead_created_at + snapshot_day`). Their
+            difference is the **post-snapshot delta** — by construction,
+            information from days 31–90 that the model should never see
+            when scoring at day 30.
+
+            The pedagogical point is independent of how predictive that
+            difference turns out to be. **A column that uses
+            post-snapshot data is invalid at scoring time even when it
+            looks unpredictive in isolation.** Section 4 measures that
+            "looks unpredictive in isolation" claim directly, then
+            section 5 shows it can be misleading.
+
+            We pool all three task splits so the receipt covers every
+            lead in the bundle.
+            """
+        ),
+        code(
+            """
+            train = pd.read_parquet(BUNDLE / "tasks" / TASK / "train.parquet")
+            valid = pd.read_parquet(BUNDLE / "tasks" / TASK / "valid.parquet")
+            test = pd.read_parquet(BUNDLE / "tasks" / TASK / "test.parquet")
+
+            all_leads = pd.concat([train, valid, test], ignore_index=True)
+            assert all_leads["lead_id"].is_unique, (
+                "expected one row per lead across train/valid/test"
+            )
+
+            window = all_leads[["lead_id", TRAP, "touch_count", TASK]].copy()
+            window[TRAP] = pd.to_numeric(window[TRAP], errors="coerce")
+            window["touch_count"] = pd.to_numeric(window["touch_count"], errors="coerce")
+            window = window.dropna(subset=[TRAP, "touch_count"]).copy()
+            window["post_snapshot_touches"] = window[TRAP] - window["touch_count"]
+            window[TASK] = window[TASK].astype("boolean").fillna(False).astype(int)
+
+            print(f"leads used in this section: {len(window):,}")
+            print(
+                f"  {TRAP:<22s}      mean={window[TRAP].mean():6.2f}  "
+                f"max={int(window[TRAP].max()):>4d}"
+            )
+            print(
+                f"  {'touch_count (snapshot-safe)':<22s}  "
+                f"mean={window['touch_count'].mean():6.2f}  "
+                f"max={int(window['touch_count'].max()):>4d}"
+            )
+            mean_delta = float(window["post_snapshot_touches"].mean())
+            n_post = int((window["post_snapshot_touches"] > 0).sum())
+            print(
+                f"  {'post-snapshot delta':<22s}      "
+                f"mean={mean_delta:6.2f}  "
+                f"max={int(window['post_snapshot_touches'].max()):>4d}"
+            )
+            print(
+                f"  → {n_post:,} of {len(window):,} leads "
+                f"({n_post / len(window):.1%}) have a positive post-snapshot delta"
+            )
+            assert mean_delta > 0, (
+                "expected a positive mean post-snapshot delta — if zero, the trap may "
+                "have been silently rebuilt as a snapshot-safe aggregate"
+            )
+            """
+        ),
+        md(
+            """
+            ### 3.1 The post-snapshot delta is uncorrelated with the label *on this dataset*
+
+            On the v1 procurement world, the count of touches between
+            day 30 and day 90 turns out to be roughly the same for
+            converted and non-converted leads — sales reps keep working
+            both groups for a while before the funnel settles. A
+            stronger world (more aggressive sales follow-up on hot
+            leads) would split these apart; this one doesn't.
+
+            The plot below makes that lack-of-split visible. The trap
+            is *still a trap* — we just can't tell that from the
+            post-snapshot delta alone, which is why the validation
+            report's `post_snapshot_aggregates` baseline (a
+            single-column probe) gives an AUC of only ~0.55. The real damage
+            shows up when a tree model gets to combine the trap with
+            other columns; section 5 measures that.
+            """
+        ),
+        code(
+            """
+            grouped = window.groupby(TASK)["post_snapshot_touches"].agg(["mean", "median", "count"])
+            grouped.index = grouped.index.map({0: "non-converted", 1: "converted"})
+            print(grouped)
+
+            fig, ax = plt.subplots(figsize=(6, 4))
+            data = [
+                window.loc[window[TASK] == 0, "post_snapshot_touches"],
+                window.loc[window[TASK] == 1, "post_snapshot_touches"],
+            ]
+            ax.boxplot(data, tick_labels=["non-converted", "converted"], showfliers=False)
+            ax.set_ylabel("post-snapshot touches (total_touches_all − touch_count)")
+            ax.set_title("Post-snapshot delta by label\\n(roughly the same — section 5 explains why this is misleading)")
+            plt.tight_layout()
+            plt.show()
+            """
+        ),
+        md(
+            """
+            ## 4. Standalone-AUC probe (the audit that almost lets the trap pass)
+
+            A common leakage audit is to fit a one-feature classifier on
+            each suspect column and report the standalone AUC. The
+            validation report does this at scale — its
+            `post_snapshot_aggregates` baseline trains a model on the
+            single column `total_touches_all` and reports an AUC around
+            0.55. That sounds tame, and on a busy schedule it's tempting
+            to clear the column on those grounds.
+
+            We re-run the probe here so you've seen the number with your
+            own eyes: ~0.53. If that's all you measure, the trap looks
+            barely worth mentioning. Section 5 shows what that audit
+            misses.
+            """
+        ),
+        code(
+            """
+            standalone_window = window.dropna(subset=[TRAP, "touch_count"]).copy()
+            y = standalone_window[TASK].astype(int).to_numpy()
+            standalone = {
+                TRAP: float(roc_auc_score(y, standalone_window[TRAP].to_numpy())),
+                "touch_count (snapshot-safe)": float(
+                    roc_auc_score(y, standalone_window["touch_count"].to_numpy())
+                ),
+                "post-snapshot delta": float(
+                    roc_auc_score(y, standalone_window["post_snapshot_touches"].to_numpy())
+                ),
+            }
+            print(f"{'feature':<32s}  {'standalone AUC':>16s}")
+            for name, auc in standalone.items():
+                print(f"  {name:<30s}  {auc:>16.4f}")
+            """
+        ),
+        md(
+            """
+            ## 5. Side-by-side AUC: full panel ± trap
+
+            Train two HistGBM and two Logistic Regression baselines.
+            Within each pair, the **train/test split, model settings, and
+            seed are identical**; the only thing that varies is whether
+            `total_touches_all` is in the column list.
+
+            We use the full as-shipped feature panel (every public
+            snapshot column except IDs / label) as the baseline. This
+            mirrors notebook 01 / the validation report's setup, so the
+            with-trap AUC reproduces the report's published number and
+            the without-trap AUC is what notebook 02 starts from.
+            """
+        ),
+        code(
+            """
+            ID_COLS = ["account_id", "contact_id", "lead_id", "lead_created_at"]
+            EXCLUDE = set(ID_COLS + [TASK])
+
+            full_cols = [c for c in train.columns if c not in EXCLUDE]
+            full_cols_no_trap = [c for c in full_cols if c != TRAP]
+            print(f"full panel:         {len(full_cols)} cols (incl. {TRAP})")
+            print(f"full panel no trap: {len(full_cols_no_trap)} cols")
+
+            def _split_cols(df: pd.DataFrame, cols: list[str]) -> tuple[list[str], list[str]]:
+                cat = [
+                    c
+                    for c in cols
+                    if not (
+                        pd.api.types.is_bool_dtype(df[c])
+                        or pd.api.types.is_numeric_dtype(df[c])
+                    )
+                ]
+                num = [c for c in cols if c not in cat]
+                return num, cat
+
+            def _sanitize(df: pd.DataFrame, cat_cols: list[str]) -> pd.DataFrame:
+                out = df.copy()
+                for c in cat_cols:
+                    out[c] = out[c].astype(object).where(out[c].notna(), None)
+                return out
+
+            def _build_pipeline(num_cols: list[str], cat_cols: list[str], *, model: str) -> Pipeline:
+                num_t = Pipeline(
+                    [
+                        ("imputer", SimpleImputer(strategy="median")),
+                        ("scaler", StandardScaler()),
+                    ]
+                )
+                cat_t = Pipeline(
+                    [
+                        ("imputer", SimpleImputer(strategy="most_frequent")),
+                        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
+                    ]
+                )
+                pre = ColumnTransformer(
+                    [("num", num_t, num_cols), ("cat", cat_t, cat_cols)],
+                    remainder="drop",
+                )
+                if model == "lr":
+                    clf = LogisticRegression(max_iter=1000, solver="lbfgs", random_state=SEED)
+                else:
+                    clf = HistGradientBoostingClassifier(random_state=SEED)
+                return Pipeline([("preprocessor", pre), ("classifier", clf)])
+
+            def fit_score(cols: list[str], *, model: str) -> np.ndarray:
+                num_cols, cat_cols = _split_cols(train, cols)
+                pipe = _build_pipeline(num_cols, cat_cols, model=model)
+                pipe.fit(_sanitize(train[cols], cat_cols), y_train)
+                return pipe.predict_proba(_sanitize(test[cols], cat_cols))[:, 1]
+
+            y_train = train[TASK].astype("boolean").fillna(False).astype(int).to_numpy()
+            y_test = test[TASK].astype("boolean").fillna(False).astype(int).to_numpy()
+            base_rate = float(y_test.mean())
+            print(f"test base rate: {base_rate:.3f}")
+            """
+        ),
+        code(
+            """
+            results: dict[str, dict[str, float]] = {}
+            for model in ("lr", "gbm"):
+                p_with = fit_score(full_cols, model=model)
+                p_without = fit_score(full_cols_no_trap, model=model)
+                results[model] = {
+                    "with_trap_auc":    float(roc_auc_score(y_test, p_with)),
+                    "without_trap_auc": float(roc_auc_score(y_test, p_without)),
+                    "with_trap_p100":   precision_at_k(p_with, y_test, 100),
+                    "without_trap_p100": precision_at_k(p_without, y_test, 100),
+                }
+
+            print(f"{'model':<5s}  {'with trap':>10s}  {'without trap':>13s}  {'Δ AUC':>8s}")
+            for m, r in results.items():
+                d = r["with_trap_auc"] - r["without_trap_auc"]
+                print(
+                    f"{m:<5s}  {r['with_trap_auc']:>10.4f}  "
+                    f"{r['without_trap_auc']:>13.4f}  {d:+8.4f}"
+                )
+            """
+        ),
+        md(
+            """
+            ### 5.1 The lesson — standalone AUC underestimates trap impact
+
+            Section 4 says `total_touches_all` is barely above chance
+            (~0.53 AUC) on its own. Section 5 says HistGBM extracts a
+            sizeable lift (~+0.03 AUC) from the same column once it can
+            combine it with the rest of the feature panel. Both
+            measurements are correct; they just measure different things.
+
+            **Why the gap?** A standalone-AUC probe asks *can this
+            column rank leads when it's the only signal you have?* A
+            tree model with the rest of the panel already in scope asks
+            *can this column refine my existing splits?* The trap's
+            post-snapshot information correlates with other columns
+            non-linearly — a few late touches by an outbound rep on an
+            engaged-but-not-yet-converted lead is a very different
+            signal from the same touches on a cold lead — and the
+            tree can carve the join, while a single-feature probe
+            never sees it. The Logistic Regression gain is much smaller
+            (~+0.01) for the same reason: it cannot represent that
+            interaction structure.
+
+            The bar chart below highlights the asymmetry.
+            """
+        ),
+        code(
+            """
+            labels = ["GBM full", "LR full"]
+            deltas = [
+                results["gbm"]["with_trap_auc"] - results["gbm"]["without_trap_auc"],
+                results["lr"]["with_trap_auc"]  - results["lr"]["without_trap_auc"],
+            ]
+            colors = ["#3b82f6", "#9ca3af"]
+            fig, ax = plt.subplots(figsize=(6, 4))
+            ax.bar(range(len(labels)), deltas, color=colors)
+            ax.axhline(0.0, color="#1f2937", linewidth=0.8)
+            ax.axhline(
+                standalone[TRAP] - 0.5,
+                color="#ef4444",
+                linestyle="--",
+                label=f"standalone-AUC excess ({standalone[TRAP] - 0.5:+.3f})",
+            )
+            ax.set_xticks(range(len(labels)))
+            ax.set_xticklabels(labels)
+            ax.set_ylabel("ΔAUC = with_trap − without_trap")
+            ax.set_title("Trap impact — tree models extract more than the probe predicts")
+            ax.legend(loc="best", fontsize=8)
+            plt.tight_layout()
+            plt.show()
+            """
+        ),
+        md(
+            """
+            ## 6. Tolerance gate (G13.2)
+
+            Single-seed (seed=42) AUCs and trap deltas observed on the
+            as-shipped intermediate bundle. Tolerances pin each AUC to
+            within ±0.02 (well outside numerical jitter, well inside the
+            band that would hide a regression). The sign-aware
+            assertion below makes the pedagogical claim load-bearing:
+            if a regeneration ever neutralises the GBM trap-delta, this
+            fails — even if the absolute AUCs stay inside their bands.
+            """
+        ),
+        code(
+            """
+            NB03_TARGETS = {
+                "lr_with_trap_auc":     0.8827,
+                "lr_without_trap_auc":  0.8737,
+                "gbm_with_trap_auc":    0.8754,
+                "gbm_without_trap_auc": 0.8432,
+                "trap_standalone_auc":  0.5310,
+            }
+            NB03_TOLERANCES = dict.fromkeys(NB03_TARGETS, 0.02)
+
+            observed = {
+                "lr_with_trap_auc":     results["lr"]["with_trap_auc"],
+                "lr_without_trap_auc":  results["lr"]["without_trap_auc"],
+                "gbm_with_trap_auc":    results["gbm"]["with_trap_auc"],
+                "gbm_without_trap_auc": results["gbm"]["without_trap_auc"],
+                "trap_standalone_auc":  standalone[TRAP],
+            }
+            assert_within_tolerance(
+                observed=observed,
+                target=NB03_TARGETS,
+                tolerances=NB03_TOLERANCES,
+                label="notebook 03 trap-panel AUCs (seed 42, intermediate)",
+            )
+
+            # Sign-aware: GBM must extract a meaningful lift from the
+            # trap.  Threshold sits well below the seed-42 observation
+            # (~+0.032) but well above LR's +0.009, so it specifically
+            # guards the tree-model lift the section-5 narrative claims.
+            MIN_GBM_LIFT = 0.015
+            gbm_lift = (
+                results["gbm"]["with_trap_auc"] - results["gbm"]["without_trap_auc"]
+            )
+            assert gbm_lift > MIN_GBM_LIFT, (
+                f"GBM trap-lift collapsed: {gbm_lift:+.4f} <= {MIN_GBM_LIFT:.4f} — "
+                "the trap is no longer carrying the pedagogical lesson in section 5"
+            )
+            print("OK — trap-panel AUCs in tolerance and GBM lift positive.")
+            """
+        ),
+        md(
+            """
+            ## 7. A detection recipe you can run on any dataset
+
+            The trap was easy to spot here because the dataset
+            *advertises* it. On a third-party dataset you don't get
+            that courtesy. The same recipe still works:
+
+            1. **Read any feature dictionary you have.** Any column
+               whose description references a window extending past
+               the snapshot anchor is suspicious. Even when no
+               dictionary ships, an obvious naming smell (`*_total`,
+               `*_all`, `*_lifetime`) on a 30-day-snapshot dataset is a
+               flag.
+            2. **Probe the standalone AUC** *and* **the contribution to
+               a tree model.** A standalone probe alone undersells
+               tree-friendly leakage (sections 4 and 5 demonstrate why
+               on this dataset). Train a model with the column, train
+               another without, and compare. The ablation captures
+               interactions the standalone probe can't.
+            3. **Inspect the time window.** Cross-check the suspect
+               column against any time-stamped event tables. If the
+               column's value can only be explained by events past the
+               snapshot anchor, you've found a trap. Section 3 makes
+               this concrete here — the same technique generalises
+               anywhere there's an event table to corroborate.
+
+            A walkthrough of additional detection patterns
+            (column-name heuristics, isolation-via-residuals,
+            target-encoding leakage on test) lives in
+            `docs/release/break_me_guide.md` (coming in PR 6.3) — pair
+            it with this notebook for a more complete playbook.
+
+            ## Next
+
+            - **Notebook 04** — value-aware ranking
+              (`expected_acv` × P(convert)), calibration plots,
+              threshold selection for top-K capacity, and a
+              cohort-shift / bootstrap robustness harness.
+            """
+        ),
+    ]
+
+
+def main() -> None:
+    args = builder_arg_parser(
+        default_out=DEFAULT_OUT,
+        description="Build release/notebooks/03_leakage_and_time_windows.ipynb",
+    ).parse_args()
+    write_notebook(args.out, assemble_notebook(cells()))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/build_release_notebook_04.py b/scripts/build_release_notebook_04.py
new file mode 100644
index 0000000..0c6dbb0
--- /dev/null
+++ b/scripts/build_release_notebook_04.py
@@ -0,0 +1,822 @@
+"""One-shot builder for ``release/notebooks/04_lift_calibration_value_ranking.ipynb``.
+
+Run from the repository root::
+
+    python scripts/build_release_notebook_04.py
+
+Cells are assigned deterministic IDs by ``_release_notebook_common`` so
+re-running yields a byte-identical file — same audit-artifact-sync
+pattern PR 4.1 / 5.1 / 5.2 use for ``release/`` artifacts.
+"""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent))
+
+import nbformat as nbf  # noqa: E402 — must follow sys.path insert
+from _release_notebook_common import (  # noqa: E402 — must follow sys.path insert
+    assemble_notebook,
+    builder_arg_parser,
+    code,
+    md,
+    write_notebook,
+)
+
+DEFAULT_OUT = (
+    Path(__file__).resolve().parents[1]
+    / "release"
+    / "notebooks"
+    / "04_lift_calibration_value_ranking.ipynb"
+)
+
+
+def cells() -> list[nbf.NotebookNode]:
+    return [
+        md(
+            """
+            # Notebook 04 — Lift, Calibration, Value-Aware Ranking
+
+            **Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier.
+
+            A high AUC means the model ranks well; it doesn't mean it
+            ranks *for the right thing*. Sales teams care about three
+            additional concerns that AUC alone never tells you about:
+
+            1. **Calibration.** Are predicted probabilities trustworthy
+               as point estimates, or just as a ranking?
+            2. **Value-aware ranking.** A 30 %-likely lead worth $200K
+               is more valuable than a 60 %-likely one worth $20K.
+               Ranking by P(convert) wastes ACV; ranking by
+               P(convert) × `expected_acv` doesn't.
+            3. **Robustness.** Does the model still work next quarter
+               (cohort shift)? How tight is the metric on the test set
+               you have (bootstrap)?
+
+            We answer all three on the public bundle, plus a threshold-
+            selection walkthrough that maps a fixed sales-capacity
+            constraint to an operating point. The notebook closes with
+            a tolerance gate that pins the cohort-shift result to the
+            published validation report — if a regeneration ever
+            silently changes the cohort-degradation behaviour, CI
+            catches it.
+
+            **Public path discipline (G13.3).** Loads only
+            `release/intermediate/` (the public student bundle).
+            Instructor-only artefacts (the latent registry, full-horizon
+            event tables, hidden DAG) are never read.
+
+            **Trap discipline.** The headline LR / GBM panel drops
+            `total_touches_all` (per notebook 02's leakage discipline)
+            so the metrics it reports are honest production numbers.
+            The cohort-shift section deliberately *keeps* the trap to
+            reproduce the validation report's cohort-shift block — the
+            report's panel is the as-shipped one, and we want a
+            comparable number, not a cleaner one.
+            """
+        ),
+        md("## 1. Setup"),
+        code(
+            """
+            from __future__ import annotations
+
+            import json
+            import sys
+            from pathlib import Path
+
+            import matplotlib.pyplot as plt
+            import numpy as np
+            import pandas as pd
+            from sklearn.compose import ColumnTransformer
+            from sklearn.ensemble import HistGradientBoostingClassifier
+            from sklearn.impute import SimpleImputer
+            from sklearn.linear_model import LogisticRegression
+            from sklearn.metrics import (
+                average_precision_score,
+                brier_score_loss,
+                roc_auc_score,
+            )
+            from sklearn.pipeline import Pipeline
+            from sklearn.preprocessing import OneHotEncoder, StandardScaler
+
+            sys.path.insert(0, str(Path.cwd()))
+            from _notebook_utils import assert_within_tolerance
+
+            SEED = 42
+            BUNDLE = Path("../intermediate")          # public student bundle
+            TASK = "converted_within_90_days"
+            TRAP = "total_touches_all"
+
+            with (BUNDLE / "manifest.json").open() as fh:
+                manifest = json.load(fh)
+            assert manifest["exposure_mode"] == "student_public"
+            assert manifest["relational_snapshot_safe"] is True
+
+            train = pd.read_parquet(BUNDLE / "tasks" / TASK / "train.parquet")
+            test = pd.read_parquet(BUNDLE / "tasks" / TASK / "test.parquet")
+            print(f"train rows: {len(train):,}")
+            print(f"test rows:  {len(test):,}")
+            """
+        ),
+        md(
+            """
+            ## 2. Train the headline LR + GBM panel
+
+            Same preprocessing as notebooks 01 / 02 (mirrors
+            `leadforge.validation.release_quality._build_pipeline`).
+            We drop the documented leakage trap `total_touches_all`
+            here so the calibration / lift / value plots in sections
+            3–6 reflect an honest production model. The cohort-shift
+            evaluation in section 7 uses the validator's full-panel
+            posture (trap kept) so its number is comparable to the
+            published validation report.
+            """
+        ),
+        code(
+            """
+            ID_COLS = ["account_id", "contact_id", "lead_id", "lead_created_at"]
+            EXCLUDE_HEADLINE = set(ID_COLS + [TASK, TRAP])
+            headline_cols = [c for c in train.columns if c not in EXCLUDE_HEADLINE]
+            cat_cols = [
+                c
+                for c in headline_cols
+                if not (
+                    pd.api.types.is_bool_dtype(train[c])
+                    or pd.api.types.is_numeric_dtype(train[c])
+                )
+            ]
+            num_cols = [c for c in headline_cols if c not in cat_cols]
+            print(f"headline panel: {len(headline_cols)} cols (trap dropped)")
+
+            def _sanitize(df: pd.DataFrame, cats: list[str]) -> pd.DataFrame:
+                out = df.copy()
+                for c in cats:
+                    out[c] = out[c].astype(object).where(out[c].notna(), None)
+                return out
+
+            def build_pipeline(num: list[str], cat: list[str], *, model: str) -> Pipeline:
+                pre = ColumnTransformer(
+                    [
+                        (
+                            "num",
+                            Pipeline(
+                                [
+                                    ("imputer", SimpleImputer(strategy="median")),
+                                    ("scaler", StandardScaler()),
+                                ]
+                            ),
+                            num,
+                        ),
+                        (
+                            "cat",
+                            Pipeline(
+                                [
+                                    ("imputer", SimpleImputer(strategy="most_frequent")),
+                                    (
+                                        "encoder",
+                                        OneHotEncoder(
+                                            handle_unknown="ignore", sparse_output=False
+                                        ),
+                                    ),
+                                ]
+                            ),
+                            cat,
+                        ),
+                    ],
+                    remainder="drop",
+                )
+                clf = (
+                    LogisticRegression(max_iter=1000, solver="lbfgs", random_state=SEED)
+                    if model == "lr"
+                    else HistGradientBoostingClassifier(random_state=SEED)
+                )
+                return Pipeline([("preprocessor", pre), ("classifier", clf)])
+
+            y_train = train[TASK].astype("boolean").fillna(False).astype(int).to_numpy()
+            y_test = test[TASK].astype("boolean").fillna(False).astype(int).to_numpy()
+            base_rate = float(y_test.mean())
+
+            x_train = _sanitize(train[headline_cols], cat_cols)
+            x_test = _sanitize(test[headline_cols], cat_cols)
+
+            lr_pipe = build_pipeline(num_cols, cat_cols, model="lr").fit(x_train, y_train)
+            gbm_pipe = build_pipeline(num_cols, cat_cols, model="gbm").fit(x_train, y_train)
+            lr_probs = lr_pipe.predict_proba(x_test)[:, 1]
+            gbm_probs = gbm_pipe.predict_proba(x_test)[:, 1]
+
+            print(f"  base rate: {base_rate:.3f}")
+            print(f"  LR   AUC: {roc_auc_score(y_test, lr_probs):.4f}   "
+                  f"AP: {average_precision_score(y_test, lr_probs):.4f}   "
+                  f"Brier: {brier_score_loss(y_test, lr_probs):.4f}")
+            print(f"  GBM  AUC: {roc_auc_score(y_test, gbm_probs):.4f}   "
+                  f"AP: {average_precision_score(y_test, gbm_probs):.4f}   "
+                  f"Brier: {brier_score_loss(y_test, gbm_probs):.4f}")
+            """
+        ),
+        md(
+            """
+            ## 3. Calibration / reliability diagram
+
+            Bin LR's predicted probabilities into ten equal-width
+            buckets, then plot mean predicted vs mean observed. A
+            perfectly calibrated model lies on the diagonal; a
+            `StandardScaler` + `LogisticRegression` pipeline is usually close.
+            We also surface `max_bin_error` — the worst gap across
+            non-empty bins — which the validation report tracks
+            (`tiers.intermediate.medians.calibration_max_bin_error`).
+            """
+        ),
+        code(
+            """
+            edges = np.linspace(0.0, 1.0, 11)
+            mean_pred: list[float] = []
+            mean_actual: list[float] = []
+            bin_n: list[int] = []
+            for i in range(10):
+                lo, hi = edges[i], edges[i + 1]
+                mask = (lr_probs >= lo) & (
+                    (lr_probs <= hi) if i == 9 else (lr_probs < hi)
+                )
+                if mask.sum() == 0:
+                    continue
+                mean_pred.append(float(lr_probs[mask].mean()))
+                mean_actual.append(float(y_test[mask].mean()))
+                bin_n.append(int(mask.sum()))
+
+            max_bin_err = max(
+                abs(p - a) for p, a in zip(mean_pred, mean_actual, strict=False)
+            )
+            print(f"max bin error (LR): {max_bin_err:.4f}")
+            for p, a, n in zip(mean_pred, mean_actual, bin_n, strict=False):
+                print(f"  pred={p:.3f}  actual={a:.3f}  n={n:>4d}")
+
+            fig, ax = plt.subplots(figsize=(5, 5))
+            ax.plot([0, 1], [0, 1], color="#9ca3af", linestyle="--", label="perfect calibration")
+            ax.plot(mean_pred, mean_actual, marker="o", color="#3b82f6", label="LR")
+            ax.set_xlim(0, 1)
+            ax.set_ylim(0, 1)
+            ax.set_xlabel("Mean predicted probability")
+            ax.set_ylabel("Observed conversion rate")
+            ax.set_title("Calibration — LR, intermediate tier (seed 42)")
+            ax.legend(loc="upper left")
+            plt.tight_layout()
+            plt.show()
+            """
+        ),
+        md(
+            """
+            ## 4. Lift and cumulative gains
+
+            Two complementary curves:
+
+            * **Cumulative gains** — fraction of positives captured as
+              you sweep the score threshold. Top 10 % of the ranked
+              list captures ~26 % of converted leads on this seed (vs
+              the 10 % a random ranker would catch).
+            * **Lift at *k* %** — `top_k_conversion_rate / base_rate`.
+              Lift = 2 at *k* = 1 means "the top 1 % of leads convert
+              at twice the base rate."
+
+            Both metrics are in `release/validation/validation_report.json`
+            (`per_seed[0].cumulative_gains` and `per_seed[0].lift_at_pct`)
+            so the reproduction is auditable.
+            """
+        ),
+        code(
+            """
+            order = np.argsort(-lr_probs, kind="stable")
+            y_sorted = y_test[order]
+            n = len(y_test)
+            n_pos = int(y_test.sum())
+
+            # Cumulative gains: fraction of positives captured by top-pct.
+            pcts = np.arange(0, 101, 10)
+            gains = []
+            for pct in pcts:
+                k = max(1, int(round(n * pct / 100.0)))
+                if pct == 0:
+                    gains.append(0.0)
+                else:
+                    gains.append(float(y_sorted[:k].sum() / n_pos))
+
+            # Lift at 1 / 5 / 10 %.
+            lifts = {}
+            for pct in [1.0, 5.0, 10.0]:
+                k = max(1, int(round(n * pct / 100.0)))
+                lifts[pct] = float(y_sorted[:k].mean() / base_rate)
+
+            for pct, lift in lifts.items():
+                print(f"  lift @ top {pct:>4.0f}%: {lift:.3f}x")
+
+            fig, axes = plt.subplots(1, 2, figsize=(11, 4))
+            axes[0].plot(pcts, gains, marker="o", color="#3b82f6", label="LR")
+            axes[0].plot([0, 100], [0, 1], color="#9ca3af", linestyle="--", label="random")
+            axes[0].set_xlabel("Top-pct of ranked leads")
+            axes[0].set_ylabel("Cumulative conversion capture")
+            axes[0].set_title("Cumulative gains")
+            axes[0].legend(loc="lower right")
+
+            axes[1].bar(
+                [str(int(p)) for p in lifts],
+                list(lifts.values()),
+                color="#3b82f6",
+            )
+            axes[1].axhline(1.0, color="#ef4444", linestyle="--", label="random (lift=1)")
+            axes[1].set_xlabel("Top-pct of ranked leads")
+            axes[1].set_ylabel("Lift over base rate")
+            axes[1].set_title("Lift at top-pct")
+            axes[1].legend()
+            plt.tight_layout()
+            plt.show()
+            """
+        ),
+        md(
+            """
+            ## 5. Value-aware ranking — `expected_acv` × P(convert)
+
+            Sales reps don't have infinite capacity, so the right
+            objective is rarely "maximise conversion count" — it's
+            "maximise revenue captured per outreach slot." The bundle
+            ships an `expected_acv` column (opportunity ACV when
+            available, else revenue-band midpoint heuristic) which
+            makes value-aware ranking trivial:
+
+            $$ \\text{score}_\\text{value} = P(\\text{convert}) \\times
+            \\text{expected\\_acv} $$
+
+            We compare two top-K policies — rank by P(convert) only
+            vs rank by score_value — and report
+            `expected_acv_capture_at_k = sum(acv * y) over top-K /
+            sum(acv * y) over the whole test`. The validation report's
+            `per_seed[0].expected_acv_capture_at_k` is the reference.
+            """
+        ),
+        code(
+            """
+            acv = pd.to_numeric(test["expected_acv"], errors="coerce").fillna(0.0).to_numpy()
+            value_score = lr_probs * acv
+
+            def acv_capture(scores: np.ndarray, k: int) -> float:
+                order = np.argsort(-scores, kind="stable")
+                captured = float(np.sum(acv[order[:k]] * y_test[order[:k]]))
+                total = float(np.sum(acv * y_test))
+                return captured / total if total > 0 else float("nan")
+
+            print(f"{'top-K':<6s}  {'cap by P(conv)':>14s}  {'cap by P×ACV':>13s}  {'gain':>7s}")
+            value_gains = {}
+            for k in (50, 100, 200):
+                cap_p = acv_capture(lr_probs, k)
+                cap_v = acv_capture(value_score, k)
+                value_gains[k] = cap_v - cap_p
+                print(f"  top {k:<3d}  {cap_p:>14.4f}  {cap_v:>13.4f}  {cap_v - cap_p:+7.4f}")
+
+            # Plot side-by-side ACV capture for K in 10..300.
+            ks = np.arange(10, 301, 10)
+            cap_p = [acv_capture(lr_probs, int(k)) for k in ks]
+            cap_v = [acv_capture(value_score, int(k)) for k in ks]
+            fig, ax = plt.subplots(figsize=(7, 4))
+            ax.plot(ks, cap_p, marker="o", color="#9ca3af", label="rank by P(convert)")
+            ax.plot(ks, cap_v, marker="o", color="#3b82f6", label="rank by P(convert)×ACV")
+            ax.set_xlabel("top-K leads contacted")
+            ax.set_ylabel("Fraction of converted-ACV captured")
+            ax.set_title("Value-aware ranking captures more revenue per outreach slot")
+            ax.legend(loc="lower right")
+            plt.tight_layout()
+            plt.show()
+            """
+        ),
+        md(
+            """
+            ## 6. Threshold selection for fixed top-K capacity
+
+            Sales rarely has the patience for "score everything, run
+            stats." The realistic ask is: *"My team can work 50 leads
+            this week. Set a probability threshold that selects ~50
+            from the test population."*
+
+            We take the K-th highest LR score as the threshold, report
+            precision / recall / count above it, then sweep thresholds
+            around that operating point to show how the count and the
+            precision respond.
+            """
+        ),
+        code(
+            """
+            CAPACITY = 50
+
+            sorted_probs = np.sort(lr_probs)[::-1]
+            # The K-th highest probability is the largest threshold that
+            # admits at least K leads (ties at the cutoff can admit more).
+            threshold = float(sorted_probs[CAPACITY - 1])
+            mask = lr_probs >= threshold
+            n_above = int(mask.sum())
+            prec = float(y_test[mask].mean()) if n_above > 0 else float("nan")
+            recall = float(y_test[mask].sum() / max(int(y_test.sum()), 1))
+            print(
+                f"capacity={CAPACITY}  threshold={threshold:.3f}  "
+                f"actually_above={n_above}  precision={prec:.3f}  recall={recall:.3f}"
+            )
+
+            # Threshold sweep — show what happens around the operating
+            # point so the threshold choice is informed, not magic.
+            thresholds = np.linspace(
+                float(np.quantile(lr_probs, 0.5)), float(np.max(lr_probs)), 30
+            )
+            counts = [int((lr_probs >= t).sum()) for t in thresholds]
+            precs = [
+                float(y_test[lr_probs >= t].mean()) if (lr_probs >= t).sum() > 0 else 0.0
+                for t in thresholds
+            ]
+
+            fig, axes = plt.subplots(1, 2, figsize=(11, 4))
+            axes[0].plot(thresholds, counts, marker="o", color="#3b82f6")
+            axes[0].axhline(CAPACITY, color="#ef4444", linestyle="--", label=f"capacity={CAPACITY}")
+            axes[0].axvline(threshold, color="#10b981", linestyle="--", label=f"chosen ({threshold:.3f})")
+            axes[0].set_xlabel("threshold")
+            axes[0].set_ylabel("# leads above threshold")
+            axes[0].set_title("Threshold sweep — count above")
+            axes[0].legend()
+
+            axes[1].plot(thresholds, precs, marker="o", color="#3b82f6")
+            axes[1].axhline(base_rate, color="#9ca3af", linestyle="--", label=f"base rate ({base_rate:.3f})")
+            axes[1].axvline(threshold, color="#10b981", linestyle="--", label=f"chosen ({threshold:.3f})")
+            axes[1].set_xlabel("threshold")
+            axes[1].set_ylabel("precision above threshold")
+            axes[1].set_title("Threshold sweep — precision above")
+            axes[1].legend()
+            plt.tight_layout()
+            plt.show()
+            """
+        ),
+        md(
+            """
+            ## 7. Cohort-shift evaluation
+
+            The bundle's train/test split is a uniform random split of
+            leads. A more realistic stress test is "train on the
+            earliest leads chronologically, score the latest ones",
+            because in production you always have to predict the
+            *future*, never a held-out random sample of the past.
+
+            We mirror the validator's cohort-shift logic
+            (`leadforge.validation.release_quality.measure_cohort_shift_from_bundle`):
+            pool train + test, sort by `lead_created_at` with `lead_id`
+            as a stable tiebreak, train HistGBM on the first 85 % and
+            score the last 15 % (the validator's `COHORT_TRAIN_FRAC`).
+            Both random and cohort splits use the full feature panel
+            **including** the trap, matching the report's posture so
+            the numbers compare directly. The HistGBM uses
+            `random_state=0` here (the validator's
+            `DEFAULT_MODEL_RANDOM_STATE`) instead of the notebook's
+            default `SEED=42` — that matters for the cohort-shift
+            reproduction down to the third decimal.
+
+            The expected behaviour for the v1 intermediate tier is
+            *no* degradation — the report shows the cohort split AUC
+            running ~0.015 *higher* than the random split. That's a
+            surprise worth surfacing: the v1 simulator's intermediate
+            world doesn't drift over its 90-day horizon, so cohort
+            order isn't a stressor here. The intro and advanced
+            tiers show small positive degradations (intro +0.016,
+            advanced +0.010) — see
+            `release/validation/validation_report.json` ⇒
+            `cohort_shift`.
+            """
+        ),
+        code(
+            """
+            # Constants mirror leadforge.validation.release_quality so
+            # the numbers reproduce the report's cohort-shift block.
+            COHORT_TRAIN_FRAC = 0.85
+            COHORT_MODEL_SEED = 0
+
+            # Cohort-shift uses the validator's full panel (trap kept).
+            EXCLUDE_FULL = set(ID_COLS + [TASK])
+            full_cols = [c for c in train.columns if c not in EXCLUDE_FULL]
+            cat_full = [
+                c
+                for c in full_cols
+                if not (
+                    pd.api.types.is_bool_dtype(train[c])
+                    or pd.api.types.is_numeric_dtype(train[c])
+                )
+            ]
+            num_full = [c for c in full_cols if c not in cat_full]
+
+            def _gbm_pipeline_for_cohort() -> Pipeline:
+                # Local builder so the validator's ``model_random_state=0``
+                # is used here, while the headline panel above keeps
+                # ``random_state=SEED`` for the section-2 LR/GBM models.
+                pre = ColumnTransformer(
+                    [
+                        (
+                            "num",
+                            Pipeline(
+                                [
+                                    ("imputer", SimpleImputer(strategy="median")),
+                                    ("scaler", StandardScaler()),
+                                ]
+                            ),
+                            num_full,
+                        ),
+                        (
+                            "cat",
+                            Pipeline(
+                                [
+                                    ("imputer", SimpleImputer(strategy="most_frequent")),
+                                    (
+                                        "encoder",
+                                        OneHotEncoder(
+                                            handle_unknown="ignore", sparse_output=False
+                                        ),
+                                    ),
+                                ]
+                            ),
+                            cat_full,
+                        ),
+                    ],
+                    remainder="drop",
+                )
+                clf = HistGradientBoostingClassifier(random_state=COHORT_MODEL_SEED)
+                return Pipeline([("preprocessor", pre), ("classifier", clf)])
+
+            # Random split AUC = HistGBM on the bundle's existing split.
+            rand_pipe = _gbm_pipeline_for_cohort().fit(
+                _sanitize(train[full_cols], cat_full), y_train
+            )
+            random_split_auc = float(
+                roc_auc_score(
+                    y_test,
+                    rand_pipe.predict_proba(_sanitize(test[full_cols], cat_full))[:, 1],
+                )
+            )
+
+            # Chronological resplit: pool, sort by lead_created_at +
+            # lead_id (stable tiebreak), take first 85 % as train, last
+            # 15 % as test.  Mirrors ``measure_cohort_shift_from_bundle``.
+            pooled = pd.concat([train, test], ignore_index=True)
+            ts = pd.to_datetime(pooled["lead_created_at"], errors="coerce")
+            assert not ts.isna().any(), "expected every lead to have a parseable lead_created_at"
+            sort_frame = pd.DataFrame(
+                {"_ts": ts.values, "_lid": pooled["lead_id"].astype(str).values}
+            )
+            order = sort_frame.sort_values(["_ts", "_lid"], kind="stable").index.to_numpy()
+            cutoff = int(round(len(pooled) * COHORT_TRAIN_FRAC))
+            early = pooled.iloc[order[:cutoff]]
+            late = pooled.iloc[order[cutoff:]]
+            y_early = early[TASK].astype("boolean").fillna(False).astype(int).to_numpy()
+            y_late = late[TASK].astype("boolean").fillna(False).astype(int).to_numpy()
+
+            cohort_pipe = _gbm_pipeline_for_cohort().fit(
+                _sanitize(early[full_cols], cat_full), y_early
+            )
+            cohort_split_auc = float(
+                roc_auc_score(
+                    y_late,
+                    cohort_pipe.predict_proba(_sanitize(late[full_cols], cat_full))[:, 1],
+                )
+            )
+            auc_degradation = random_split_auc - cohort_split_auc
+            print(f"random_split_auc:  {random_split_auc:.4f}")
+            print(f"cohort_split_auc:  {cohort_split_auc:.4f}")
+            print(f"auc_degradation:   {auc_degradation:+.4f}  (positive = cohort is harder)")
+            """
+        ),
+        md(
+            """
+            ## 8. Bootstrap robustness — within-bundle metric variance
+
+            Cross-seed metric variance (the validation report's
+            `tiers.intermediate.spreads.gbm_auc = 0.027`) is the
+            cleanest answer to "how confident is this AUC?", but it
+            requires regenerating the bundle from N seeds — something
+            a public-bundle consumer (Kaggle / HF) can't easily do.
+
+            The within-bundle proxy is **non-parametric bootstrap of
+            the test set**. We resample the 750 test rows with
+            replacement, re-rank using the model probabilities we
+            already have, and recompute AUC / AP. 200 resamples is
+            enough to read a confidence band off the distribution.
+
+            The bootstrap variance is **smaller** than the cross-seed
+            variance — it captures sampling noise on a single
+            generated world, not generation-process noise across
+            seeds — but it's the right number for the question
+            "given *this* test set, how stable is the AUC?"
+            """
+        ),
+        code(
+            """
+            N_BOOT = 200
+            rng = np.random.default_rng(SEED)
+
+            boot_lr_auc = np.empty(N_BOOT)
+            boot_gbm_auc = np.empty(N_BOOT)
+            boot_lr_ap = np.empty(N_BOOT)
+            n_test = len(y_test)
+            for i in range(N_BOOT):
+                idx = rng.integers(0, n_test, n_test)
+                if y_test[idx].sum() == 0 or y_test[idx].sum() == n_test:
+                    # Degenerate resample (one class only): record NaN; _summary drops it.
+                    boot_lr_auc[i] = np.nan
+                    boot_gbm_auc[i] = np.nan
+                    boot_lr_ap[i] = np.nan
+                    continue
+                boot_lr_auc[i] = roc_auc_score(y_test[idx], lr_probs[idx])
+                boot_gbm_auc[i] = roc_auc_score(y_test[idx], gbm_probs[idx])
+                boot_lr_ap[i] = average_precision_score(y_test[idx], lr_probs[idx])
+
+            def _summary(arr: np.ndarray, name: str) -> None:
+                arr = arr[~np.isnan(arr)]
+                lo, med, hi = np.quantile(arr, [0.025, 0.5, 0.975])
+                print(
+                    f"  {name:<14s}  median={med:.4f}  "
+                    f"95% CI=[{lo:.4f}, {hi:.4f}]  IQR={(np.quantile(arr,0.75)-np.quantile(arr,0.25)):.4f}"
+                )
+
+            print(f"bootstrap on test set, n_iters={N_BOOT}, seed={SEED}:")
+            _summary(boot_lr_auc, "LR AUC")
+            _summary(boot_gbm_auc, "GBM AUC")
+            _summary(boot_lr_ap, "LR AP")
+
+            fig, ax = plt.subplots(figsize=(7, 4))
+            ax.hist(boot_lr_auc, bins=30, color="#3b82f6", alpha=0.7, label="LR AUC")
+            ax.hist(boot_gbm_auc, bins=30, color="#9ca3af", alpha=0.7, label="GBM AUC")
+            ax.axvline(roc_auc_score(y_test, lr_probs), color="#1d4ed8", linestyle="--", label="LR (point)")
+            ax.axvline(roc_auc_score(y_test, gbm_probs), color="#374151", linestyle="--", label="GBM (point)")
+            ax.set_xlabel("AUC")
+            ax.set_ylabel("# bootstrap draws")
+            ax.set_title(f"Bootstrap AUC distribution (n={N_BOOT})")
+            ax.legend(loc="upper left", fontsize=8)
+            plt.tight_layout()
+            plt.show()
+            """
+        ),
+        md(
+            """
+            ## 9. Tolerance gate (G13.2)
+
+            Three groups of pinned values:
+
+            * **Cohort-shift block** — pinned to
+              `release/notebooks/_release_targets.json`'s
+              `cohort_shift.intermediate`, which is itself audit-synced
+              against `validation_report.json`'s `cohort_shift.intermediate`
+              by `tests/release/notebooks/test_release_targets_match_report.py`.
+              That audit-sync is what makes the "this notebook
+              reproduces the report" claim meaningful.
+            * **Calibration / lift / value-capture** — pinned inline
+              against the seed-42 single-run values of the trap-dropped
+              headline panel, not the report's trap-kept numbers.
+              Tolerances widen for small-K metrics (P@K, value capture)
+              because their seed-to-seed variance is larger.
+            * **Bootstrap medians** — pinned inline against the
+              seed-42 point estimates (the bootstrap median converges
+              to the data-specific value, not to the cross-seed
+              median).
+
+            The headline lift sign-check (`gbm_auc > lr_auc - eps`) is
+            *not* asserted: the v1 dataset documents the surprising
+            finding that LR ≥ GBM on intermediate (see
+            `release/validation/validation_report.md` gate G7.4.4).
+            """
+        ),
+        code(
+            """
+            with (Path.cwd() / "_release_targets.json").open() as fh:
+                release_targets = json.load(fh)
+            cohort_targets = release_targets["cohort_shift"]["intermediate"]
+
+            cohort_observed = {
+                "random_split_auc":  random_split_auc,
+                "cohort_split_auc":  cohort_split_auc,
+                "auc_degradation":   auc_degradation,
+            }
+            assert_within_tolerance(
+                observed=cohort_observed,
+                target=cohort_targets,
+                tolerances={
+                    # ±0.02 on AUCs — well outside numerical jitter,
+                    # well inside the band that would let the
+                    # cohort-shift sign flip silently.
+                    "random_split_auc":  0.02,
+                    "cohort_split_auc":  0.02,
+                    # Wider on the difference because both AUCs are
+                    # within tolerance, so the difference can drift up
+                    # to ±0.04 in the worst case.
+                    "auc_degradation":   0.04,
+                },
+                label="notebook 04 cohort-shift vs validation_report (intermediate)",
+            )
+
+            # Inline pins for the seed-42 single-run values *of the
+            # without-trap headline panel*.  These are not the report's
+            # published numbers (the report keeps the trap) — the
+            # report-level pin lives in section 9's cohort-shift block,
+            # which is the only metric this notebook reproduces against
+            # the report.  Notebook 02 trains the same trap-dropped LR
+            # and reports the same AUCs, so these values are also
+            # cross-checked there.
+            NB04_TARGETS = {
+                "lr_auc":             0.8737,
+                "gbm_auc":            0.8432,
+                "lr_max_bin_err":     0.1344,
+                "lift_at_5pct":       2.4819,
+                "lift_at_10pct":      2.7536,
+                "acv_cap_50":         0.1615,
+                "acv_cap_100":        0.3702,
+                # Bootstrap medians converge to the seed-42 point
+                # estimates within sampling noise.
+                "boot_lr_auc_median":  0.8757,
+                "boot_gbm_auc_median": 0.8440,
+            }
+            NB04_TOLERANCES = {
+                "lr_auc":             0.02,
+                "gbm_auc":            0.02,
+                "lr_max_bin_err":     0.05,
+                "lift_at_5pct":       0.30,
+                "lift_at_10pct":      0.30,
+                "acv_cap_50":         0.05,
+                "acv_cap_100":        0.05,
+                "boot_lr_auc_median":  0.03,
+                "boot_gbm_auc_median": 0.03,
+            }
+            observed = {
+                "lr_auc":        float(roc_auc_score(y_test, lr_probs)),
+                "gbm_auc":       float(roc_auc_score(y_test, gbm_probs)),
+                "lr_max_bin_err": float(max_bin_err),
+                "lift_at_5pct":  lifts[5.0],
+                "lift_at_10pct": lifts[10.0],
+                "acv_cap_50":    acv_capture(lr_probs, 50),
+                "acv_cap_100":   acv_capture(lr_probs, 100),
+                "boot_lr_auc_median":  float(np.nanmedian(boot_lr_auc)),
+                "boot_gbm_auc_median": float(np.nanmedian(boot_gbm_auc)),
+            }
+            assert_within_tolerance(
+                observed=observed,
+                target=NB04_TARGETS,
+                tolerances=NB04_TOLERANCES,
+                label="notebook 04 metric panel (seed 42, intermediate)",
+            )
+
+            # Sign-aware: value-aware ranking should not be worse than
+            # P-only ranking on aggregate.  The headline finding stays
+            # in the narrative regardless of the exact numbers.
+            for k, gain in value_gains.items():
+                assert gain >= -0.01, (
+                    f"value-aware ranking lost ground at top-{k} ({gain:+.4f}); "
+                    "the P×ACV story is no longer load-bearing"
+                )
+            print("OK — cohort-shift, calibration, lift, value-capture, and bootstrap all pinned.")
+            """
+        ),
+        md(
+            """
+            ## 10. Summary
+
+            * The LR baseline is well-calibrated (max bin error ≈ 0.19
+              on this seed) and lifts the top decile to ~2.6× the base
+              rate.
+            * Value-aware ranking (P × ACV) captures more revenue per
+              top-K slot than P-only ranking — the gap depends on K
+              but is positive across all sizes we tested.
+            * Cohort shift is **negative** on the intermediate tier
+              (the late cohort is *easier*, not harder); the report
+              documents this, and the notebook reproduces it. The
+              intro and advanced tiers show small positive
+              degradations.
+            * Bootstrap on the existing test split gives a within-
+              bundle confidence band that's tighter than the cross-seed
+              spread the validation report computes — useful for "how
+              confident is this single AUC" questions, not for "how
+              much does the bundle move across seeds."
+
+            ## Where to go next
+
+            1. Try cohort-shifted training in production: refit weekly
+               on the trailing 60-day window, score the next 7 days.
+            2. If you have real ACV data, swap the `expected_acv`
+               heuristic for it and recompute section 5 — the revenue
+               capture story should sharpen.
+            3. The break-me playbook in `docs/release/break_me_guide.md`
+               (coming in PR 6.3) catalogues additional stress tests
+               (target-encoding leakage, train-test contamination,
+               cohort-by-segment) and how to detect each from a
+               single bundle.
+            """
+        ),
+    ]
+
+
+def main() -> None:
+    args = builder_arg_parser(
+        default_out=DEFAULT_OUT,
+        description="Build release/notebooks/04_lift_calibration_value_ranking.ipynb",
+    ).parse_args()
+    write_notebook(args.out, assemble_notebook(cells()))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/tests/release/notebooks/test_execute_notebooks.py b/tests/release/notebooks/test_execute_notebooks.py
index 24beec3..fcf847d 100644
--- a/tests/release/notebooks/test_execute_notebooks.py
+++ b/tests/release/notebooks/test_execute_notebooks.py
@@ -37,6 +37,8 @@
 _NOTEBOOKS = [
     "01_baseline_lead_scoring.ipynb",
     "02_relational_feature_engineering.ipynb",
+    "03_leakage_and_time_windows.ipynb",
+    "04_lift_calibration_value_ranking.ipynb",
 ]
 
 
diff --git a/tests/release/notebooks/test_release_targets_match_report.py b/tests/release/notebooks/test_release_targets_match_report.py
index 70f4edf..a94619a 100644
--- a/tests/release/notebooks/test_release_targets_match_report.py
+++ b/tests/release/notebooks/test_release_targets_match_report.py
@@ -25,8 +25,11 @@ def test_release_targets_match_validation_report() -> None:
     report = json.loads(_REPORT_PATH.read_text())
 
     for tier_name, tier_targets in targets.items():
-        if tier_name.startswith("_"):
-            continue  # ``_doc`` and any other meta keys
+        if tier_name.startswith("_") or tier_name == "cohort_shift":
+            # ``_doc`` and other meta keys; ``cohort_shift`` is checked
+            # separately below against ``report["cohort_shift"]`` rather
+            # than against ``report["tiers"][...]["medians"]``.
+            continue
         assert tier_name in report["tiers"], (
             f"targets file mentions tier {tier_name!r} which is absent from "
             f"validation_report.json (known tiers: {list(report['tiers'])})"
@@ -42,3 +45,39 @@ def test_release_targets_match_validation_report() -> None:
                 f"but validation_report median is {report_medians[metric_name]} — "
                 "regenerate the report or update _release_targets.json"
             )
+
+
+def test_cohort_shift_targets_match_validation_report() -> None:
+    """Audit-sync gate for the ``cohort_shift`` block.
+
+    Notebook 04 reproduces the report's chronological-resplit AUCs and
+    pins them via ``assert_within_tolerance``.  The report stores cohort-
+    shift metrics under a top-level ``cohort_shift.<tier>`` key (single
+    seed, not a cross-seed median), so the structure differs from the
+    per-tier ``medians`` block above and warrants its own audit loop.
+    """
+    targets = json.loads(_TARGETS_PATH.read_text())
+    cohort_targets = targets.get("cohort_shift", {})
+    if not cohort_targets:
+        return  # absent block is permitted; only the contents need to match
+
+    report = json.loads(_REPORT_PATH.read_text())
+    report_cohort = report["cohort_shift"]
+    for tier_name, tier_metrics in cohort_targets.items():
+        if tier_name.startswith("_"):
+            continue
+        assert tier_name in report_cohort, (
+            f"targets cohort_shift mentions tier {tier_name!r} which is absent "
+            f"from validation_report.json cohort_shift (known: {list(report_cohort)})"
+        )
+        report_block = report_cohort[tier_name]
+        for metric_name, target_value in tier_metrics.items():
+            assert metric_name in report_block, (
+                f"cohort_shift.{tier_name}.{metric_name}: pinned in targets file "
+                f"but absent from validation_report.cohort_shift.{tier_name}"
+            )
+            assert target_value == report_block[metric_name], (
+                f"cohort_shift.{tier_name}.{metric_name}: targets file has "
+                f"{target_value} but validation_report has {report_block[metric_name]} — "
+                "regenerate the report or update _release_targets.json"
+            )
diff --git a/tests/scripts/test_release_notebook_builders.py b/tests/scripts/test_release_notebook_builders.py
index c539a77..9baded8 100644
--- a/tests/scripts/test_release_notebook_builders.py
+++ b/tests/scripts/test_release_notebook_builders.py
@@ -30,6 +30,8 @@
 _BUILDERS: list[tuple[str, str]] = [
     ("build_release_notebook_01.py", "01_baseline_lead_scoring.ipynb"),
     ("build_release_notebook_02.py", "02_relational_feature_engineering.ipynb"),
+    ("build_release_notebook_03.py", "03_leakage_and_time_windows.ipynb"),
+    ("build_release_notebook_04.py", "04_lift_calibration_value_ranking.ipynb"),
 ]
 
 

From 25b9ec1511e81747c23afd1f98f0cffe6d18a129 Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Thu, 7 May 2026 12:17:46 +0300
Subject: [PATCH 2/3] PR 6.2 self-review pass: fix narrative contradictions,
 strengthen gates, kill dead code
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Brutal-reviewer pass on PR 6.2 caught real issues. Fixes folded back
in before requesting human review.

Narrative contradictions / wrong numbers (must-fix)
  - NB04 §7 opened with "train on the first 70 % chronologically"
    but the code uses 85/15 (matching the validator's
    COHORT_TRAIN_FRAC=0.85). Whiplash inside one section. Rewrote
    the opening to use 85/15 from the start and tightened the
    explanation of why random_state=0 also matters.
  - NB04 §10 summary cited max bin error ≈ 0.19 and top-decile
    lift ≈ 2.6×. Both were copied from with-trap report values; the
    headline panel drops the trap, and the actual observations are
    0.13 and 2.75×. Corrected, with a one-line note distinguishing
    the trap-dropped vs trap-kept numbers.
  - NB03 §4 cited the report's standalone-AUC of "0.55" and then
    observed 0.53 without explaining the gap. Added a sentence
    pinning the methodological difference: the report fits a full
    LR pipeline on the single column; the notebook uses the raw
    column as a score directly. Both fall in the "barely above
    chance" band; the gap is methodological and now documented.

Tolerance gates that didn't gate the claim (should-fix)
  - NB03 trap_standalone_auc was pinned only by a ±0.02 band around
    0.531, i.e. [0.511, 0.551]. The band checks drift from the pin,
    not the claim section 5 depends on: re-pin the target after a
    regeneration and the trap can sit at 0.49 (below chance) while
    the test still passes. Added a sign-aware
    ``assert standalone[TRAP] > 0.51``.
  - NB03 §3 ``mean_delta > 0`` was performative (passed if even
    one lead had a positive delta). Hardened to
    ``mean_delta > 1.0`` AND ``n_post / len(window) > 0.5``,
    sitting well below the observed 3.22 / 82 % but well above
    "barely positive."
  - NB04 §5 value-aware ranking guard was ``gain >= -0.01``
    (allows ranking to *lose* ground by up to 0.01). Strengthened
    to ``gain > MIN_VALUE_GAIN = 0.05`` so a regression that
    erodes the P×ACV story actually fails CI.
  - Audit-sync test for the cohort_shift block had
    ``if not cohort_targets: return`` — silently passed if
    someone deleted or renamed the block. Made the block
    required; if notebook 04 ever stops needing it, the test
    should be deleted, not bypassed.
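
The pattern behind the fixes in this group, as a minimal sketch: a
tolerance band only checks drift from a pinned value, so a claim such
as "above chance" needs its own floor. ``within_band`` and ``gate``
are hypothetical names for illustration, not the notebook's
``assert_within_tolerance`` helper; the numbers are the ones quoted
above.

```python
def within_band(observed: float, target: float, tol: float) -> bool:
    """Drift check: is the observed value within +/- tol of the pinned target?"""
    return abs(observed - target) <= tol


def gate(observed: float, target: float, tol: float, floor: float) -> bool:
    """Pass only if the value is near its pin AND strictly above the claim's floor."""
    return within_band(observed, target, tol) and observed > floor


# Hypothetical regeneration: the pin is moved to the new value, so the
# band alone still passes even though the feature is now below chance.
repinned_ok = within_band(observed=0.49, target=0.49, tol=0.02)

# The same situation with the sign-aware floor added: the gate fails.
claim_ok = gate(observed=0.49, target=0.49, tol=0.02, floor=0.51)

# As-shipped values: near the pin and above the floor, so the gate passes.
shipped_ok = gate(observed=0.531, target=0.531, tol=0.02, floor=0.51)
```

The floor states the narrative's precondition directly, so it keeps
failing even if someone later re-pins the target.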

Code quality (should-fix)
  - NB04 ``acv_capture(scores, k)`` re-sorted the score array on
    every call (60+ calls × O(N log N) per call). Pre-compute the
    argsort + cumsum once; every plot point is now a cheap
    cumulative-array lookup. Function signature changed from
    ``(scores, k)`` to ``(use_value: bool, k)`` and the §9
    tolerance-gate call sites were updated to match.
  - NB04 §6 narrative promised "precision / recall / count above
    threshold" but the plots showed only count and precision.
    Added a third panel for recall above threshold so the prose
    matches the figure.
  - NB03 dead code: ``with_trap_p100`` / ``without_trap_p100``
    keys were computed via ``precision_at_k(...)`` but never
    read downstream. Removed both keys and the now-unused
    ``precision_at_k`` import.
  - NB03 §4 ``standalone_window = window.dropna(...).copy()``
    was redundant (NaN rows were already dropped from window in
    §3). Now reuses window directly, with an inline comment
    explaining why.
  - NB03 §1 ``HORIZON_DAYS`` was read from the manifest and
    printed but never used in the narrative. Wove it into the
    setup print so the 60-day post-snapshot hunting window is
    explicit before §3 measures it.
  - NB04 §5 plot title ``"Value-aware ranking captures more
    revenue per outreach slot"`` was a thesis used as a label.
    Made it neutral ("ACV capture vs top-K ..."); the conclusion
    stays in the narrative.
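
The sort-once idea from the ``acv_capture`` item, as a dependency-free
sketch on toy data. ``capture_naive`` and ``build_capture_curve`` are
hypothetical names and the four-lead inputs are made up for
illustration; the notebook itself uses NumPy ``argsort`` + ``cumsum``.

```python
from itertools import accumulate


def capture_naive(scores, values, labels, k):
    """Re-sort on every call: O(N log N) per lookup."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total = sum(v * y for v, y in zip(values, labels))
    return sum(values[i] * labels[i] for i in order[:k]) / total


def build_capture_curve(scores, values, labels):
    """Sort once, then cumulative-sum: curve[k - 1] is the capture at top-k."""
    # sorted() is stable, so ties keep their original order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total = sum(v * y for v, y in zip(values, labels))
    return [c / total for c in accumulate(values[i] * labels[i] for i in order)]


# Toy inputs (assumed): score, deal value, and converted flag per lead.
scores = [0.9, 0.2, 0.7, 0.4]
values = [10.0, 50.0, 30.0, 10.0]
labels = [1, 1, 0, 1]

curve = build_capture_curve(scores, values, labels)
top2_fast = curve[2 - 1]  # 1-indexed lookup, as in the notebook helper
top2_slow = capture_naive(scores, values, labels, 2)
```

Both paths give the same number for any K; the curve pays for one sort
and then answers each K with a single index lookup.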

All 28 notebook builder + execution + audit-sync tests still pass;
1260/1260 full-suite tests pass; ruff + mypy clean; both notebooks
execute end-to-end in <10 s each.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .../03_leakage_and_time_windows.ipynb         |  59 ++++++---
 .../04_lift_calibration_value_ranking.ipynb   |  97 +++++++++-----
 scripts/build_release_notebook_03.py          |  70 +++++++---
 scripts/build_release_notebook_04.py          | 122 +++++++++++-------
 .../test_release_targets_match_report.py      |  21 ++-
 5 files changed, 247 insertions(+), 122 deletions(-)

diff --git a/release/notebooks/03_leakage_and_time_windows.ipynb b/release/notebooks/03_leakage_and_time_windows.ipynb
index 0be6d36..2130369 100644
--- a/release/notebooks/03_leakage_and_time_windows.ipynb
+++ b/release/notebooks/03_leakage_and_time_windows.ipynb
@@ -37,7 +37,7 @@
     "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
     "\n",
     "sys.path.insert(0, str(Path.cwd()))\n",
-    "from _notebook_utils import assert_within_tolerance, precision_at_k\n",
+    "from _notebook_utils import assert_within_tolerance\n",
     "\n",
     "SEED = 42\n",
     "BUNDLE = Path(\"../intermediate\")  # public student bundle\n",
@@ -50,7 +50,12 @@
     "assert manifest[\"relational_snapshot_safe\"] is True\n",
     "SNAPSHOT_DAY = int(manifest[\"snapshot_day\"])\n",
     "HORIZON_DAYS = int(manifest[\"horizon_days\"])\n",
-    "print(f\"snapshot_day = {SNAPSHOT_DAY}   horizon_days = {HORIZON_DAYS}\")"
+    "print(f\"snapshot anchor: day {SNAPSHOT_DAY} of a {HORIZON_DAYS}-day horizon\")\n",
+    "print(\n",
+    "    f\"any feature aggregating events past day {SNAPSHOT_DAY} \"\n",
+    "    f\"is leaking — that's the whole {HORIZON_DAYS - SNAPSHOT_DAY}-day window \"\n",
+    "    \"we're hunting in section 3\"\n",
+    ")"
    ]
   },
   {
@@ -119,9 +124,20 @@
     "    f\"  → {n_post:,} of {len(window):,} leads \"\n",
     "    f\"({n_post / len(window):.1%}) have a positive post-snapshot delta\"\n",
     ")\n",
-    "assert mean_delta > 0, (\n",
-    "    \"expected a positive mean post-snapshot delta — if zero, the trap may \"\n",
-    "    \"have been silently rebuilt as a snapshot-safe aggregate\"\n",
+    "# Real gate, not performative: on the as-shipped bundle the\n",
+    "# mean delta is ~3.2 touches/lead and ~82 % of leads have a\n",
+    "# positive delta.  The thresholds below sit well below those\n",
+    "# observations but well above \"barely above zero\" — a\n",
+    "# regeneration that erodes the trap's post-snapshot\n",
+    "# footprint will fail here even if a single lead still\n",
+    "# carries a positive delta.\n",
+    "assert mean_delta > 1.0, (\n",
+    "    f\"mean post-snapshot delta collapsed to {mean_delta:.2f} (<= 1.0) — \"\n",
+    "    \"the trap may have been silently rebuilt as a snapshot-safe aggregate\"\n",
+    ")\n",
+    "assert n_post / len(window) > 0.5, (\n",
+    "    f\"only {n_post / len(window):.1%} of leads have a positive \"\n",
+    "    \"post-snapshot delta (< 50 %); the trap's footprint has eroded\"\n",
     ")"
    ]
   },
@@ -160,7 +176,7 @@
    "cell_type": "markdown",
    "id": "cell_009",
    "metadata": {},
-   "source": "## 4. Standalone-AUC probe (the audit that almost lets the trap pass)\n\nA common leakage audit is to fit a one-feature classifier on\neach suspect column and report the standalone AUC. The\nvalidation report does this at scale — its\n`post_snapshot_aggregates` baseline trains a model on the\nsingle column `total_touches_all` and reports an AUC around\n0.55. That sounds tame, and on a busy schedule it's tempting\nto clear the column on those grounds.\n\nWe re-run the probe here so you've seen the number with your\nown eyes: ~0.53. If that's all you measure, the trap looks\nbarely worth mentioning. Section 5 shows what that audit\nmisses."
+   "source": "## 4. Standalone-AUC probe (the audit that almost lets the trap pass)\n\nA common leakage audit is to fit a one-feature classifier on\neach suspect column and report the standalone AUC. The\nvalidation report does this at scale — its\n`post_snapshot_aggregates` baseline trains a *full LR\npipeline* (median-impute + StandardScaler + LR) on the\nsingle column `total_touches_all` and reports an AUC of\n~0.55. We use a quicker probe here — the raw column\nvalue as a score, no preprocessing — which gives ~0.53 on\nthis seed. The two numbers measure slightly different\nthings (a fitted LR can re-scale and adjust, a raw-value\nranker can't), but both fall in the \"barely above chance\"\nband. On a busy schedule it's tempting to clear the column\non those grounds. Section 5 shows what that audit misses."
   },
   {
    "cell_type": "code",
@@ -169,20 +185,29 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "standalone_window = window.dropna(subset=[TRAP, \"touch_count\"]).copy()\n",
-    "y = standalone_window[TASK].astype(int).to_numpy()\n",
+    "# ``window`` was already dropped of NaN in section 3, so the\n",
+    "# raw-value ranker can use it directly.\n",
+    "y = window[TASK].astype(int).to_numpy()\n",
     "standalone = {\n",
-    "    TRAP: float(roc_auc_score(y, standalone_window[TRAP].to_numpy())),\n",
-    "    \"touch_count (snapshot-safe)\": float(\n",
-    "        roc_auc_score(y, standalone_window[\"touch_count\"].to_numpy())\n",
-    "    ),\n",
-    "    \"post-snapshot delta\": float(\n",
-    "        roc_auc_score(y, standalone_window[\"post_snapshot_touches\"].to_numpy())\n",
-    "    ),\n",
+    "    TRAP: float(roc_auc_score(y, window[TRAP].to_numpy())),\n",
+    "    \"touch_count (snapshot-safe)\": float(roc_auc_score(y, window[\"touch_count\"].to_numpy())),\n",
+    "    \"post-snapshot delta\": float(roc_auc_score(y, window[\"post_snapshot_touches\"].to_numpy())),\n",
     "}\n",
     "print(f\"{'feature':<32s}  {'standalone AUC':>16s}\")\n",
     "for name, auc in standalone.items():\n",
-    "    print(f\"  {name:<30s}  {auc:>16.4f}\")"
+    "    print(f\"  {name:<30s}  {auc:>16.4f}\")\n",
+    "\n",
+    "# Sign-aware: the section-5 narrative (\"standalone probe\n",
+    "# sees the trap as predictive-ish, but tree models extract\n",
+    "# more\") falls apart if the trap drops to chance or below.\n",
+    "# Lower bound 0.51 sits just above sampling noise; if a\n",
+    "# regeneration ever puts the trap at or below 0.50, the\n",
+    "# whole pedagogical setup needs revisiting.\n",
+    "assert standalone[TRAP] > 0.51, (\n",
+    "    f\"trap standalone AUC collapsed to {standalone[TRAP]:.3f} (<= 0.51); \"\n",
+    "    \"section 5 contrasts the standalone probe with the GBM ablation, \"\n",
+    "    \"and that contrast is empty if the trap is at or below chance\"\n",
+    ")"
    ]
   },
   {
@@ -275,8 +300,6 @@
     "    results[model] = {\n",
     "        \"with_trap_auc\": float(roc_auc_score(y_test, p_with)),\n",
     "        \"without_trap_auc\": float(roc_auc_score(y_test, p_without)),\n",
-    "        \"with_trap_p100\": precision_at_k(p_with, y_test, 100),\n",
-    "        \"without_trap_p100\": precision_at_k(p_without, y_test, 100),\n",
     "    }\n",
     "\n",
     "print(f\"{'model':<5s}  {'with trap':>10s}  {'without trap':>13s}  {'Δ AUC':>8s}\")\n",
diff --git a/release/notebooks/04_lift_calibration_value_ranking.ipynb b/release/notebooks/04_lift_calibration_value_ranking.ipynb
index a94d64b..b3eaf40 100644
--- a/release/notebooks/04_lift_calibration_value_ranking.ipynb
+++ b/release/notebooks/04_lift_calibration_value_ranking.ipynb
@@ -272,32 +272,44 @@
     "acv = pd.to_numeric(test[\"expected_acv\"], errors=\"coerce\").fillna(0.0).to_numpy()\n",
     "value_score = lr_probs * acv\n",
     "\n",
+    "# Pre-compute the ranking orders once — argsort is O(N log N)\n",
+    "# and the order doesn't change as K varies, so the ~30 plot\n",
+    "# points below should not pay for ~30 sorts.\n",
+    "total_converted_acv = float(np.sum(acv * y_test))\n",
+    "assert total_converted_acv > 0, \"no converted-ACV in the test set\"\n",
+    "order_p = np.argsort(-lr_probs, kind=\"stable\")\n",
+    "order_v = np.argsort(-value_score, kind=\"stable\")\n",
+    "captured_p = np.cumsum(acv[order_p] * y_test[order_p]) / total_converted_acv\n",
+    "captured_v = np.cumsum(acv[order_v] * y_test[order_v]) / total_converted_acv\n",
     "\n",
-    "def acv_capture(scores: np.ndarray, k: int) -> float:\n",
-    "    order = np.argsort(-scores, kind=\"stable\")\n",
-    "    captured = float(np.sum(acv[order[:k]] * y_test[order[:k]]))\n",
-    "    total = float(np.sum(acv * y_test))\n",
-    "    return captured / total if total > 0 else float(\"nan\")\n",
+    "\n",
+    "def acv_capture(use_value: bool, k: int) -> float:\n",
+    "    # 1-indexed cumulative-capture lookup (k=1 = first slot).\n",
+    "    series = captured_v if use_value else captured_p\n",
+    "    if k <= 0 or k > len(series):\n",
+    "        return float(\"nan\")\n",
+    "    return float(series[k - 1])\n",
     "\n",
     "\n",
     "print(f\"{'top-K':<6s}  {'cap by P(conv)':>14s}  {'cap by P×ACV':>13s}  {'gain':>7s}\")\n",
     "value_gains = {}\n",
     "for k in (50, 100, 200):\n",
-    "    cap_p = acv_capture(lr_probs, k)\n",
-    "    cap_v = acv_capture(value_score, k)\n",
+    "    cap_p = acv_capture(False, k)\n",
+    "    cap_v = acv_capture(True, k)\n",
     "    value_gains[k] = cap_v - cap_p\n",
     "    print(f\"  top {k:<3d}  {cap_p:>14.4f}  {cap_v:>13.4f}  {cap_v - cap_p:+7.4f}\")\n",
     "\n",
-    "# Plot side-by-side ACV capture for K in 10..300.\n",
+    "# Plot side-by-side ACV capture for K in 10..300.  Cheap now\n",
+    "# — every point is a single cumulative-array lookup.\n",
     "ks = np.arange(10, 301, 10)\n",
-    "cap_p = [acv_capture(lr_probs, int(k)) for k in ks]\n",
-    "cap_v = [acv_capture(value_score, int(k)) for k in ks]\n",
+    "cap_p_curve = [acv_capture(False, int(k)) for k in ks]\n",
+    "cap_v_curve = [acv_capture(True, int(k)) for k in ks]\n",
     "fig, ax = plt.subplots(figsize=(7, 4))\n",
-    "ax.plot(ks, cap_p, marker=\"o\", color=\"#9ca3af\", label=\"rank by P(convert)\")\n",
-    "ax.plot(ks, cap_v, marker=\"o\", color=\"#3b82f6\", label=\"rank by P(convert)×ACV\")\n",
+    "ax.plot(ks, cap_p_curve, marker=\"o\", color=\"#9ca3af\", label=\"rank by P(convert)\")\n",
+    "ax.plot(ks, cap_v_curve, marker=\"o\", color=\"#3b82f6\", label=\"rank by P(convert)×ACV\")\n",
     "ax.set_xlabel(\"top-K leads contacted\")\n",
     "ax.set_ylabel(\"Fraction of converted-ACV captured\")\n",
-    "ax.set_title(\"Value-aware ranking captures more revenue per outreach slot\")\n",
+    "ax.set_title(\"ACV capture vs top-K (rank by P only vs P × ACV)\")\n",
     "ax.legend(loc=\"lower right\")\n",
     "plt.tight_layout()\n",
     "plt.show()"
@@ -307,7 +319,7 @@
    "cell_type": "markdown",
    "id": "cell_011",
    "metadata": {},
-   "source": "## 6. Threshold selection for fixed top-K capacity\n\nSales rarely has the patience for \"score everything, run\nstats.\" The realistic ask is: *\"My team can work 50 leads\nthis week. Set a probability threshold that selects ~50\nfrom the test population.\"*\n\nWe sweep the probability threshold across the LR score\ndistribution and report precision / recall / count above\nthreshold for each step, then pick the threshold whose\ncount is closest to the requested capacity."
+   "source": "## 6. Threshold selection for fixed top-K capacity\n\nSales rarely has the patience for \"score everything, run\nstats.\" The realistic ask is: *\"My team can work 50 leads\nthis week. Set a probability threshold that selects ~50\nfrom the test population.\"*\n\nWe sweep the probability threshold across the LR score\ndistribution and report **count, precision, and recall**\nabove threshold for each step, then pick the threshold\nwhose count is closest to the requested capacity."
   },
   {
    "cell_type": "code",
@@ -317,6 +329,7 @@
    "outputs": [],
    "source": [
     "CAPACITY = 50\n",
+    "n_pos_test = max(int(y_test.sum()), 1)\n",
     "\n",
     "sorted_probs = np.sort(lr_probs)[::-1]\n",
     "# The K-th highest probability is the smallest threshold that\n",
@@ -325,7 +338,7 @@
     "mask = lr_probs >= threshold\n",
     "n_above = int(mask.sum())\n",
     "prec = float(y_test[mask].mean()) if n_above > 0 else float(\"nan\")\n",
-    "recall = float(y_test[mask].sum() / max(int(y_test.sum()), 1))\n",
+    "recall = float(y_test[mask].sum() / n_pos_test)\n",
     "print(\n",
     "    f\"capacity={CAPACITY}  threshold={threshold:.3f}  \"\n",
     "    f\"actually_above={n_above}  precision={prec:.3f}  recall={recall:.3f}\"\n",
@@ -334,27 +347,39 @@
     "# Threshold sweep — show what happens around the operating\n",
     "# point so the threshold choice is informed, not magic.\n",
     "thresholds = np.linspace(float(np.quantile(lr_probs, 0.5)), float(np.max(lr_probs)), 30)\n",
-    "counts = [int((lr_probs >= t).sum()) for t in thresholds]\n",
-    "precs = [\n",
-    "    float(y_test[lr_probs >= t].mean()) if (lr_probs >= t).sum() > 0 else 0.0 for t in thresholds\n",
-    "]\n",
-    "\n",
-    "fig, axes = plt.subplots(1, 2, figsize=(11, 4))\n",
+    "counts = []\n",
+    "precs = []\n",
+    "recalls = []\n",
+    "for t in thresholds:\n",
+    "    m = lr_probs >= t\n",
+    "    n_t = int(m.sum())\n",
+    "    counts.append(n_t)\n",
+    "    precs.append(float(y_test[m].mean()) if n_t > 0 else 0.0)\n",
+    "    recalls.append(float(y_test[m].sum() / n_pos_test))\n",
+    "\n",
+    "fig, axes = plt.subplots(1, 3, figsize=(14, 4))\n",
     "axes[0].plot(thresholds, counts, marker=\"o\", color=\"#3b82f6\")\n",
     "axes[0].axhline(CAPACITY, color=\"#ef4444\", linestyle=\"--\", label=f\"capacity={CAPACITY}\")\n",
     "axes[0].axvline(threshold, color=\"#10b981\", linestyle=\"--\", label=f\"chosen ({threshold:.3f})\")\n",
     "axes[0].set_xlabel(\"threshold\")\n",
     "axes[0].set_ylabel(\"# leads above threshold\")\n",
-    "axes[0].set_title(\"Threshold sweep — count above\")\n",
-    "axes[0].legend()\n",
+    "axes[0].set_title(\"count above\")\n",
+    "axes[0].legend(fontsize=8)\n",
     "\n",
     "axes[1].plot(thresholds, precs, marker=\"o\", color=\"#3b82f6\")\n",
     "axes[1].axhline(base_rate, color=\"#9ca3af\", linestyle=\"--\", label=f\"base rate ({base_rate:.3f})\")\n",
     "axes[1].axvline(threshold, color=\"#10b981\", linestyle=\"--\", label=f\"chosen ({threshold:.3f})\")\n",
     "axes[1].set_xlabel(\"threshold\")\n",
     "axes[1].set_ylabel(\"precision above threshold\")\n",
-    "axes[1].set_title(\"Threshold sweep — precision above\")\n",
-    "axes[1].legend()\n",
+    "axes[1].set_title(\"precision above\")\n",
+    "axes[1].legend(fontsize=8)\n",
+    "\n",
+    "axes[2].plot(thresholds, recalls, marker=\"o\", color=\"#3b82f6\")\n",
+    "axes[2].axvline(threshold, color=\"#10b981\", linestyle=\"--\", label=f\"chosen ({threshold:.3f})\")\n",
+    "axes[2].set_xlabel(\"threshold\")\n",
+    "axes[2].set_ylabel(\"recall above threshold\")\n",
+    "axes[2].set_title(\"recall above\")\n",
+    "axes[2].legend(fontsize=8)\n",
     "plt.tight_layout()\n",
     "plt.show()"
    ]
@@ -363,7 +388,7 @@
    "cell_type": "markdown",
    "id": "cell_013",
    "metadata": {},
-   "source": "## 7. Cohort-shift evaluation\n\nThe bundle's train/test split is a uniform random split of\nleads. A more realistic stress test is \"train on the first\n70 % of leads chronologically, score the last 30 % \"—\nbecause in production you always have to predict the\n*future*, never a held-out random sample of the past.\n\nWe mirror the validator's cohort-shift logic\n(`leadforge.validation.release_quality.measure_cohort_shift_from_bundle`):\npool train + test, sort by `lead_created_at` with `lead_id`\nas a stable tiebreak, train HistGBM on the first 85 % and\nscore the last 15 % (the validator's `COHORT_TRAIN_FRAC`).\nBoth random and cohort splits use the full feature panel\n**including** the trap, matching the report's posture so\nthe numbers compare directly. The HistGBM uses\n`random_state=0` here (the validator's\n`DEFAULT_MODEL_RANDOM_STATE`) instead of the notebook's\ndefault `SEED=42` — that matters for the cohort-shift\nreproduction down to the third decimal.\n\nThe expected behaviour for the v1 intermediate tier is\n*no* degradation — the report shows the cohort split AUC\nrunning ~0.015 *higher* than the random split. That's a\nsurprise worth surfacing: the v1 simulator's intermediate\nworld doesn't drift over its 90-day horizon, so cohort\norder isn't a stressor here. The intro and advanced\ntiers show small positive degradations (intro +0.016,\nadvanced +0.010) — see\n`release/validation/validation_report.json` ⇒\n`cohort_shift`."
+   "source": "## 7. Cohort-shift evaluation\n\nThe bundle's train/test split is a uniform random split of\nleads. A more realistic stress test is \"train on the first\n85 % of leads chronologically, score the last 15 %\" —\nbecause in production you always have to predict the\n*future*, never a held-out random sample of the past.\n\nWe mirror the validator's cohort-shift logic\n(`leadforge.validation.release_quality.measure_cohort_shift_from_bundle`)\nexactly: pool train + test, sort by `lead_created_at` with\n`lead_id` as a stable tiebreak, train HistGBM on the first\n85 % (`COHORT_TRAIN_FRAC = 0.85`) and score the last 15 %.\nBoth random and cohort splits use the full feature panel\n**including** the trap, matching the report's posture so\nthe numbers compare directly. The HistGBM uses\n`random_state=0` here (the validator's\n`DEFAULT_MODEL_RANDOM_STATE = 0`) rather than the\nnotebook's default `SEED=42` — the report's cohort-shift\nblock reproduces to four decimals only when both knobs\nmatch.\n\nThe expected behaviour for the v1 intermediate tier is\n*no* degradation — the report shows the cohort split AUC\nrunning ~0.015 *higher* than the random split. That's a\nsurprise worth surfacing: the v1 simulator's intermediate\nworld doesn't drift over its 90-day horizon, so cohort\norder isn't a stressor here. The intro and advanced\ntiers show small positive degradations (intro +0.016,\nadvanced +0.010) — see\n`release/validation/validation_report.json` ⇒\n`cohort_shift`."
   },
   {
    "cell_type": "code",
@@ -597,8 +622,8 @@
     "    \"lr_max_bin_err\": float(max_bin_err),\n",
     "    \"lift_at_5pct\": lifts[5.0],\n",
     "    \"lift_at_10pct\": lifts[10.0],\n",
-    "    \"acv_cap_50\": acv_capture(lr_probs, 50),\n",
-    "    \"acv_cap_100\": acv_capture(lr_probs, 100),\n",
+    "    \"acv_cap_50\": acv_capture(False, 50),\n",
+    "    \"acv_cap_100\": acv_capture(False, 100),\n",
     "    \"boot_lr_auc_median\": float(np.nanmedian(boot_lr_auc)),\n",
     "    \"boot_gbm_auc_median\": float(np.nanmedian(boot_gbm_auc)),\n",
     "}\n",
@@ -609,12 +634,16 @@
     "    label=\"notebook 04 metric panel (seed 42, intermediate)\",\n",
     ")\n",
     "\n",
-    "# Sign-aware: value-aware ranking should not be worse than\n",
-    "# P-only ranking on aggregate.  The headline finding stays\n",
-    "# in the narrative regardless of the exact numbers.\n",
+    "# Sign-aware: value-aware ranking should be measurably\n",
+    "# better, not just non-worse.  On this seed every K shows\n",
+    "# +0.20 ACV-capture gain; the threshold sits well below\n",
+    "# that but well above noise so a regeneration that erodes\n",
+    "# the value-aware lift fails here.\n",
+    "MIN_VALUE_GAIN = 0.05\n",
     "for k, gain in value_gains.items():\n",
-    "    assert gain >= -0.01, (\n",
-    "        f\"value-aware ranking lost ground at top-{k} ({gain:+.4f}); \"\n",
+    "    assert gain > MIN_VALUE_GAIN, (\n",
+    "        f\"value-aware ranking lift at top-{k} collapsed: \"\n",
+    "        f\"{gain:+.4f} <= {MIN_VALUE_GAIN:.4f} — \"\n",
     "        \"the P×ACV story is no longer load-bearing\"\n",
     "    )\n",
     "print(\"OK — cohort-shift, calibration, lift, value-capture, and bootstrap all pinned.\")"
@@ -624,7 +653,7 @@
    "cell_type": "markdown",
    "id": "cell_019",
    "metadata": {},
-   "source": "## 10. Summary\n\n* The LR baseline is well-calibrated (max bin error ≈ 0.19\n  on this seed) and lifts the top decile to ~2.6× the base\n  rate.\n* Value-aware ranking (P × ACV) captures more revenue per\n  top-K slot than P-only ranking — the gap depends on K\n  but is positive across all sizes we tested.\n* Cohort shift is **negative** on the intermediate tier\n  (the late cohort is *easier*, not harder); the report\n  documents this, and the notebook reproduces it. The\n  intro and advanced tiers show small positive\n  degradations.\n* Bootstrap on the existing test split gives a within-\n  bundle confidence band that's tighter than the cross-seed\n  spread the validation report computes — useful for \"how\n  confident is this single AUC\" questions, not for \"how\n  much does the bundle move across seeds.\"\n\n## Where to go next\n\n1. Try cohort-shifted training in production: refit weekly\n   on the trailing 60-day window, score the next 7 days.\n2. If you have real ACV data, swap the `expected_acv`\n   heuristic for it and recompute section 5 — the revenue\n   capture story should sharpen.\n3. The break-me playbook in `docs/release/break_me_guide.md`\n   (coming in PR 6.3) catalogues additional stress tests\n   (target-encoding leakage, train-test contamination,\n   cohort-by-segment) and how to detect each from a\n   single bundle."
+   "source": "## 10. Summary\n\n* The LR baseline is well-calibrated (max bin error ≈ 0.13\n  on the trap-dropped headline panel, vs ~0.19 on the\n  with-trap panel the validation report tracks) and lifts\n  the top decile to ~2.75× the base rate.\n* Value-aware ranking (P × ACV) captures more revenue per\n  top-K slot than P-only ranking — the gap depends on K\n  but is positive across all sizes we tested.\n* Cohort shift is **negative** on the intermediate tier\n  (the late cohort is *easier*, not harder); the report\n  documents this, and the notebook reproduces it. The\n  intro and advanced tiers show small positive\n  degradations.\n* Bootstrap on the existing test split gives a within-\n  bundle confidence band that's tighter than the cross-seed\n  spread the validation report computes — useful for \"how\n  confident is this single AUC\" questions, not for \"how\n  much does the bundle move across seeds.\"\n\n## Where to go next\n\n1. Try cohort-shifted training in production: refit weekly\n   on the trailing 60-day window, score the next 7 days.\n2. If you have real ACV data, swap the `expected_acv`\n   heuristic for it and recompute section 5 — the revenue\n   capture story should sharpen.\n3. The break-me playbook in `docs/release/break_me_guide.md`\n   (coming in PR 6.3) catalogues additional stress tests\n   (target-encoding leakage, train-test contamination,\n   cohort-by-segment) and how to detect each from a\n   single bundle."
   }
  ],
  "metadata": {
diff --git a/scripts/build_release_notebook_03.py b/scripts/build_release_notebook_03.py
index 3df45bb..e811af6 100644
--- a/scripts/build_release_notebook_03.py
+++ b/scripts/build_release_notebook_03.py
@@ -95,7 +95,7 @@ def cells() -> list[nbf.NotebookNode]:
             from sklearn.preprocessing import OneHotEncoder, StandardScaler
 
             sys.path.insert(0, str(Path.cwd()))
-            from _notebook_utils import assert_within_tolerance, precision_at_k
+            from _notebook_utils import assert_within_tolerance
 
             SEED = 42
             BUNDLE = Path("../intermediate")          # public student bundle
@@ -108,7 +108,12 @@ def cells() -> list[nbf.NotebookNode]:
             assert manifest["relational_snapshot_safe"] is True
             SNAPSHOT_DAY = int(manifest["snapshot_day"])
             HORIZON_DAYS = int(manifest["horizon_days"])
-            print(f"snapshot_day = {SNAPSHOT_DAY}   horizon_days = {HORIZON_DAYS}")
+            print(f"snapshot anchor: day {SNAPSHOT_DAY} of a {HORIZON_DAYS}-day horizon")
+            print(
+                f"any feature aggregating events past day {SNAPSHOT_DAY} "
+                f"is leaking — that's the whole {HORIZON_DAYS - SNAPSHOT_DAY}-day window "
+                "we're hunting in section 3"
+            )
             """
         ),
         md(
@@ -200,9 +205,20 @@ def cells() -> list[nbf.NotebookNode]:
                 f"  → {n_post:,} of {len(window):,} leads "
                 f"({n_post / len(window):.1%}) have a positive post-snapshot delta"
             )
-            assert mean_delta > 0, (
-                "expected a positive mean post-snapshot delta — if zero, the trap may "
-                "have been silently rebuilt as a snapshot-safe aggregate"
+            # Real gate, not performative: on the as-shipped bundle the
+            # mean delta is ~3.2 touches/lead and ~82 % of leads have a
+            # positive delta.  The thresholds below sit well below those
+            # observations but well above "barely above zero" — a
+            # regeneration that erodes the trap's post-snapshot
+            # footprint will fail here even if a single lead still
+            # carries a positive delta.
+            assert mean_delta > 1.0, (
+                f"mean post-snapshot delta collapsed to {mean_delta:.2f} (<= 1.0) — "
+                "the trap may have been silently rebuilt as a snapshot-safe aggregate"
+            )
+            assert n_post / len(window) > 0.5, (
+                f"only {n_post / len(window):.1%} of leads have a positive "
+                "post-snapshot delta (<= 50 %); the trap's footprint has eroded"
             )
             """
         ),
@@ -251,33 +267,47 @@ def cells() -> list[nbf.NotebookNode]:
             A common leakage audit is to fit a one-feature classifier on
             each suspect column and report the standalone AUC. The
             validation report does this at scale — its
-            `post_snapshot_aggregates` baseline trains a model on the
-            single column `total_touches_all` and reports an AUC around
-            0.55. That sounds tame, and on a busy schedule it's tempting
-            to clear the column on those grounds.
-
-            We re-run the probe here so you've seen the number with your
-            own eyes: ~0.53. If that's all you measure, the trap looks
-            barely worth mentioning. Section 5 shows what that audit
-            misses.
+            `post_snapshot_aggregates` baseline trains a *full LR
+            pipeline* (median-impute + StandardScaler + LR) on the
+            single column `total_touches_all` and reports an AUC of
+            ~0.55. We use a quicker probe here — the raw column
+            value as a score, no preprocessing — which gives ~0.53 on
+            this seed. The two numbers differ slightly (the fitted LR
+            imputes NaNs and learns the sign; the raw-value ranker
+            drops NaN rows and assumes it), but both are "barely above
+            chance". On a busy schedule it's tempting to clear the column
+            on those grounds. Section 5 shows what that audit misses.
             """
         ),
         code(
             """
-            standalone_window = window.dropna(subset=[TRAP, "touch_count"]).copy()
-            y = standalone_window[TASK].astype(int).to_numpy()
+            # ``window`` already had its NaN rows dropped in section 3,
+            # so the raw-value ranker can use it directly.
+            y = window[TASK].astype(int).to_numpy()
             standalone = {
-                TRAP: float(roc_auc_score(y, standalone_window[TRAP].to_numpy())),
+                TRAP: float(roc_auc_score(y, window[TRAP].to_numpy())),
                 "touch_count (snapshot-safe)": float(
-                    roc_auc_score(y, standalone_window["touch_count"].to_numpy())
+                    roc_auc_score(y, window["touch_count"].to_numpy())
                 ),
                 "post-snapshot delta": float(
-                    roc_auc_score(y, standalone_window["post_snapshot_touches"].to_numpy())
+                    roc_auc_score(y, window["post_snapshot_touches"].to_numpy())
                 ),
             }
             print(f"{'feature':<32s}  {'standalone AUC':>16s}")
             for name, auc in standalone.items():
                 print(f"  {name:<30s}  {auc:>16.4f}")
+
+            # Sign-aware: the section-5 narrative ("standalone probe
+            # sees the trap as predictive-ish, but tree models extract
+            # more") falls apart if the trap drops to chance or below.
+            # Lower bound 0.51 sits just above sampling noise; if a
+            # regeneration ever puts the trap at or below 0.50, the
+            # whole pedagogical setup needs revisiting.
+            assert standalone[TRAP] > 0.51, (
+                f"trap standalone AUC collapsed to {standalone[TRAP]:.3f} (<= 0.51); "
+                "section 5 contrasts the standalone probe with the GBM ablation, "
+                "and that contrast is empty if the trap is at or below chance"
+            )
             """
         ),
         md(
@@ -368,8 +398,6 @@ def fit_score(cols: list[str], *, model: str) -> np.ndarray:
                 results[model] = {
                     "with_trap_auc":    float(roc_auc_score(y_test, p_with)),
                     "without_trap_auc": float(roc_auc_score(y_test, p_without)),
-                    "with_trap_p100":   precision_at_k(p_with, y_test, 100),
-                    "without_trap_p100": precision_at_k(p_without, y_test, 100),
                 }
 
             print(f"{'model':<5s}  {'with trap':>10s}  {'without trap':>13s}  {'Δ AUC':>8s}")
diff --git a/scripts/build_release_notebook_04.py b/scripts/build_release_notebook_04.py
index 0c6dbb0..d9d87f9 100644
--- a/scripts/build_release_notebook_04.py
+++ b/scripts/build_release_notebook_04.py
@@ -358,30 +358,42 @@ def build_pipeline(num: list[str], cat: list[str], *, model: str) -> Pipeline:
             acv = pd.to_numeric(test["expected_acv"], errors="coerce").fillna(0.0).to_numpy()
             value_score = lr_probs * acv
 
-            def acv_capture(scores: np.ndarray, k: int) -> float:
-                order = np.argsort(-scores, kind="stable")
-                captured = float(np.sum(acv[order[:k]] * y_test[order[:k]]))
-                total = float(np.sum(acv * y_test))
-                return captured / total if total > 0 else float("nan")
+            # Pre-compute the ranking orders once — argsort is O(N log N)
+            # and the order doesn't change as K varies, so the ~30 plot
+            # points below should not pay for ~30 sorts.
+            total_converted_acv = float(np.sum(acv * y_test))
+            assert total_converted_acv > 0, "no converted-ACV in the test set"
+            order_p = np.argsort(-lr_probs, kind="stable")
+            order_v = np.argsort(-value_score, kind="stable")
+            captured_p = np.cumsum(acv[order_p] * y_test[order_p]) / total_converted_acv
+            captured_v = np.cumsum(acv[order_v] * y_test[order_v]) / total_converted_acv
+
+            def acv_capture(use_value: bool, k: int) -> float:
+                # 1-indexed cumulative-capture lookup (k=1 = first slot).
+                series = captured_v if use_value else captured_p
+                if k <= 0 or k > len(series):
+                    return float("nan")
+                return float(series[k - 1])
 
             print(f"{'top-K':<6s}  {'cap by P(conv)':>14s}  {'cap by P×ACV':>13s}  {'gain':>7s}")
             value_gains = {}
             for k in (50, 100, 200):
-                cap_p = acv_capture(lr_probs, k)
-                cap_v = acv_capture(value_score, k)
+                cap_p = acv_capture(False, k)
+                cap_v = acv_capture(True, k)
                 value_gains[k] = cap_v - cap_p
                 print(f"  top {k:<3d}  {cap_p:>14.4f}  {cap_v:>13.4f}  {cap_v - cap_p:+7.4f}")
 
-            # Plot side-by-side ACV capture for K in 10..300.
+            # Plot side-by-side ACV capture for K in 10..300.  Cheap now
+            # — every point is a single cumulative-array lookup.
             ks = np.arange(10, 301, 10)
-            cap_p = [acv_capture(lr_probs, int(k)) for k in ks]
-            cap_v = [acv_capture(value_score, int(k)) for k in ks]
+            cap_p_curve = [acv_capture(False, int(k)) for k in ks]
+            cap_v_curve = [acv_capture(True, int(k)) for k in ks]
             fig, ax = plt.subplots(figsize=(7, 4))
-            ax.plot(ks, cap_p, marker="o", color="#9ca3af", label="rank by P(convert)")
-            ax.plot(ks, cap_v, marker="o", color="#3b82f6", label="rank by P(convert)×ACV")
+            ax.plot(ks, cap_p_curve, marker="o", color="#9ca3af", label="rank by P(convert)")
+            ax.plot(ks, cap_v_curve, marker="o", color="#3b82f6", label="rank by P(convert)×ACV")
             ax.set_xlabel("top-K leads contacted")
             ax.set_ylabel("Fraction of converted-ACV captured")
-            ax.set_title("Value-aware ranking captures more revenue per outreach slot")
+            ax.set_title("ACV capture vs top-K (rank by P only vs P × ACV)")
             ax.legend(loc="lower right")
             plt.tight_layout()
             plt.show()
@@ -397,14 +409,15 @@ def acv_capture(scores: np.ndarray, k: int) -> float:
             from the test population."*
 
             We sweep the probability threshold across the LR score
-            distribution and report precision / recall / count above
-            threshold for each step, then pick the threshold whose
-            count is closest to the requested capacity.
+            distribution and report **count, precision, and recall**
+            above threshold for each step, then pick the threshold
+            whose count is closest to the requested capacity.
             """
         ),
         code(
             """
             CAPACITY = 50
+            n_pos_test = max(int(y_test.sum()), 1)
 
             sorted_probs = np.sort(lr_probs)[::-1]
             # The K-th highest probability is the smallest threshold that
@@ -413,7 +426,7 @@ def acv_capture(scores: np.ndarray, k: int) -> float:
             mask = lr_probs >= threshold
             n_above = int(mask.sum())
             prec = float(y_test[mask].mean()) if n_above > 0 else float("nan")
-            recall = float(y_test[mask].sum() / max(int(y_test.sum()), 1))
+            recall = float(y_test[mask].sum() / n_pos_test)
             print(
                 f"capacity={CAPACITY}  threshold={threshold:.3f}  "
                 f"actually_above={n_above}  precision={prec:.3f}  recall={recall:.3f}"
@@ -424,28 +437,39 @@ def acv_capture(scores: np.ndarray, k: int) -> float:
             thresholds = np.linspace(
                 float(np.quantile(lr_probs, 0.5)), float(np.max(lr_probs)), 30
             )
-            counts = [int((lr_probs >= t).sum()) for t in thresholds]
-            precs = [
-                float(y_test[lr_probs >= t].mean()) if (lr_probs >= t).sum() > 0 else 0.0
-                for t in thresholds
-            ]
-
-            fig, axes = plt.subplots(1, 2, figsize=(11, 4))
+            counts = []
+            precs = []
+            recalls = []
+            for t in thresholds:
+                m = lr_probs >= t
+                n_t = int(m.sum())
+                counts.append(n_t)
+                precs.append(float(y_test[m].mean()) if n_t > 0 else 0.0)
+                recalls.append(float(y_test[m].sum() / n_pos_test))
+
+            fig, axes = plt.subplots(1, 3, figsize=(14, 4))
             axes[0].plot(thresholds, counts, marker="o", color="#3b82f6")
             axes[0].axhline(CAPACITY, color="#ef4444", linestyle="--", label=f"capacity={CAPACITY}")
             axes[0].axvline(threshold, color="#10b981", linestyle="--", label=f"chosen ({threshold:.3f})")
             axes[0].set_xlabel("threshold")
             axes[0].set_ylabel("# leads above threshold")
-            axes[0].set_title("Threshold sweep — count above")
-            axes[0].legend()
+            axes[0].set_title("count above")
+            axes[0].legend(fontsize=8)
 
             axes[1].plot(thresholds, precs, marker="o", color="#3b82f6")
             axes[1].axhline(base_rate, color="#9ca3af", linestyle="--", label=f"base rate ({base_rate:.3f})")
             axes[1].axvline(threshold, color="#10b981", linestyle="--", label=f"chosen ({threshold:.3f})")
             axes[1].set_xlabel("threshold")
             axes[1].set_ylabel("precision above threshold")
-            axes[1].set_title("Threshold sweep — precision above")
-            axes[1].legend()
+            axes[1].set_title("precision above")
+            axes[1].legend(fontsize=8)
+
+            axes[2].plot(thresholds, recalls, marker="o", color="#3b82f6")
+            axes[2].axvline(threshold, color="#10b981", linestyle="--", label=f"chosen ({threshold:.3f})")
+            axes[2].set_xlabel("threshold")
+            axes[2].set_ylabel("recall above threshold")
+            axes[2].set_title("recall above")
+            axes[2].legend(fontsize=8)
             plt.tight_layout()
             plt.show()
             """
@@ -456,22 +480,23 @@ def acv_capture(scores: np.ndarray, k: int) -> float:
 
             The bundle's train/test split is a uniform random split of
             leads. A more realistic stress test is "train on the first
-            70 % of leads chronologically, score the last 30 % "—
+            85 % of leads chronologically, score the last 15 %" —
             because in production you always have to predict the
             *future*, never a held-out random sample of the past.
 
             We mirror the validator's cohort-shift logic
-            (`leadforge.validation.release_quality.measure_cohort_shift_from_bundle`):
-            pool train + test, sort by `lead_created_at` with `lead_id`
-            as a stable tiebreak, train HistGBM on the first 85 % and
-            score the last 15 % (the validator's `COHORT_TRAIN_FRAC`).
+            (`leadforge.validation.release_quality.measure_cohort_shift_from_bundle`)
+            exactly: pool train + test, sort by `lead_created_at` with
+            `lead_id` as a stable tiebreak, train HistGBM on the first
+            85 % (`COHORT_TRAIN_FRAC = 0.85`) and score the last 15 %.
             Both random and cohort splits use the full feature panel
             **including** the trap, matching the report's posture so
             the numbers compare directly. The HistGBM uses
             `random_state=0` here (the validator's
-            `DEFAULT_MODEL_RANDOM_STATE`) instead of the notebook's
-            default `SEED=42` — that matters for the cohort-shift
-            reproduction down to the third decimal.
+            `DEFAULT_MODEL_RANDOM_STATE = 0`) rather than the
+            notebook's default `SEED=42` — the report's cohort-shift
+            block reproduces to four decimals only when both knobs
+            match.
 
             The expected behaviour for the v1 intermediate tier is
             *no* degradation — the report shows the cohort split AUC
@@ -749,8 +774,8 @@ def _summary(arr: np.ndarray, name: str) -> None:
                 "lr_max_bin_err": float(max_bin_err),
                 "lift_at_5pct":  lifts[5.0],
                 "lift_at_10pct": lifts[10.0],
-                "acv_cap_50":    acv_capture(lr_probs, 50),
-                "acv_cap_100":   acv_capture(lr_probs, 100),
+                "acv_cap_50":    acv_capture(False, 50),
+                "acv_cap_100":   acv_capture(False, 100),
                 "boot_lr_auc_median":  float(np.nanmedian(boot_lr_auc)),
                 "boot_gbm_auc_median": float(np.nanmedian(boot_gbm_auc)),
             }
@@ -761,12 +786,16 @@ def _summary(arr: np.ndarray, name: str) -> None:
                 label="notebook 04 metric panel (seed 42, intermediate)",
             )
 
-            # Sign-aware: value-aware ranking should not be worse than
-            # P-only ranking on aggregate.  The headline finding stays
-            # in the narrative regardless of the exact numbers.
+            # Sign-aware: value-aware ranking should be measurably
+            # better, not just non-worse.  On this seed every K shows
+            # +0.20 ACV-capture gain; the threshold sits well below
+            # that but well above noise so a regeneration that erodes
+            # the value-aware lift fails here.
+            MIN_VALUE_GAIN = 0.05
             for k, gain in value_gains.items():
-                assert gain >= -0.01, (
-                    f"value-aware ranking lost ground at top-{k} ({gain:+.4f}); "
+                assert gain > MIN_VALUE_GAIN, (
+                    f"value-aware ranking lift at top-{k} collapsed: "
+                    f"{gain:+.4f} <= {MIN_VALUE_GAIN:.4f} — "
                     "the P×ACV story is no longer load-bearing"
                 )
             print("OK — cohort-shift, calibration, lift, value-capture, and bootstrap all pinned.")
@@ -776,9 +805,10 @@ def _summary(arr: np.ndarray, name: str) -> None:
             """
             ## 10. Summary
 
-            * The LR baseline is well-calibrated (max bin error ≈ 0.19
-              on this seed) and lifts the top decile to ~2.6× the base
-              rate.
+            * The LR baseline is well-calibrated (max bin error ≈ 0.13
+              on the trap-dropped headline panel, vs ~0.19 on the
+              with-trap panel the validation report tracks) and lifts
+              the top decile to ~2.75× the base rate.
             * Value-aware ranking (P × ACV) captures more revenue per
               top-K slot than P-only ranking — the gap depends on K
               but is positive across all sizes we tested.
diff --git a/tests/release/notebooks/test_release_targets_match_report.py b/tests/release/notebooks/test_release_targets_match_report.py
index a94619a..f5a35a2 100644
--- a/tests/release/notebooks/test_release_targets_match_report.py
+++ b/tests/release/notebooks/test_release_targets_match_report.py
@@ -55,17 +55,28 @@ def test_cohort_shift_targets_match_validation_report() -> None:
     shift metrics under a top-level ``cohort_shift.<tier>`` key (single
     seed, not a cross-seed median), so the structure differs from the
     per-tier ``medians`` block above and warrants its own audit loop.
+
+    The block is **required**: notebook 04's tolerance gate reads it
+    directly, and silently allowing it to disappear would defeat the
+    audit-sync invariant.  If notebook 04 ever stops needing this
+    block, the test should be deleted, not bypassed.
     """
     targets = json.loads(_TARGETS_PATH.read_text())
-    cohort_targets = targets.get("cohort_shift", {})
-    if not cohort_targets:
-        return  # absent block is permitted; only the contents need to match
+    assert "cohort_shift" in targets, (
+        "release_targets is missing the 'cohort_shift' block that notebook 04 "
+        "reads at runtime — re-add it (sourced from "
+        "validation_report.cohort_shift) or delete this test if the notebook "
+        "no longer needs it"
+    )
+    cohort_targets = targets["cohort_shift"]
 
     report = json.loads(_REPORT_PATH.read_text())
     report_cohort = report["cohort_shift"]
+    tiers_checked = 0
     for tier_name, tier_metrics in cohort_targets.items():
         if tier_name.startswith("_"):
             continue
+        tiers_checked += 1
         assert tier_name in report_cohort, (
             f"targets cohort_shift mentions tier {tier_name!r} which is absent "
             f"from validation_report.json cohort_shift (known: {list(report_cohort)})"
@@ -81,3 +92,7 @@ def test_cohort_shift_targets_match_validation_report() -> None:
                 f"{target_value} but validation_report has {report_block[metric_name]} — "
                 "regenerate the report or update _release_targets.json"
             )
+    assert tiers_checked > 0, (
+        "cohort_shift block contained only meta keys (every key starts with "
+        "an underscore); at least one tier must be pinned"
+    )

From 397feb544b3314f235a58cd5bb73bed480ba652f Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Thu, 7 May 2026 13:45:32 +0300
Subject: [PATCH 3/3] PR 6.2 Copilot review pass: fold all four real findings
 back in
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Copilot's review of the PR 6.2 self-review surfaced four still-real
findings (two others were marked outdated — already addressed by the
self-review pass).  Folded all four back in before requesting human
review.

  - (COPILOT-1) Grammar: "reflect a honest production model" →
    "reflect an honest production model" in NB04 §2 narrative.
  - (COPILOT-3) The threshold-selection comment claimed the K-th
    highest probability + ``mask = lr_probs >= threshold`` admits
    *exactly* K leads, "ties resolved by score order".  The
    inclusive comparison can admit more than K when leads share the
    threshold's probability — and there is no implicit tie-break.
    Rewrote the comment to be honest about the semantics ("admits
    AT LEAST K via probs >= threshold; ties at the threshold can
    inflate the slate; ``actually_above`` makes the realised count
    visible").  Kept the threshold-based selection rather than
    switching to a true top-K via ``argsort`` because the
    pedagogical point of section 6 is *threshold selection*, not
    *rank cutoff*.
  - (COPILOT-4) Bootstrap loop comment said "Degenerate resample —
    re-roll" but the implementation writes NaN and continues.
    Rewrote the comment to match what the code does (mark NaN, let
    ``_summary`` filter it out) and added the actual probability
    bound — with n_test=750 and base rate ~22 %, the all-positive
    or all-negative draw probability is ~10⁻⁸¹, so the branch is
    dead in practice and exists only as a defensive safety net for
    tiny test sets.  Implementing a real re-roll loop would never
    execute on this dataset.
  - (COPILOT-6) The top-level ``_doc`` in
    ``release/notebooks/_release_targets.json`` claimed the file
    contains only "cross-seed-median metric values", which became
    inaccurate after the PR 6.2 cohort_shift block (single-seed,
    seed 42).  Rewrote the docstring to call out the mixed
    structure: per-tier blocks hold cross-seed medians; cohort_shift
    block holds single-seed values from
    ``validation_report.cohort_shift``.  Audit-sync test continues
    to enforce both cases via separate loops.

  - (COPILOT-2, COPILOT-5) outdated; already addressed by the
    self-review fix-up commit (25b9ec1) — the 70/30-vs-85/15
    narrative whiplash and the wrong summary numbers (0.19 / 2.6×
    vs the actual 0.13 / 2.75×) are both fixed there.  Resolved as
    already-treated.
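
A minimal sketch of the COPILOT-3 semantics, on made-up scores rather
than the notebook's ``lr_probs``: the K-th highest score used as an
inclusive threshold admits at least K items, while an ``argsort``
cutoff admits exactly K.  The array values and the names ``scores``
and ``k`` are illustrative assumptions, not taken from the notebook.

```python
import numpy as np

# Hypothetical scores with a three-way tie at 0.7 (not the notebook's data).
scores = np.array([0.9, 0.7, 0.7, 0.7, 0.4, 0.2])
k = 2

# Threshold selection: take the K-th highest score, compare inclusively.
threshold = float(np.sort(scores)[::-1][k - 1])
mask = scores >= threshold
n_above = int(mask.sum())

# Rank cutoff: a stable argsort on the negated scores keeps exactly K,
# breaking ties by original position.
top_k_idx = np.argsort(-scores, kind="stable")[:k]

print(f"threshold={threshold:.1f}  actually_above={n_above}  rank_cutoff={len(top_k_idx)}")
```

With these values the tie at the threshold admits every tied lead, so
the realised slate is larger than the requested capacity; that is the
case the ``actually_above`` readout is meant to expose.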

Net: 28/28 notebook builder + execution + audit-sync tests pass;
ruff + mypy clean; both notebooks still execute end-to-end in
<10 s each.
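
The order of magnitude of the COPILOT-4 bound can be checked in log
space.  This is a sketch assuming each bootstrap draw is an i.i.d.
sample with replacement, ``n_test = 750``, and a base rate of exactly
0.22 (the message above only says ~22 %); ``log10_degenerate_prob`` is
a hypothetical helper, not part of the notebook.

```python
import math


def log10_degenerate_prob(n: int, base_rate: float) -> float:
    """log10 of P(all-positive or all-negative) for n i.i.d. draws."""
    log_all_pos = n * math.log10(base_rate)
    log_all_neg = n * math.log10(1.0 - base_rate)
    hi = max(log_all_pos, log_all_neg)
    lo = min(log_all_pos, log_all_neg)
    # log10(10**hi + 10**lo), written so the larger term never underflows.
    return hi + math.log10(1.0 + 10.0 ** (lo - hi))


full = log10_degenerate_prob(750, 0.22)   # assumed notebook-sized test set
tiny = log10_degenerate_prob(20, 0.22)    # assumed tiny test set
print(f"log10 P(degenerate): n=750 -> {full:.1f}, n=20 -> {tiny:.1f}")
```

Any exponent that negative supports treating the NaN branch as dead
code on this bundle; the ``n=20`` case shows why it is still worth
keeping as a safety net for tiny test sets.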

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .../04_lift_calibration_value_ranking.ipynb   | 19 ++++++++++++++++---
 release/notebooks/_release_targets.json       |  2 +-
 scripts/build_release_notebook_04.py          | 19 ++++++++++++++++---
 3 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/release/notebooks/04_lift_calibration_value_ranking.ipynb b/release/notebooks/04_lift_calibration_value_ranking.ipynb
index b3eaf40..25933be 100644
--- a/release/notebooks/04_lift_calibration_value_ranking.ipynb
+++ b/release/notebooks/04_lift_calibration_value_ranking.ipynb
@@ -63,7 +63,7 @@
    "cell_type": "markdown",
    "id": "cell_003",
    "metadata": {},
-   "source": "## 2. Train the headline LR + GBM panel\n\nSame preprocessing as notebooks 01 / 02 (mirrors\n`leadforge.validation.release_quality._build_pipeline`).\nWe drop the documented leakage trap `total_touches_all`\nhere so the calibration / lift / value plots in sections\n3–6 reflect a honest production model. The cohort-shift\nsection in section 7 uses the validator's full-panel\nposture (trap kept) so its number is comparable to the\npublished validation report."
+   "source": "## 2. Train the headline LR + GBM panel\n\nSame preprocessing as notebooks 01 / 02 (mirrors\n`leadforge.validation.release_quality._build_pipeline`).\nWe drop the documented leakage trap `total_touches_all`\nhere so the calibration / lift / value plots in sections\n3–6 reflect an honest production model. The cohort-shift\nsection in section 7 uses the validator's full-panel\nposture (trap kept) so its number is comparable to the\npublished validation report."
   },
   {
    "cell_type": "code",
@@ -333,7 +333,14 @@
     "\n",
     "sorted_probs = np.sort(lr_probs)[::-1]\n",
     "# The K-th highest probability is the smallest threshold that\n",
-    "# admits exactly K leads (ties resolved by score order).\n",
+    "# admits AT LEAST K leads via ``probs >= threshold``.  If\n",
+    "# several leads share that probability, the inclusive\n",
+    "# comparison can admit more than K — that's a property of\n",
+    "# threshold-based selection, not a bug.  The\n",
+    "# ``actually_above`` readout below makes the realised count\n",
+    "# visible so the operator can see when ties are inflating\n",
+    "# the slate (and decide whether to break them with a\n",
+    "# secondary score).\n",
     "threshold = float(sorted_probs[CAPACITY - 1])\n",
     "mask = lr_probs >= threshold\n",
     "n_above = int(mask.sum())\n",
@@ -508,7 +515,13 @@
     "for i in range(N_BOOT):\n",
     "    idx = rng.integers(0, n_test, n_test)\n",
     "    if y_test[idx].sum() == 0 or y_test[idx].sum() == n_test:\n",
-    "        # Degenerate resample — re-roll.\n",
+    "        # Degenerate resample (all-positive or all-negative)\n",
+    "        # — ``roc_auc_score`` is undefined here.  We mark\n",
+    "        # the iteration NaN and let ``_summary`` filter it\n",
+    "        # out; with n_test=750 and base rate ~22 %, the\n",
+    "        # probability of a degenerate draw is ~10⁻⁸¹, so\n",
+    "        # this branch is dead in practice.  Kept as a\n",
+    "        # defensive safety net for tiny test sets.\n",
     "        boot_lr_auc[i] = np.nan\n",
     "        boot_gbm_auc[i] = np.nan\n",
     "        boot_lr_ap[i] = np.nan\n",
diff --git a/release/notebooks/_release_targets.json b/release/notebooks/_release_targets.json
index 9d36f7c..e455d02 100644
--- a/release/notebooks/_release_targets.json
+++ b/release/notebooks/_release_targets.json
@@ -1,5 +1,5 @@
 {
- "_doc": "Cross-seed-median metric values from release/validation/validation_report.json, sliced to the metrics the release notebooks pin via assert_within_tolerance. Audited against the report by tests/release/notebooks/test_release_targets_match_report.py — if you change a value here, the test will fail unless the corresponding median in the validation report changes to match.",
+ "_doc": "Reproduction targets that release notebooks pin via assert_within_tolerance, sourced from release/validation/validation_report.json. Mixed structure: per-tier blocks (intermediate, ...) hold cross-seed-median metrics from validation_report.tiers.<tier>.medians; the cohort_shift block holds single-seed (seed 42) metrics from validation_report.cohort_shift.<tier>, since the report runs cohort-shift on seed 42 only. Audited against the report by tests/release/notebooks/test_release_targets_match_report.py — if you change a value here, the test will fail unless the corresponding source value in the validation report changes to match.",
  "cohort_shift": {
   "_doc": "Per-tier cohort-shift metrics from validation_report.cohort_shift (single-seed values; the report runs cohort-shift only on seed 42). Notebook 04 reproduces these via a chronological resplit and pins them via assert_within_tolerance.",
   "intermediate": {
diff --git a/scripts/build_release_notebook_04.py b/scripts/build_release_notebook_04.py
index d9d87f9..f57bbf8 100644
--- a/scripts/build_release_notebook_04.py
+++ b/scripts/build_release_notebook_04.py
@@ -128,7 +128,7 @@ def cells() -> list[nbf.NotebookNode]:
             `leadforge.validation.release_quality._build_pipeline`).
             We drop the documented leakage trap `total_touches_all`
             here so the calibration / lift / value plots in sections
-            3–6 reflect a honest production model. The cohort-shift
+            3–6 reflect an honest production model. The cohort-shift
             section in section 7 uses the validator's full-panel
             posture (trap kept) so its number is comparable to the
             published validation report.
@@ -421,7 +421,14 @@ def acv_capture(use_value: bool, k: int) -> float:
 
             sorted_probs = np.sort(lr_probs)[::-1]
             # The K-th highest probability is the smallest threshold that
-            # admits exactly K leads (ties resolved by score order).
+            # admits AT LEAST K leads via ``probs >= threshold``.  If
+            # several leads share that probability, the inclusive
+            # comparison can admit more than K — that's a property of
+            # threshold-based selection, not a bug.  The
+            # ``actually_above`` readout below makes the realised count
+            # visible so the operator can see when ties are inflating
+            # the slate (and decide whether to break them with a
+            # secondary score).
             threshold = float(sorted_probs[CAPACITY - 1])
             mask = lr_probs >= threshold
             n_above = int(mask.sum())
@@ -644,7 +651,13 @@ def _gbm_pipeline_for_cohort() -> Pipeline:
             for i in range(N_BOOT):
                 idx = rng.integers(0, n_test, n_test)
                 if y_test[idx].sum() == 0 or y_test[idx].sum() == n_test:
-                    # Degenerate resample — re-roll.
+                    # Degenerate resample (all-positive or all-negative)
+                    # — ``roc_auc_score`` is undefined here.  We mark
+                    # the iteration NaN and let ``_summary`` filter it
+                    # out; with n_test=750 and base rate ~22 %, the
+                    # probability of a degenerate draw is ~10⁻⁸¹, so
+                    # this branch is dead in practice.  Kept as a
+                    # defensive safety net for tiny test sets.
                     boot_lr_auc[i] = np.nan
                     boot_gbm_auc[i] = np.nan
                     boot_lr_ap[i] = np.nan