diff --git a/.agent-plan.md b/.agent-plan.md
index 6ee4d2d..f715f35 100644
--- a/.agent-plan.md
+++ b/.agent-plan.md
@@ -58,8 +58,8 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
 - [x] PR 5.2 (self-review pass): brutal review of the first revision caught real bugs the reviewer would otherwise have to call out. Fixes folded into the PR before review:  (#1) `run_packager` validate→write order — both packagers were writing the README/metadata even when validation failed; the validation gate now early-returns with `errors` populated and zero artifacts on disk; new tests exercise the no-write path on both sides.  (#2) Instructor README was inlining the public 3-tier README for a 1-tier dataset; replaced with a dedicated `INSTRUCTOR_BODY` constant that opens by linking to the public dataset, describes only the instructor-specific additions (full-horizon tables, hidden DAG, latent registry, mechanism summary), and uses the single-tier tree block.  (#3) `validate_upload_dir_safe` now also rejects strict descendants of `release_dir` (e.g. `--huggingface-dir release/intro` would otherwise rmtree the intro bundle); allow-list keeps the canonical `release/{kaggle,huggingface,huggingface-instructor}` direct-children.  (#4) `[publish]` extra in `pyproject.toml` (`datasets>=2.14`, `kaggle>=1.6`) makes the gated `load_dataset()` / Kaggle-CLI tests installable in a single command — closes the "G12.3/G12.4 untested in CI" gap to a one-line install.  (#5) Shared-primitives extraction finished: `SOURCE_TREE_BLOCK`, `validate_readme_substitution`, `replace_file`, `replace_dir`, `load_manifest` all moved to `scripts/_release_common.py`; both packagers reduced to imports.  (#6) Hand-rolled YAML renderer (60 lines + brittle quoting heuristic) replaced with `yaml.safe_dump` + a 4-line `_IndentedDumper` subclass that forces indent-2 on top-level sequences.  (#7) Dead `--owner` / `--dataset-slug` CLI flags removed (PR 7.2 will add them when actually needed).  (#8) `assemble_upload_dir` now takes `rendered_readme` as a parameter and writes it itself; the public name no longer lies about producing a complete tree.  (#9) `build_config_for_tier` made pure (no I/O); `_assert_tier_dir_exists` does the cheap manifest-stat preflight.  (#10) `--default-config` with `--variant=instructor` now errors instead of silently ignoring.  (#11) Instructor tree-diagram drops the hardcoded "9 tables" claim.  (#13–#16) Visual cleanups (duplicate divider, ruff-split imports, `COVER_IMAGE_FILENAME`-vs-`Path.name` redundancy, speculative comment about HF split rename).  (#17) Test cruft removed (unused `tmp_path`, dead `tag_lines`); em-dash YAML round-trip parametrised for the instructor `pretty_name`.  Net: 1223/1223 tests pass + 5 gated skips (4 `datasets`-SDK round-trip + 1 Kaggle `kaggle`-SDK from PR 5.1); ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR doesn't touch the bundle shape).
 
 ### Phase 6 — Notebook sequence + adversarial framing
-- [ ] `release/notebooks/{02_relational_feature_engineering,03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb`
-- [ ] Update `01_baseline_lead_scoring.ipynb` to reproduce validation report metrics
+- [x] PR 6.1: `release/notebooks/01_baseline_lead_scoring.ipynb` refreshed and `release/notebooks/02_relational_feature_engineering.ipynb` added.  Notebook 01 trains LR + HistGBM on the public `intermediate` bundle using the **same feature set as the validation report** (drops only IDs and the label, mirrors `release_quality._partition_columns`), so the G13.2 reproduction gate compares apples to apples.  This means notebook 01 **keeps** `total_touches_all` (the documented leakage trap) — narrative cell calls it out explicitly and forward-points to notebook 03 (PR 6.2) which dissects what dropping the trap does to performance.  Notebook 02 by contrast **drops** the trap from the flat baseline so the relational lift attribution stays clean (its goal is teaching feature engineering, not reproducing the report).  Targets are loaded at runtime from `release/notebooks/_release_targets.json` (audit-synced against `release/validation/validation_report.json` by `tests/release/notebooks/test_release_targets_match_report.py`); per-metric tolerances replace the original flat ±0.05 (AUC/Brier ±0.02, AP / top-decile ±0.05).  Notebook 02 loads the seven snapshot-safe public tables, asserts every event-table `timestamp <= lead_created_at + snapshot_day` inline (with real min-headroom-under-cutoff readings, not a hardcoded literal), demonstrates four legal joins (touch-channel breakdown, account-level density fit on **train leads only**, sales-activity recency, train-only industry target encoding), trains LR + GBM on flat-baseline-only and flat+relational features, prints a 4-row metric panel + delta panel, and pins the four model AUCs and the headline `GBM(eng) − GBM(flat)` lift via `assert_within_tolerance` (sign-aware `assert lift > 0` on top of the absolute tolerance).  Honest takeaway cell frames the +0.0147 AUC lift as suggestive, not conclusive (the cross-seed `gbm_auc` spread on this bundle is ~0.027); seed-sweep harness lands in PR 6.2's notebook 04.  Both notebooks ship inside the public release bundle alongside the parquet tables (Kaggle/HF consumers download them together) so they import a sibling `release/notebooks/_notebook_utils.py` rather than rely on the `leadforge` package — `precision_at_k` and `top_decile_rate` mirror `release_quality._precision_at_k` / `_top_decile_rate` (locked in by mirror tests), and `assert_within_tolerance` is hardened against silent passes on non-finite metrics or incomplete per-metric tolerance maps.  G13.1 acceptance gate wired: new `[notebooks]` extra (`nbclient`, `nbformat`, `scikit-learn`, `matplotlib`) and a dedicated `notebooks` CI job that regenerates the intermediate bundle via `python scripts/build_public_release.py release --tier intermediate` (only tier the notebooks need) then nbclient-executes both notebooks end-to-end (`tests/release/notebooks/test_execute_notebooks.py`, parametrised, gated on bundles-present).  G13.3 path discipline enforced inline: notebook 01 hard-codes `BUNDLE = Path("../intermediate")` and asserts `manifest.exposure_mode == "student_public"`; notebook 02 explicitly excludes `customers`/`subscriptions` per `BANNED_TABLES`.  Builders (`scripts/build_release_notebook_{01,02}.py`, sharing `scripts/_release_notebook_common.py`) emit deterministic byte-for-byte notebook JSON via explicit `cell_NNN` IDs (audit-artifact-sync pattern from PR 4.1 / 5.1 / 5.2, locked in by `tests/scripts/test_release_notebook_builders.py` which builds twice into `tmp_path` via the new `--out PATH` flag and diffs against the committed file without ever touching the working tree) and shell out to `ruff format` on the emitted file so builder output and pre-commit hook agree.  Net: 1250/1250 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
+- [ ] `release/notebooks/{03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb`
 - [ ] `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml`
 - [ ] `docs/release/{break_me_guide,v2_decision_log}.md`
 
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 323f858..59f11f1 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -18,7 +18,11 @@ jobs:
       - uses: actions/setup-python@v5
         with:
           python-version: "3.12"
-      - run: pip install ruff
+      # Pin ruff so notebook .ipynb formatting (which the lint job is
+      # strict about) doesn't drift between contributor laptops and CI
+      # the moment a new ruff release ships.  Bump in lock-step with
+      # any local re-runs of ``scripts/build_release_notebook_*.py``.
+      - run: pip install 'ruff==0.15.12'
       - run: ruff check .
       - run: ruff format --check .
 
@@ -138,3 +142,22 @@ jobs:
       - name: Skip v7 (no dataset)
         if: steps.check-v7.outputs.found != 'true'
         run: echo "No v7 datasets found — skipping v7 validation"
+
+  notebooks:
+    name: Execute release notebooks (G13.1)
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - run: pip install -e ".[dev,scripts,notebooks]"
+      - name: Register python3 kernelspec for nbclient
+        run: python -m ipykernel install --user --name python3
+      - name: Build the intermediate public bundle (only tier the notebooks need)
+        run: python scripts/build_public_release.py release --tier intermediate
+      - name: Execute release notebooks end-to-end + builder byte-stability
+        run: |
+          pytest tests/release/notebooks/test_execute_notebooks.py \
+                 tests/scripts/test_release_notebook_builders.py \
+                 -v
diff --git a/pyproject.toml b/pyproject.toml
index b7f1de8..33d2b67 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -54,6 +54,25 @@ publish = [
     "datasets>=2.14",
     "kaggle>=1.6",
 ]
+# Optional dependencies for executing the public release notebooks.
+# Installing this extra (``pip install -e ".[notebooks]"``) enables the
+# G13.1 acceptance gate: the CI ``notebooks`` job nbclient-executes
+# ``release/notebooks/*.ipynb`` end-to-end against a freshly built
+# public bundle, asserting the notebooks reproduce validation_report.md
+# metrics within ±0.05 (G13.2) and never load instructor artefacts
+# (G13.3).
+notebooks = [
+    "nbclient>=0.10",
+    "nbformat>=5.10",
+    # ``ipykernel`` provides the ``python3`` kernelspec that
+    # ``nbclient.NotebookClient(..., kernel_name="python3")`` looks up.
+    # Without it CI fails with ``NoSuchKernel: No such kernel named
+    # python3`` because the GitHub-hosted runner has no kernelspecs
+    # registered out of the box (local dev environments usually do).
+    "ipykernel>=6.0",
+    "scikit-learn>=1.3",
+    "matplotlib>=3.7",
+]
 
 [project.scripts]
 leadforge = "leadforge.cli.main:app"
@@ -74,6 +93,16 @@ select = ["E", "F", "I", "N", "W", "UP", "B", "C4", "PT", "S"]
 
 [tool.ruff.lint.per-file-ignores]
 "tests/**/*" = ["S101", "S108"]
+# Release notebooks deliberately use ``assert`` for contract checks
+# (path discipline, snapshot-safe joins, G13.2 tolerance gate).  ``assert``
+# is the conventional notebook idiom and these are exactly the cells we
+# want to fail loud on regression.
+"release/notebooks/**/*.ipynb" = ["S101"]
+# Notebook builders emit dedented heredocs whose lines render as
+# markdown tables and print-statement output inside the notebook.
+# Line length is a property of the rendered cell, not the .py source,
+# so 100c is the wrong yardstick here.
+"scripts/build_release_notebook_*.py" = ["E501"]
 
 [tool.mypy]
 python_version = "3.11"
diff --git a/release/notebooks/01_baseline_lead_scoring.ipynb b/release/notebooks/01_baseline_lead_scoring.ipynb
index 8861584..e6d6b34 100644
--- a/release/notebooks/01_baseline_lead_scoring.ipynb
+++ b/release/notebooks/01_baseline_lead_scoring.ipynb
@@ -2,354 +2,355 @@
  "cells": [
   {
    "cell_type": "markdown",
+   "id": "cell_000",
    "metadata": {},
-   "source": [
-    "# Baseline Lead Scoring Models\n",
-    "\n",
-    "This notebook trains baseline models on the **LeadForge B2B Lead Scoring** dataset.\n",
-    "It works directly from the pre-generated Parquet files -- no `leadforge` installation required.\n",
-    "\n",
-    "We'll cover:\n",
-    "1. Loading the task splits\n",
-    "2. Exploring the features\n",
-    "3. Training Logistic Regression and Gradient Boosting baselines\n",
-    "4. Evaluating with AUC, PR-AUC, and Precision@K\n",
-    "5. Value-aware ranking (probability vs. expected value)\n",
-    "6. Feature importance\n",
-    "\n",
-    "**Requirements:** `pandas`, `scikit-learn`, `matplotlib` (all available in Kaggle notebooks by default)."
-   ]
+   "source": "# Notebook 01 — Baseline Lead Scoring\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier (the\nrelease default).\n\n**Goal:** train Logistic Regression and Histogram Gradient Boosting\nbaselines on the snapshot-safe public bundle, and verify they\nreproduce the cross-seed-median metrics in\n[`release/validation/validation_report.md`](../validation/validation_report.md)\nwithin the per-metric tolerances fixed by acceptance gate **G13.2**.\n\n**Public path discipline (G13.3).** This notebook reads only from\nthe public bundle at `release/intermediate/`. The instructor\ncompanion (`release/intermediate_instructor/`, with full-horizon\nevent tables, the latent registry, the hidden DAG, and the\nmechanism summary) is **not** loaded — public modelling work must\nnever depend on instructor-only artefacts."
   },
   {
    "cell_type": "markdown",
+   "id": "cell_001",
    "metadata": {},
-   "source": [
-    "## 1. Load the data\n",
-    "\n",
-    "We use the `intermediate` difficulty tier. Change the path to `intro/` or `advanced/` to try other tiers."
-   ]
+   "source": "## 1. Setup"
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "cell_002",
    "metadata": {},
    "outputs": [],
    "source": [
+    "from __future__ import annotations\n",
+    "\n",
+    "import json\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
     "import numpy as np\n",
     "import pandas as pd\n",
+    "from sklearn.compose import ColumnTransformer\n",
+    "from sklearn.ensemble import HistGradientBoostingClassifier\n",
+    "from sklearn.impute import SimpleImputer\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.metrics import (\n",
+    "    average_precision_score,\n",
+    "    brier_score_loss,\n",
+    "    roc_auc_score,\n",
+    ")\n",
+    "from sklearn.pipeline import Pipeline\n",
+    "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
     "\n",
-    "# Adjust this path to your dataset location\n",
-    "BUNDLE = \"../intermediate\"\n",
-    "TASK = \"converted_within_90_days\"\n",
-    "\n",
-    "train = pd.read_parquet(f\"{BUNDLE}/tasks/{TASK}/train.parquet\")\n",
-    "valid = pd.read_parquet(f\"{BUNDLE}/tasks/{TASK}/valid.parquet\")\n",
-    "test = pd.read_parquet(f\"{BUNDLE}/tasks/{TASK}/test.parquet\")\n",
+    "sys.path.insert(0, str(Path.cwd()))\n",
+    "from _notebook_utils import (\n",
+    "    assert_within_tolerance,\n",
+    "    precision_at_k,\n",
+    "    top_decile_rate,\n",
+    ")\n",
     "\n",
-    "print(f\"Train: {len(train):,} rows\")\n",
-    "print(f\"Valid: {len(valid):,} rows\")\n",
-    "print(f\"Test:  {len(test):,} rows\")\n",
-    "print(\"\\nConversion rates:\")\n",
-    "for name, df in [(\"train\", train), (\"valid\", valid), (\"test\", test)]:\n",
-    "    print(f\"  {name}: {df[TASK].mean():.1%}\")"
+    "SEED = 42\n",
+    "BUNDLE = Path(\"../intermediate\")  # public student bundle\n",
+    "TASK = \"converted_within_90_days\""
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "cell_003",
+   "metadata": {},
+   "source": "## 2. Reproduction targets\n\nWe pin the cross-seed-median metrics for the *intermediate* tier\n(seeds 42–46) from `release/validation/validation_report.json`.\nThe targets live in a sibling file\n(`release/notebooks/_release_targets.json`) so they can't drift\nfrom the validation report without an audit-sync test failure\nin CI.\n\n**Per-metric tolerances** are tighter than a flat 5 % band: the\ncross-seed standard deviation in the report is well under 0.02\non AUC and Brier, and a flat ±0.05 would let a regression slip\nthrough. Average-precision and the small-`k` `top_decile_rate`\nstay at ±0.05 because their seed-to-seed variance is larger."
+  },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "cell_004",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Feature dictionary\n",
-    "feat_dict = pd.read_csv(f\"{BUNDLE}/feature_dictionary.csv\")\n",
-    "feat_dict"
+    "with (Path.cwd() / \"_release_targets.json\").open() as fh:\n",
+    "    targets = json.load(fh)[\"intermediate\"]\n",
+    "\n",
+    "# Re-key the validation report's metric names into the metric\n",
+    "# names this notebook prints below, so the gate compares apples\n",
+    "# to apples.\n",
+    "VALIDATION_REPORT_TARGETS = {\n",
+    "    \"lr_auc\": targets[\"lr_auc\"],\n",
+    "    \"gbm_auc\": targets[\"gbm_auc\"],\n",
+    "    \"lr_average_precision\": targets[\"lr_average_precision\"],\n",
+    "    \"lr_brier\": targets[\"brier_score\"],\n",
+    "    \"lr_top_decile_rate\": targets[\"top_decile_rate\"],\n",
+    "}\n",
+    "TOLERANCES = {\n",
+    "    \"lr_auc\": 0.02,  # G13.2 — tighter than a flat 5%\n",
+    "    \"gbm_auc\": 0.02,\n",
+    "    \"lr_average_precision\": 0.05,  # higher seed variance\n",
+    "    \"lr_brier\": 0.02,\n",
+    "    \"lr_top_decile_rate\": 0.05,  # small-k variance\n",
+    "}\n",
+    "for k, v in VALIDATION_REPORT_TARGETS.items():\n",
+    "    print(f\"  target  {k:<24s} {v:.4f}  (tol ±{TOLERANCES[k]:.2f})\")"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell_005",
    "metadata": {},
-   "source": [
-    "## 2. Explore the features"
-   ]
+   "source": "## 3. Load the bundle\n\nWe load the parquet task splits — the canonical format the\nrelease ships in. The accompanying `lead_scoring.csv` is a\nconvenience export with the same rows but coerced dtypes;\nsticking with parquet preserves nullable `Int64` / `Float64` /\n`boolean` columns the way the validator sees them."
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "cell_006",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Identify feature types\n",
-    "TARGET = TASK\n",
-    "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n",
-    "LEAKAGE_COLS = [c for c in train.columns if feat_dict[feat_dict[\"name\"] == c][\"leakage_risk\"].any()]\n",
+    "train = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"train.parquet\")\n",
+    "valid = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"valid.parquet\")\n",
+    "test = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"test.parquet\")\n",
     "\n",
-    "print(f\"Leakage-flagged columns (excluded): {LEAKAGE_COLS}\")\n",
+    "with (BUNDLE / \"manifest.json\").open() as fh:\n",
+    "    manifest = json.load(fh)\n",
     "\n",
-    "feature_cols = [c for c in train.columns if c not in ID_COLS + [TARGET] + LEAKAGE_COLS]\n",
-    "cat_cols = [c for c in feature_cols if train[c].dtype == \"string\" or train[c].dtype == \"object\"]\n",
-    "bool_cols = [c for c in feature_cols if train[c].dtype == \"boolean\"]\n",
-    "num_cols = [c for c in feature_cols if c not in cat_cols + bool_cols]\n",
+    "assert manifest[\"exposure_mode\"] == \"student_public\", \"this notebook expects the public bundle\"\n",
+    "assert manifest[\"relational_snapshot_safe\"] is True\n",
     "\n",
-    "print(f\"\\nCategorical: {len(cat_cols)} -- {cat_cols}\")\n",
-    "print(f\"Boolean:     {len(bool_cols)} -- {bool_cols}\")\n",
-    "print(f\"Numeric:     {len(num_cols)} -- {num_cols}\")"
+    "print(f\"Train: {len(train):,} rows\")\n",
+    "print(f\"Valid: {len(valid):,} rows  (held out — not used here)\")\n",
+    "print(f\"Test:  {len(test):,} rows\")\n",
+    "print()\n",
+    "print(f\"Bundle exposure_mode: {manifest['exposure_mode']}\")\n",
+    "print(f\"Bundle snapshot_day:  {manifest['snapshot_day']}\")\n",
+    "print(f\"Bundle horizon_days:  {manifest['horizon_days']}\")\n",
+    "print()\n",
+    "print(\"Conversion rates:\")\n",
+    "for name, df in [(\"train\", train), (\"valid\", valid), (\"test\", test)]:\n",
+    "    print(f\"  {name}: {df[TASK].mean():.1%}\")"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
+   "cell_type": "markdown",
+   "id": "cell_007",
    "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Missing values\n",
-    "missing = train[feature_cols].isnull().sum()\n",
-    "missing = missing[missing > 0].sort_values(ascending=False)\n",
-    "if len(missing) > 0:\n",
-    "    print(\"Missing values in training set:\")\n",
-    "    for col, count in missing.items():\n",
-    "        print(f\"  {col}: {count} ({count / len(train):.1%})\")\n",
-    "else:\n",
-    "    print(\"No missing values in training set.\")"
-   ]
+   "source": "## 4. Feature selection\n\nWe use the **same feature set as `release/validation/validation_report.json`**\nso the gate in section 7 is a real reproduction check rather\nthan a related-but-different number. That means we drop only\nthe IDs and the label — every other column in `train` (including\n`total_touches_all`, the documented leakage trap) goes into the\npipeline.\n\n**About `total_touches_all`.** The feature dictionary flags it\nwith `leakage_risk = True`: it counts touches over the full\n90-day horizon, which is post-snapshot data. The validation\nreport keeps it in the panel anyway because (a) its standalone\nAUC is barely above 0.55 (see the *post_snapshot_aggregates*\nbaseline column in the report) and (b) the report exists to\nmeasure the v1 dataset's *as-shipped* difficulty, leakage trap\nincluded. **Notebook 03** *(coming in PR 6.2)* walks through\nwhat dropping the trap does to performance and how to detect\nsimilar traps from feature audits alone."
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "cell_008",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Summary statistics for numeric features\n",
-    "train[num_cols].describe().T"
+    "feat_dict = pd.read_csv(BUNDLE / \"feature_dictionary.csv\")\n",
+    "trap_cols = feat_dict.loc[feat_dict[\"leakage_risk\"].astype(bool), \"name\"].tolist()\n",
+    "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n",
+    "# Mirrors ``release_quality._partition_columns`` — IDs + label only.\n",
+    "EXCLUDE = set(ID_COLS + [TASK])\n",
+    "\n",
+    "feature_cols = [c for c in train.columns if c not in EXCLUDE]\n",
+    "cat_cols = [\n",
+    "    c\n",
+    "    for c in feature_cols\n",
+    "    if not (pd.api.types.is_bool_dtype(train[c]) or pd.api.types.is_numeric_dtype(train[c]))\n",
+    "]\n",
+    "num_cols = [c for c in feature_cols if c not in cat_cols]\n",
+    "\n",
+    "print(f\"Leakage-trap columns kept (see narrative above): {trap_cols}\")\n",
+    "print(f\"Categorical features ({len(cat_cols)}): {cat_cols}\")\n",
+    "print(f\"Numeric features    ({len(num_cols)}): {num_cols}\")"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell_009",
    "metadata": {},
-   "source": [
-    "## 3. Build preprocessing pipeline"
-   ]
+   "source": "## 5. Preprocessing pipeline\n\nMirrors `leadforge.validation.release_quality._build_pipeline`\nso the notebook's metric panel and the validation report's\nmetric panel agree by construction:\n\n- numeric: median-impute, then `StandardScaler`\n- categorical: most-frequent-impute, then dense `OneHotEncoder`\n  with `handle_unknown=\"ignore\"`"
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "cell_010",
    "metadata": {},
    "outputs": [],
    "source": [
-    "from sklearn.compose import ColumnTransformer\n",
-    "from sklearn.impute import SimpleImputer\n",
-    "from sklearn.pipeline import Pipeline\n",
-    "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
-    "\n",
-    "\n",
-    "# Convert boolean columns to int for sklearn\n",
-    "def prep_df(df):\n",
-    "    out = df[feature_cols].copy()\n",
-    "    for c in bool_cols:\n",
-    "        out[c] = out[c].astype(\"Int64\")\n",
+    "def _sanitize_categoricals(df: pd.DataFrame) -> pd.DataFrame:\n",
+    "    out = df.copy()\n",
+    "    for c in cat_cols:\n",
+    "        out[c] = out[c].astype(object).where(out[c].notna(), None)\n",
     "    return out\n",
     "\n",
     "\n",
-    "numeric_features = num_cols + bool_cols\n",
-    "categorical_features = cat_cols\n",
+    "x_train = _sanitize_categoricals(train[feature_cols])\n",
+    "x_test = _sanitize_categoricals(test[feature_cols])\n",
+    "y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n",
+    "y_test = test[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n",
     "\n",
-    "preprocessor = ColumnTransformer(\n",
-    "    transformers=[\n",
-    "        (\n",
-    "            \"num\",\n",
-    "            Pipeline(\n",
-    "                [\n",
-    "                    (\"imputer\", SimpleImputer(strategy=\"median\")),\n",
-    "                    (\"scaler\", StandardScaler()),\n",
-    "                ]\n",
-    "            ),\n",
-    "            numeric_features,\n",
-    "        ),\n",
-    "        (\n",
-    "            \"cat\",\n",
-    "            Pipeline(\n",
-    "                [\n",
-    "                    (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n",
-    "                    (\"encoder\", OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)),\n",
-    "                ]\n",
-    "            ),\n",
-    "            categorical_features,\n",
-    "        ),\n",
+    "numeric_t = Pipeline([(\"imputer\", SimpleImputer(strategy=\"median\")), (\"scaler\", StandardScaler())])\n",
+    "categorical_t = Pipeline(\n",
+    "    [\n",
+    "        (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n",
+    "        (\"encoder\", OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)),\n",
     "    ]\n",
     ")\n",
-    "\n",
-    "X_train = prep_df(train)\n",
-    "y_train = train[TARGET].astype(int)\n",
-    "X_test = prep_df(test)\n",
-    "y_test = test[TARGET].astype(int)\n",
-    "\n",
-    "print(f\"X_train: {X_train.shape}, X_test: {X_test.shape}\")"
+    "preprocessor = ColumnTransformer(\n",
+    "    transformers=[\n",
+    "        (\"num\", numeric_t, num_cols),\n",
+    "        (\"cat\", categorical_t, cat_cols),\n",
+    "    ],\n",
+    "    remainder=\"drop\",\n",
+    ")"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell_011",
    "metadata": {},
-   "source": [
-    "## 4. Train baselines and evaluate"
-   ]
+   "source": "## 6. Train baselines and score the test split"
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "cell_012",
    "metadata": {},
    "outputs": [],
    "source": [
-    "from sklearn.ensemble import GradientBoostingClassifier\n",
-    "from sklearn.linear_model import LogisticRegression\n",
-    "from sklearn.metrics import average_precision_score, roc_auc_score\n",
+    "lr_pipe = Pipeline(\n",
+    "    [\n",
+    "        (\"preprocessor\", preprocessor),\n",
+    "        (\n",
+    "            \"classifier\",\n",
+    "            LogisticRegression(max_iter=1000, solver=\"lbfgs\", random_state=SEED),\n",
+    "        ),\n",
+    "    ]\n",
+    ")\n",
+    "gbm_pipe = Pipeline(\n",
+    "    [\n",
+    "        (\"preprocessor\", preprocessor),\n",
+    "        (\"classifier\", HistGradientBoostingClassifier(random_state=SEED)),\n",
+    "    ]\n",
+    ")\n",
     "\n",
-    "models = {\n",
-    "    \"Logistic Regression\": LogisticRegression(max_iter=1000, solver=\"lbfgs\", random_state=42),\n",
-    "    \"Gradient Boosting\": GradientBoostingClassifier(n_estimators=200, random_state=42),\n",
+    "lr_pipe.fit(x_train, y_train)\n",
+    "gbm_pipe.fit(x_train, y_train)\n",
+    "\n",
+    "lr_probs = lr_pipe.predict_proba(x_test)[:, 1]\n",
+    "gbm_probs = gbm_pipe.predict_proba(x_test)[:, 1]\n",
+    "\n",
+    "metrics = {\n",
+    "    \"lr_auc\": float(roc_auc_score(y_test, lr_probs)),\n",
+    "    \"gbm_auc\": float(roc_auc_score(y_test, gbm_probs)),\n",
+    "    \"lr_average_precision\": float(average_precision_score(y_test, lr_probs)),\n",
+    "    \"lr_brier\": float(brier_score_loss(y_test, lr_probs)),\n",
+    "    \"lr_top_decile_rate\": top_decile_rate(lr_probs, y_test),\n",
+    "    # Print-only; not pinned (the validation report tracks\n",
+    "    # ``top_decile_rate`` instead, which we gate above).\n",
+    "    \"lr_precision_at_50\": precision_at_k(lr_probs, y_test, 50),\n",
+    "    \"lr_precision_at_100\": precision_at_k(lr_probs, y_test, 100),\n",
+    "    \"lr_precision_at_200\": precision_at_k(lr_probs, y_test, 200),\n",
     "}\n",
-    "\n",
-    "results = []\n",
-    "fitted_models = {}\n",
-    "\n",
-    "for name, model in models.items():\n",
-    "    pipe = Pipeline([(\"preprocess\", preprocessor), (\"model\", model)])\n",
-    "    pipe.fit(X_train, y_train)\n",
-    "    y_prob = pipe.predict_proba(X_test)[:, 1]\n",
-    "\n",
-    "    auc = roc_auc_score(y_test, y_prob)\n",
-    "    pr_auc = average_precision_score(y_test, y_prob)\n",
-    "\n",
-    "    # Precision@K\n",
-    "    for k in [25, 50, 100]:\n",
-    "        top_k_idx = np.argsort(-y_prob)[:k]\n",
-    "        p_at_k = y_test.iloc[top_k_idx].mean()\n",
-    "        base_rate = y_test.mean()\n",
-    "        lift = p_at_k / base_rate\n",
-    "        results.append(\n",
-    "            {\n",
-    "                \"Model\": name,\n",
-    "                \"Metric\": f\"P@{k}\",\n",
-    "                \"Value\": f\"{p_at_k:.3f}\",\n",
-    "                \"Lift\": f\"{lift:.2f}x\",\n",
-    "            }\n",
-    "        )\n",
-    "\n",
-    "    results.append({\"Model\": name, \"Metric\": \"ROC-AUC\", \"Value\": f\"{auc:.3f}\", \"Lift\": \"\"})\n",
-    "    results.append({\"Model\": name, \"Metric\": \"PR-AUC\", \"Value\": f\"{pr_auc:.3f}\", \"Lift\": \"\"})\n",
-    "    fitted_models[name] = pipe\n",
-    "    print(f\"{name}: AUC={auc:.3f}, PR-AUC={pr_auc:.3f}\")\n",
-    "\n",
-    "pd.DataFrame(results)"
+    "for k, v in metrics.items():\n",
+    "    print(f\"  {k:<24s} {v:.4f}\")"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell_013",
    "metadata": {},
-   "source": [
-    "## 5. Value-aware ranking\n",
-    "\n",
-    "When deals have different sizes (`expected_acv`), ranking by probability alone leaves money on the table.\n",
-    "Ranking by expected value (P(convert) x ACV) captures more revenue in the top-K."
-   ]
+   "source": "## 7. Tolerance check (G13.2)\n\nThe notebook's printed metrics must match the cross-seed medians\nin `validation_report.json` to within the per-metric tolerances\ndeclared in section 2. If a future change breaks this, the\nassertion below fails — and CI catches it, because the same\ncell runs under `nbclient` in the `notebooks` job."
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "cell_014",
    "metadata": {},
    "outputs": [],
    "source": [
-    "if \"expected_acv\" in test.columns:\n",
-    "    best_pipe = fitted_models[\"Gradient Boosting\"]\n",
-    "    y_prob = best_pipe.predict_proba(X_test)[:, 1]\n",
-    "    acv = test[\"expected_acv\"].fillna(test[\"expected_acv\"].median()).values\n",
-    "    ev = y_prob * acv\n",
-    "\n",
-    "    true_acv = test[\"expected_acv\"].fillna(0).values\n",
-    "    converted = y_test.values.astype(bool)\n",
-    "\n",
-    "    for k in [25, 50, 100]:\n",
-    "        # Probability ranking\n",
-    "        prob_top_k = np.argsort(-y_prob)[:k]\n",
-    "        prob_acv = true_acv[prob_top_k][converted[prob_top_k]].sum()\n",
-    "\n",
-    "        # EV ranking\n",
-    "        ev_top_k = np.argsort(-ev)[:k]\n",
-    "        ev_acv = true_acv[ev_top_k][converted[ev_top_k]].sum()\n",
-    "\n",
-    "        uplift = (ev_acv - prob_acv) / prob_acv * 100 if prob_acv > 0 else 0\n",
-    "        print(\n",
-    "            f\"K={k}: Prob ranking ${prob_acv:,.0f} | \"\n",
-    "            f\"EV ranking ${ev_acv:,.0f} | Uplift: {uplift:+.1f}%\"\n",
-    "        )\n",
-    "else:\n",
-    "    print(\"No expected_acv column found.\")"
+    "assert_within_tolerance(\n",
+    "    observed=metrics,\n",
+    "    target=VALIDATION_REPORT_TARGETS,\n",
+    "    tolerances=TOLERANCES,\n",
+    "    label=\"notebook 01 vs validation_report.json (intermediate tier)\",\n",
+    ")\n",
+    "print(\"OK — all gated metrics are within their per-metric tolerance.\")"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell_015",
    "metadata": {},
-   "source": [
-    "## 6. Feature importance (Gradient Boosting)"
-   ]
+   "source": "## 8. Decile lift chart\n\nStandard sanity-check for ranking quality: sort the test set by\nscore, bucket into deciles, plot the per-decile conversion rate\nvs the base rate."
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "cell_016",
    "metadata": {},
    "outputs": [],
    "source": [
     "import matplotlib.pyplot as plt\n",
     "\n",
-    "gbm_pipe = fitted_models[\"Gradient Boosting\"]\n",
-    "gbm_model = gbm_pipe.named_steps[\"model\"]\n",
-    "preproc = gbm_pipe.named_steps[\"preprocess\"]\n",
-    "\n",
-    "# Get feature names after encoding\n",
-    "num_names = numeric_features\n",
-    "cat_encoder = preproc.named_transformers_[\"cat\"].named_steps[\"encoder\"]\n",
-    "cat_names = list(cat_encoder.get_feature_names_out(categorical_features))\n",
-    "all_names = num_names + cat_names\n",
-    "\n",
-    "importances = gbm_model.feature_importances_\n",
-    "feat_imp = pd.Series(importances, index=all_names).sort_values(ascending=False)\n",
-    "\n",
-    "top_n = 15\n",
-    "fig, ax = plt.subplots(figsize=(8, 5))\n",
-    "feat_imp.head(top_n).plot.barh(ax=ax)\n",
-    "ax.set_xlabel(\"Importance\")\n",
-    "ax.set_title(f\"Top {top_n} Features (Gradient Boosting)\")\n",
-    "ax.invert_yaxis()\n",
+    "order = np.argsort(-lr_probs, kind=\"stable\")\n",
+    "y_sorted = y_test[order]\n",
+    "n = len(y_test)\n",
+    "edges = np.linspace(0, n, 11, dtype=int)\n",
+    "decile_rate = np.array([y_sorted[edges[i] : edges[i + 1]].mean() for i in range(10)])\n",
+    "base_rate = y_test.mean()\n",
+    "\n",
+    "fig, ax = plt.subplots(figsize=(7, 4))\n",
+    "ax.bar(range(1, 11), decile_rate, color=\"#3b82f6\")\n",
+    "ax.axhline(\n",
+    "    base_rate,\n",
+    "    color=\"#ef4444\",\n",
+    "    linestyle=\"--\",\n",
+    "    label=f\"base rate ({base_rate:.1%})\",\n",
+    ")\n",
+    "ax.set_xticks(range(1, 11))\n",
+    "ax.set_xlabel(\"Score decile (1 = highest)\")\n",
+    "ax.set_ylabel(\"Conversion rate\")\n",
+    "ax.set_title(\"LR decile lift — intermediate tier (seed 42)\")\n",
+    "ax.legend(loc=\"upper right\")\n",
     "plt.tight_layout()\n",
-    "plt.show()\n",
-    "\n",
-    "print(f\"\\nTop {top_n} features:\")\n",
-    "for name, imp in feat_imp.head(top_n).items():\n",
-    "    print(f\"  {name}: {imp:.4f}\")"
+    "plt.show()"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell_017",
+   "metadata": {},
+   "source": "## 9. Calibration plot\n\nReliability diagram: bin predicted probabilities into 10 equal-\nwidth buckets, plot mean predicted vs mean observed. The\nvalidation report's reference reliability plot for the\nintermediate tier lives at\n`release/validation/figures/calibration_intermediate.png`."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_018",
    "metadata": {},
+   "outputs": [],
    "source": [
-    "## 7. Try other difficulty tiers\n",
-    "\n",
-    "Change `BUNDLE` at the top of this notebook to point to `intro/` or `advanced/` and re-run all cells.\n",
-    "You should see:\n",
-    "- **Intro:** Higher AUC (cleaner signal, ~2% missing values)\n",
-    "- **Intermediate:** Moderate AUC (~8% missing values, more noise)\n",
-    "- **Advanced:** Lower AUC (~18% missing values, much noisier)\n",
-    "\n",
-    "## Explore the relational tables\n",
-    "\n",
-    "The flat task splits are derived from 9 relational tables under `tables/`. You can engineer your own features:\n",
-    "\n",
-    "```python\n",
-    "touches = pd.read_parquet(f\"{BUNDLE}/tables/touches.parquet\")\n",
-    "sessions = pd.read_parquet(f\"{BUNDLE}/tables/sessions.parquet\")\n",
-    "# ... join, aggregate, and build features from raw events\n",
-    "```\n",
-    "\n",
-    "See the [leadforge README](https://github.com/leadforge-dev/leadforge) for more details."
+    "edges = np.linspace(0.0, 1.0, 11)\n",
+    "mean_pred = []\n",
+    "mean_actual = []\n",
+    "for i in range(10):\n",
+    "    lo, hi = edges[i], edges[i + 1]\n",
+    "    mask = (lr_probs >= lo) & ((lr_probs <= hi) if i == 9 else (lr_probs < hi))\n",
+    "    if mask.sum() == 0:\n",
+    "        continue\n",
+    "    mean_pred.append(lr_probs[mask].mean())\n",
+    "    mean_actual.append(y_test[mask].mean())\n",
+    "\n",
+    "fig, ax = plt.subplots(figsize=(5, 5))\n",
+    "ax.plot([0, 1], [0, 1], color=\"#9ca3af\", linestyle=\"--\", label=\"perfect calibration\")\n",
+    "ax.plot(mean_pred, mean_actual, marker=\"o\", color=\"#3b82f6\", label=\"LR\")\n",
+    "ax.set_xlim(0, 1)\n",
+    "ax.set_ylim(0, 1)\n",
+    "ax.set_xlabel(\"Mean predicted probability\")\n",
+    "ax.set_ylabel(\"Observed conversion rate\")\n",
+    "ax.set_title(\"Calibration — LR, intermediate tier (seed 42)\")\n",
+    "ax.legend(loc=\"upper left\")\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_019",
+   "metadata": {},
+   "source": "## 10. Next\n\n- **Notebook 02** — engineer features by joining the snapshot-\n  safe relational tables under `release/intermediate/tables/`,\n  then measure the lift over the flat-CSV LR baseline above.\n- **Notebook 03** *(coming in PR 6.2)* — leakage and time-window\n  walkthrough; works through what `total_touches_all` does to\n  your AUC if you forget to drop it.\n- **Notebook 04** *(coming in PR 6.2)* — value-aware ranking\n  (`expected_acv` × P(convert)), threshold selection, and the\n  cohort-shift stress test."
   }
  ],
  "metadata": {
@@ -360,9 +361,9 @@
   },
   "language_info": {
    "name": "python",
-   "version": "3.11.0"
+   "version": "3.11"
   }
  },
  "nbformat": 4,
- "nbformat_minor": 4
+ "nbformat_minor": 5
 }
diff --git a/release/notebooks/02_relational_feature_engineering.ipynb b/release/notebooks/02_relational_feature_engineering.ipynb
new file mode 100644
index 0000000..84a3b02
--- /dev/null
+++ b/release/notebooks/02_relational_feature_engineering.ipynb
@@ -0,0 +1,541 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "cell_000",
+   "metadata": {},
+   "source": "# Notebook 02 — Relational Feature Engineering\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier.\n\nThe flat task split in notebook 01 is one snapshot view of a\nricher relational world. The public bundle ships seven\n**snapshot-safe** tables under `release/intermediate/tables/`:\nevery event row is filtered to `timestamp <= lead_created_at +\nsnapshot_day`, so any join you can write is **leakage-safe by\nconstruction**.\n\nThis notebook walks through:\n\n1. Loading the seven public tables.\n2. Verifying the snapshot-safe contract inline (the teachable\n   moment — see the contract pass, don't just read about it).\n3. Engineering four relational features, with train-only\n   discipline for any aggregation that crosses splits.\n4. Training a HistGBM on `flat ∪ engineered` columns.\n5. Reporting the AUC / AP / Brier / P@K delta vs the flat-CSV\n   baseline, with a tolerance gate that fails CI if the\n   headline lift regresses.\n\n**Public path discipline (G13.3).** This notebook reads only\nfrom `release/intermediate/`. The instructor companion\n(`release/intermediate_instructor/`) is **not** loaded —\nrelational feature engineering must work from the public\nartefact alone. Tables omitted from the public bundle on\npurpose (`customers`, `subscriptions`) live only in the\ninstructor companion because their mere existence reconstructs\nthe label.\n\n**Leakage-trap discipline.** Unlike notebook 01 (which\nreproduces the validation report's panel verbatim and\ntherefore keeps `total_touches_all`), notebook 02 **drops**\nthe trap from the flat baseline. We're teaching feature\nengineering here; mixing a known-leaky column into the\n\"before\" panel would muddy the relational lift attribution."
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_001",
+   "metadata": {},
+   "source": "## 1. Setup"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_002",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from __future__ import annotations\n",
+    "\n",
+    "import json\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "from sklearn.compose import ColumnTransformer\n",
+    "from sklearn.ensemble import HistGradientBoostingClassifier\n",
+    "from sklearn.impute import SimpleImputer\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.metrics import (\n",
+    "    average_precision_score,\n",
+    "    brier_score_loss,\n",
+    "    roc_auc_score,\n",
+    ")\n",
+    "from sklearn.pipeline import Pipeline\n",
+    "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
+    "\n",
+    "sys.path.insert(0, str(Path.cwd()))\n",
+    "from _notebook_utils import assert_within_tolerance, precision_at_k\n",
+    "\n",
+    "SEED = 42\n",
+    "BUNDLE = Path(\"../intermediate\")  # public student bundle\n",
+    "TASK = \"converted_within_90_days\"\n",
+    "\n",
+    "with (BUNDLE / \"manifest.json\").open() as fh:\n",
+    "    manifest = json.load(fh)\n",
+    "assert manifest[\"exposure_mode\"] == \"student_public\"\n",
+    "assert manifest[\"relational_snapshot_safe\"] is True\n",
+    "SNAPSHOT_DAY = int(manifest[\"snapshot_day\"])\n",
+    "print(f\"snapshot_day = {SNAPSHOT_DAY}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_003",
+   "metadata": {},
+   "source": "## 2. Load the seven public tables\n\nThese are the only tables present under `release/intermediate/\ntables/`. `customers` and `subscriptions` are deliberately\nabsent — they only exist for *converted* leads, so their\npresence in a public bundle would reconstruct the label."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_004",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "PUBLIC_TABLES = (\n",
+    "    \"accounts\",\n",
+    "    \"contacts\",\n",
+    "    \"leads\",\n",
+    "    \"touches\",\n",
+    "    \"sessions\",\n",
+    "    \"sales_activities\",\n",
+    "    \"opportunities\",\n",
+    ")\n",
+    "tables = {name: pd.read_parquet(BUNDLE / \"tables\" / f\"{name}.parquet\") for name in PUBLIC_TABLES}\n",
+    "for name in PUBLIC_TABLES:\n",
+    "    print(f\"  {name:<20s} {tables[name].shape}\")\n",
+    "\n",
+    "train = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"train.parquet\")\n",
+    "test = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"test.parquet\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_005",
+   "metadata": {},
+   "source": "## 3. Verify the snapshot-safe contract\n\nFor every event-table row joined back to its lead, assert\n`timestamp <= lead_created_at + snapshot_day`. The reported\n**headroom** is the *minimum* gap any event row leaves between\nits timestamp and the snapshot cutoff — a non-negative number\nwhen the contract holds. Showing the actual minimum (rather\nthan just \"all <= cutoff\") makes the receipt honest: if a\nfuture regeneration ever shaves the contract close, you'll see\nthe headroom shrink before the assertion fires."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_006",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "leads_index = tables[\"leads\"][[\"lead_id\", \"lead_created_at\"]].copy()\n",
+    "leads_index[\"lead_created_at\"] = pd.to_datetime(leads_index[\"lead_created_at\"])\n",
+    "cutoff = leads_index.assign(cutoff=leads_index[\"lead_created_at\"] + pd.Timedelta(days=SNAPSHOT_DAY))\n",
+    "\n",
+    "EVENT_TABLES = [\n",
+    "    (\"touches\", \"touch_timestamp\"),\n",
+    "    (\"sessions\", \"session_timestamp\"),\n",
+    "    (\"sales_activities\", \"activity_timestamp\"),\n",
+    "    (\"opportunities\", \"created_at\"),\n",
+    "]\n",
+    "\n",
+    "for tbl, ts_col in EVENT_TABLES:\n",
+    "    df = tables[tbl][[\"lead_id\", ts_col]].merge(\n",
+    "        cutoff[[\"lead_id\", \"cutoff\"]], on=\"lead_id\", how=\"left\"\n",
+    "    )\n",
+    "    df[ts_col] = pd.to_datetime(df[ts_col])\n",
+    "    violations = df[df[ts_col] > df[\"cutoff\"]]\n",
+    "    assert len(violations) == 0, f\"{tbl}.{ts_col}: {len(violations)} rows past snapshot cutoff\"\n",
+    "    min_headroom = (df[\"cutoff\"] - df[ts_col]).min()\n",
+    "    min_headroom_days = float(min_headroom.total_seconds()) / 86400.0\n",
+    "    print(\n",
+    "        f\"  {tbl}.{ts_col}: {len(df):>6,} rows; \"\n",
+    "        f\"min headroom under cutoff: {min_headroom_days:6.2f} days\"\n",
+    "    )\n",
+    "\n",
+    "print()\n",
+    "print(\"OK — every event row in every public event table satisfies\")\n",
+    "print(f\"     timestamp <= lead_created_at + {SNAPSHOT_DAY} days.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_007",
+   "metadata": {},
+   "source": "## 4. Engineered features\n\nWe build four relational features. Each one starts as a\nper-lead aggregate over a single snapshot-safe table, then\njoins back into the per-lead snapshot. Aggregations that pool\nacross leads (account-level density, target encoding) are\n**fit on the train split only** and applied to test via\njoin — same train-only discipline that prevents target leakage\nin mean encoding.\n\n| # | Feature | Source table(s) | Aggregation |\n|---|---|---|---|\n| 1 | `touches_ch_*` (3 cols) | `touches` | per-lead × `touch_channel` count |\n| 2 | `account_avg_touches_per_lead` | `touches`, `leads`, train lead set | account-level rollup over **train leads only**, then merge back |\n| 3 | `days_since_last_activity` | `sales_activities`, `leads` | per-lead recency at snapshot cutoff |\n| 4 | `industry_target_encoding_train` | `accounts`, train | mean-target encoding **fit on train only** |"
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_008",
+   "metadata": {},
+   "source": "### 4.1 Touch-channel breakdown"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_009",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "touches = tables[\"touches\"]\n",
+    "channel_counts = touches.groupby([\"lead_id\", \"touch_channel\"]).size().unstack(fill_value=0)\n",
+    "channel_counts.columns = [f\"touches_ch_{c}\" for c in channel_counts.columns]\n",
+    "print(f\"channel feature columns: {list(channel_counts.columns)}\")\n",
+    "channel_counts.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_010",
+   "metadata": {},
+   "source": "### 4.2 Account-level touch density\n\nHow active is the account this lead belongs to, on average? An\naccount with many leads and many touches per lead is a\nstructurally different prospect than a one-touch account.\n\n**Train-only.** We compute the account-level rollup using only\ntrain leads' touches. Test leads in train-only-empty accounts\nfall back to 0 via `fillna`. This avoids letting test rows\ninfluence the train feature distribution — same discipline\napplied to mean-target encoding in 4.4."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_011",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train_lead_ids = set(train[\"lead_id\"].tolist())\n",
+    "tch_with_acct = touches.merge(tables[\"leads\"][[\"lead_id\", \"account_id\"]], on=\"lead_id\", how=\"left\")\n",
+    "tch_train = tch_with_acct[tch_with_acct[\"lead_id\"].isin(train_lead_ids)]\n",
+    "account_density = (\n",
+    "    tch_train.groupby(\"account_id\")\n",
+    "    .agg(\n",
+    "        account_total_touches=(\"touch_id\", \"count\"),\n",
+    "        account_lead_count=(\"lead_id\", \"nunique\"),\n",
+    "    )\n",
+    "    .assign(\n",
+    "        account_avg_touches_per_lead=lambda d: d[\"account_total_touches\"] / d[\"account_lead_count\"]\n",
+    "    )\n",
+    "    .reset_index()\n",
+    ")\n",
+    "print(f\"account_density rows: {len(account_density):,}  (accounts represented in train touches)\")\n",
+    "account_density.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_012",
+   "metadata": {},
+   "source": "### 4.3 Sales-activity recency at snapshot\n\nDays between the lead's most recent sales activity and the\nsnapshot cutoff (`lead_created_at + snapshot_day`). Recency\nis a classic engagement signal that's surprisingly hard to\nrecover from the flat snapshot directly."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_013",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sa = tables[\"sales_activities\"][[\"lead_id\", \"activity_timestamp\"]].copy()\n",
+    "sa[\"activity_timestamp\"] = pd.to_datetime(sa[\"activity_timestamp\"])\n",
+    "last_activity = (\n",
+    "    sa.groupby(\"lead_id\")[\"activity_timestamp\"]\n",
+    "    .max()\n",
+    "    .reset_index()\n",
+    "    .rename(columns={\"activity_timestamp\": \"last_activity_at\"})\n",
+    ")\n",
+    "last_activity = last_activity.merge(cutoff[[\"lead_id\", \"cutoff\"]], on=\"lead_id\")\n",
+    "last_activity[\"days_since_last_activity\"] = (\n",
+    "    last_activity[\"cutoff\"] - last_activity[\"last_activity_at\"]\n",
+    ").dt.total_seconds() / 86400\n",
+    "last_activity[[\"lead_id\", \"days_since_last_activity\"]].head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_014",
+   "metadata": {},
+   "source": "### 4.4 Industry target encoding (train-only, leakage-safe)\n\nReplace the `industry` string with the conversion rate observed\nfor that industry **on the training split only**. Computing\nthe encoding on test leaks the test labels into the features —\na textbook mistake; we avoid it explicitly."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_015",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tgt_enc = train.groupby(\"industry\")[TASK].mean().to_dict()\n",
+    "tgt_enc_global_mean = float(train[TASK].mean())\n",
+    "print(f\"per-industry train conversion rates ({len(tgt_enc)} industries):\")\n",
+    "for k, v in sorted(tgt_enc.items()):\n",
+    "    print(f\"  {k:<20s} {v:.3f}\")\n",
+    "print(f\"  fallback global mean: {tgt_enc_global_mean:.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_016",
+   "metadata": {},
+   "source": "### 4.5 Stitch features onto train and test"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_017",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ENGINEERED_NUMERIC = list(channel_counts.columns) + [\n",
+    "    \"account_avg_touches_per_lead\",\n",
+    "    \"days_since_last_activity\",\n",
+    "    \"industry_target_encoding_train\",\n",
+    "]\n",
+    "\n",
+    "\n",
+    "def attach_engineered(df: pd.DataFrame) -> pd.DataFrame:\n",
+    "    out = df.copy()\n",
+    "    out = out.merge(channel_counts, on=\"lead_id\", how=\"left\")\n",
+    "    for col in channel_counts.columns:\n",
+    "        out[col] = out[col].fillna(0).astype(int)\n",
+    "    out = out.merge(\n",
+    "        account_density[[\"account_id\", \"account_avg_touches_per_lead\"]],\n",
+    "        on=\"account_id\",\n",
+    "        how=\"left\",\n",
+    "    )\n",
+    "    out[\"account_avg_touches_per_lead\"] = (\n",
+    "        out[\"account_avg_touches_per_lead\"].fillna(0).astype(float)\n",
+    "    )\n",
+    "    out = out.merge(\n",
+    "        last_activity[[\"lead_id\", \"days_since_last_activity\"]],\n",
+    "        on=\"lead_id\",\n",
+    "        how=\"left\",\n",
+    "    )\n",
+    "    out[\"industry_target_encoding_train\"] = out[\"industry\"].map(tgt_enc).fillna(tgt_enc_global_mean)\n",
+    "    return out\n",
+    "\n",
+    "\n",
+    "train_eng = attach_engineered(train)\n",
+    "test_eng = attach_engineered(test)\n",
+    "print(f\"train_eng shape: {train_eng.shape}  ({train.shape[1]} -> {train_eng.shape[1]} cols)\")\n",
+    "print(f\"new columns: {ENGINEERED_NUMERIC}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_018",
+   "metadata": {},
+   "source": "## 5. Baseline + engineered models\n\nSame pipeline as notebook 01 (mirrors\n`leadforge.validation.release_quality._build_pipeline`). We\ntrain four models so the comparison is fair:\n\n| Model | Features | Compares against |\n|---|---|---|\n| LR-flat   | flat snapshot, no trap            | (validation report baseline) |\n| GBM-flat  | flat snapshot, no trap            | LR-flat |\n| LR-eng    | flat ∪ engineered, no trap, no raw `industry` | LR-flat |\n| GBM-eng   | flat ∪ engineered, no trap, no raw `industry` | GBM-flat — the headline lift |\n\nThe `+rel` pipelines drop the raw `industry` categorical\nbecause the train-only target encoding already represents it\nas a numeric column — feeding both would feed the same column\ntwice."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_019",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "feat_dict = pd.read_csv(BUNDLE / \"feature_dictionary.csv\")\n",
+    "trap_cols = feat_dict.loc[feat_dict[\"leakage_risk\"].astype(bool), \"name\"].tolist()\n",
+    "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n",
+    "EXCLUDE = set(ID_COLS + trap_cols + [TASK])\n",
+    "\n",
+    "base_cols = [c for c in train.columns if c not in EXCLUDE]\n",
+    "cat_base = [\n",
+    "    c\n",
+    "    for c in base_cols\n",
+    "    if not (pd.api.types.is_bool_dtype(train[c]) or pd.api.types.is_numeric_dtype(train[c]))\n",
+    "]\n",
+    "num_base = [c for c in base_cols if c not in cat_base]\n",
+    "# Drop raw `industry` from the +rel categorical list so the LR\n",
+    "# pipeline doesn't see it twice (one-hot + target-encoded).\n",
+    "cat_eng = [c for c in cat_base if c != \"industry\"]\n",
+    "print(f\"flat features: {len(base_cols)}  (numeric={len(num_base)}, categorical={len(cat_base)})\")\n",
+    "print(f\"engineered (numeric only): {len(ENGINEERED_NUMERIC)}\")\n",
+    "print(f\"+rel categorical list drops: {set(cat_base) - set(cat_eng)}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_020",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def _sanitize(df: pd.DataFrame, cat_cols: list[str]) -> pd.DataFrame:\n",
+    "    out = df.copy()\n",
+    "    for c in cat_cols:\n",
+    "        out[c] = out[c].astype(object).where(out[c].notna(), None)\n",
+    "    return out\n",
+    "\n",
+    "\n",
+    "def build_pipeline(num_cols: list[str], cat_cols: list[str], *, model: str) -> Pipeline:\n",
+    "    numeric_t = Pipeline(\n",
+    "        [\n",
+    "            (\"imputer\", SimpleImputer(strategy=\"median\")),\n",
+    "            (\"scaler\", StandardScaler()),\n",
+    "        ]\n",
+    "    )\n",
+    "    cat_t = Pipeline(\n",
+    "        [\n",
+    "            (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n",
+    "            (\n",
+    "                \"encoder\",\n",
+    "                OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False),\n",
+    "            ),\n",
+    "        ]\n",
+    "    )\n",
+    "    pre = ColumnTransformer(\n",
+    "        [(\"num\", numeric_t, num_cols), (\"cat\", cat_t, cat_cols)],\n",
+    "        remainder=\"drop\",\n",
+    "    )\n",
+    "    if model == \"lr\":\n",
+    "        clf = LogisticRegression(max_iter=1000, solver=\"lbfgs\", random_state=SEED)\n",
+    "    else:\n",
+    "        clf = HistGradientBoostingClassifier(random_state=SEED)\n",
+    "    return Pipeline([(\"preprocessor\", pre), (\"classifier\", clf)])\n",
+    "\n",
+    "\n",
+    "def fit_and_score(\n",
+    "    x_train_df: pd.DataFrame,\n",
+    "    x_test_df: pd.DataFrame,\n",
+    "    num_cols: list[str],\n",
+    "    cat_cols: list[str],\n",
+    "    *,\n",
+    "    model: str,\n",
+    ") -> np.ndarray:\n",
+    "    pipe = build_pipeline(num_cols, cat_cols, model=model)\n",
+    "    pipe.fit(_sanitize(x_train_df, cat_cols), y_train)\n",
+    "    return pipe.predict_proba(_sanitize(x_test_df, cat_cols))[:, 1]\n",
+    "\n",
+    "\n",
+    "y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n",
+    "y_test = test[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n",
+    "base_rate = float(y_test.mean())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_021",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "num_eng = num_base + ENGINEERED_NUMERIC\n",
+    "\n",
+    "probs_lr_flat = fit_and_score(train[base_cols], test[base_cols], num_base, cat_base, model=\"lr\")\n",
+    "probs_gbm_flat = fit_and_score(train[base_cols], test[base_cols], num_base, cat_base, model=\"gbm\")\n",
+    "probs_lr_eng = fit_and_score(\n",
+    "    train_eng[base_cols + ENGINEERED_NUMERIC],\n",
+    "    test_eng[base_cols + ENGINEERED_NUMERIC],\n",
+    "    num_eng,\n",
+    "    cat_eng,\n",
+    "    model=\"lr\",\n",
+    ")\n",
+    "probs_gbm_eng = fit_and_score(\n",
+    "    train_eng[base_cols + ENGINEERED_NUMERIC],\n",
+    "    test_eng[base_cols + ENGINEERED_NUMERIC],\n",
+    "    num_eng,\n",
+    "    cat_eng,\n",
+    "    model=\"gbm\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_022",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def panel(probs: np.ndarray, label: str) -> dict[str, float]:\n",
+    "    return {\n",
+    "        \"label\": label,\n",
+    "        \"auc\": float(roc_auc_score(y_test, probs)),\n",
+    "        \"ap\": float(average_precision_score(y_test, probs)),\n",
+    "        \"brier\": float(brier_score_loss(y_test, probs)),\n",
+    "        \"p@100\": precision_at_k(probs, y_test, 100),\n",
+    "    }\n",
+    "\n",
+    "\n",
+    "rows = [\n",
+    "    panel(probs_lr_flat, \"LR  flat        \"),\n",
+    "    panel(probs_gbm_flat, \"GBM flat        \"),\n",
+    "    panel(probs_lr_eng, \"LR  flat+rel    \"),\n",
+    "    panel(probs_gbm_eng, \"GBM flat+rel    \"),\n",
+    "]\n",
+    "print(f\"base rate: {base_rate:.3f}\")\n",
+    "print()\n",
+    "print(f\"{'model':<18s}  {'AUC':>7s}  {'AP':>7s}  {'Brier':>7s}  {'P@100':>7s}\")\n",
+    "for r in rows:\n",
+    "    print(f\"{r['label']:<18s}  {r['auc']:.4f}  {r['ap']:.4f}  {r['brier']:.4f}  {r['p@100']:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_023",
+   "metadata": {},
+   "source": "## 6. Lift over the flat baseline\n\n*Sign convention.* `ΔAUC`, `ΔAP`, `ΔP@100`: **higher is\nbetter** (positive = engineered features helped). `ΔBrier`:\n**lower is better** (Brier is a loss, so negative ΔBrier =\nimproved calibration)."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_024",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def delta(eng: np.ndarray, base: np.ndarray, name: str) -> dict[str, float]:\n",
+    "    return {\n",
+    "        \"label\": name,\n",
+    "        \"auc\": float(roc_auc_score(y_test, eng) - roc_auc_score(y_test, base)),\n",
+    "        \"ap\": float(average_precision_score(y_test, eng) - average_precision_score(y_test, base)),\n",
+    "        \"brier\": float(brier_score_loss(y_test, eng) - brier_score_loss(y_test, base)),\n",
+    "        \"p@100\": float(precision_at_k(eng, y_test, 100) - precision_at_k(base, y_test, 100)),\n",
+    "    }\n",
+    "\n",
+    "\n",
+    "deltas = [\n",
+    "    delta(probs_gbm_eng, probs_gbm_flat, \"GBM(eng) - GBM(flat)\"),\n",
+    "    delta(probs_gbm_eng, probs_lr_flat, \"GBM(eng) - LR(flat) \"),\n",
+    "    delta(probs_lr_eng, probs_lr_flat, \"LR(eng)  - LR(flat) \"),\n",
+    "]\n",
+    "print(f\"{'comparison':<22s}  {'ΔAUC':>8s}  {'ΔAP':>8s}  {'ΔBrier':>8s}  {'ΔP@100':>8s}\")\n",
+    "for d in deltas:\n",
+    "    print(\n",
+    "        f\"{d['label']:<22s}  {d['auc']:+8.4f}  {d['ap']:+8.4f}  \"\n",
+    "        f\"{d['brier']:+8.4f}  {d['p@100']:+8.4f}\"\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_025",
+   "metadata": {},
+   "source": "## 7. Tolerance gate\n\nPin the four model AUCs and the headline lift to per-metric\ntolerances so a regression in any pipeline component (feature\nengineering, leakage discipline, sklearn upgrade) breaks CI.\nTargets and tolerances are seed-42-specific by design — this\nnotebook is reproducible-by-seed, not a cross-seed median."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cell_026",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Single-seed (seed=42) AUCs observed on the as-shipped\n",
+    "# intermediate bundle.  Tolerances allow ±0.02 around each\n",
+    "# baseline (well outside numerical jitter, well inside the\n",
+    "# band that would let GBM(eng) silently drop below GBM(flat)).\n",
+    "NB02_TARGETS = {\n",
+    "    \"lr_flat_auc\": 0.8737,\n",
+    "    \"gbm_flat_auc\": 0.8432,\n",
+    "    \"lr_eng_auc\": 0.8763,\n",
+    "    \"gbm_eng_auc\": 0.8579,\n",
+    "    \"headline_lift_auc\": 0.0147,  # GBM(eng) - GBM(flat)\n",
+    "}\n",
+    "NB02_TOLERANCES = {\n",
+    "    \"lr_flat_auc\": 0.02,\n",
+    "    \"gbm_flat_auc\": 0.02,\n",
+    "    \"lr_eng_auc\": 0.02,\n",
+    "    \"gbm_eng_auc\": 0.02,\n",
+    "    \"headline_lift_auc\": 0.015,  # tighter — sign-aware below\n",
+    "}\n",
+    "\n",
+    "observed = {\n",
+    "    \"lr_flat_auc\": rows[0][\"auc\"],\n",
+    "    \"gbm_flat_auc\": rows[1][\"auc\"],\n",
+    "    \"lr_eng_auc\": rows[2][\"auc\"],\n",
+    "    \"gbm_eng_auc\": rows[3][\"auc\"],\n",
+    "    \"headline_lift_auc\": deltas[0][\"auc\"],\n",
+    "}\n",
+    "assert_within_tolerance(\n",
+    "    observed=observed,\n",
+    "    target=NB02_TARGETS,\n",
+    "    tolerances=NB02_TOLERANCES,\n",
+    "    label=\"notebook 02 metric panel (seed 42, intermediate)\",\n",
+    ")\n",
+    "assert observed[\"headline_lift_auc\"] > 0.0, (\n",
+    "    \"GBM(eng) − GBM(flat) AUC went non-positive — relational \"\n",
+    "    \"lift disappeared; investigate before merging.\"\n",
+    ")\n",
+    "print(\"OK — all panel metrics within tolerance and headline lift is positive.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell_027",
+   "metadata": {},
+   "source": "## 8. Honest takeaway\n\nOn seed 42 the GBM(eng) − GBM(flat) AUC lift is small\n(+0.0147). Cross-seed variance for `gbm_auc` on this bundle\nis ~0.027 (see `release/validation/validation_report.json`,\n`tiers.intermediate.spreads.gbm_auc`), so a single-seed lift\nof this size is **suggestive, not conclusive**. Confirming a\nreal signal needs a seed sweep — see the cohort-shift / seed\nharness coming in PR 6.2's notebook 04.\n\nThe lift also does **not** flip the sign of the GBM-vs-LR\ncomparison: GBM(eng) is still slightly below LR(flat). This\nis the same v1 finding documented in\n`release/validation/validation_report.md` (gate **G7.4.4**)\nand the dataset card: the v1 snapshot is dominated by\nroughly-linear signal, and HistGBM doesn't consistently beat\nLR on it. Engineered relational features narrow the gap; on\nthis seed they don't yet erase it.\n\nTwo takeaways for downstream users:\n\n1. **Joins on the public bundle are leakage-safe by\n   construction.** Section 3 above is the full proof. You can\n   aggregate any of the four event tables without policing the\n   horizon yourself.\n2. **Bring your own non-linearities.** If a feature\n   engineering choice (cross-table interactions, tree\n   kernels, learned embeddings, bigger seed sweeps) flips the\n   GBM-vs-LR sign reliably, that's a finding worth filing —\n   the *break_me_guide* template lands in PR 6.3.\n\n## Next\n\n- **Notebook 03** *(coming in PR 6.2)* — leakage and\n  time-window walkthrough, including the deliberate\n  `total_touches_all` trap notebook 01 keeps and this notebook\n  drops.\n- **Notebook 04** *(coming in PR 6.2)* — value-aware ranking,\n  calibration, and cohort-shift evaluation with a seed sweep."
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/release/notebooks/_notebook_utils.py b/release/notebooks/_notebook_utils.py
new file mode 100644
index 0000000..c2baf71
--- /dev/null
+++ b/release/notebooks/_notebook_utils.py
@@ -0,0 +1,116 @@
+"""Shared helpers for the public release notebooks.
+
+The release notebooks are downloaded by Kaggle / HF consumers alongside
+the parquet bundle.  They cannot rely on ``leadforge`` being installed
+in the consumer's environment, so the helpers needed inside the
+notebooks live here as a small sibling module with no project imports.
+
+The metric helpers mirror ``leadforge.validation.release_quality`` so
+notebook-side numbers and validation-report numbers compare apples to
+apples — same ranking convention, same tie-breaking, same calibration
+binning — and the ``assert_within_tolerance`` gate (G13.2) is meaningful.
+"""
+
+from __future__ import annotations
+
+import math
+from collections.abc import Mapping
+
+import numpy as np
+
+
+def precision_at_k(scores: np.ndarray, y: np.ndarray, k: int) -> float:
+    """Mean label of the top-``k`` rows by descending score.
+
+    Mirrors ``leadforge.validation.release_quality._precision_at_k``:
+    stable argsort so ties resolve identically to the validation report.
+    """
+    scores = np.asarray(scores)
+    y = np.asarray(y)
+    if k <= 0 or k > len(y):
+        return float("nan")
+    order = np.argsort(-scores, kind="stable")
+    return float(y[order[:k]].mean())
+
+
+def top_decile_rate(scores: np.ndarray, y: np.ndarray) -> float:
+    """Precision at the top 10% of ranked scores."""
+    n = len(y)
+    if n == 0:
+        return float("nan")
+    return precision_at_k(scores, y, max(1, int(round(n * 0.1))))
+
+
+def assert_within_tolerance(
+    observed: Mapping[str, float],
+    target: Mapping[str, float],
+    tolerances: Mapping[str, float] | float,
+    *,
+    label: str = "metrics",
+) -> None:
+    """Assert ``|observed[k] - target[k]| <= tol`` for every key in ``target``.
+
+    Backs the G13.2 acceptance gate inside the notebooks: once the
+    notebook has computed its own metrics it pins them against the
+    cross-seed-median values from ``release/validation/validation_report.md``
+    and fails loudly if the notebook drifts out of band.
+
+    The gate is intentionally strict about silent-pass paths:
+
+    * Non-finite ``observed`` or ``target`` values fail (rather than
+      slipping through because ``NaN > tol`` evaluates ``False``).
+    * When ``tolerances`` is a mapping, every key in ``target`` must
+      have an explicit tolerance — a missing entry is treated as a
+      configuration error and aborts the gate up front, instead of
+      defaulting to ``+inf`` and silently disabling the check for
+      that metric.
+
+    Args:
+        observed: Notebook-computed metrics, keyed by metric name.
+        target: Reference values from the validation report.
+        tolerances: Either a per-metric tolerance map (every key in
+            ``target`` must be present), or a single float applied to
+            every metric (G13.2's default is 0.05).
+        label: Human-readable name for the metric panel; appears in the
+            error message so the failing assertion identifies its source.
+
+    Raises:
+        AssertionError: when any metric falls outside its tolerance,
+            ``observed`` is missing a key listed in ``target``, an
+            ``observed`` / ``target`` value is non-finite, or
+            ``tolerances`` is a mapping that omits a required key.
+    """
+    if isinstance(tolerances, int | float):
+        per_key: Mapping[str, float] = {k: float(tolerances) for k in target}
+    else:
+        per_key = tolerances
+        missing_tolerances = [k for k in target if k not in per_key]
+        if missing_tolerances:
+            raise AssertionError(
+                f"{label}: tolerances mapping is missing entries for "
+                f"target metrics: {sorted(missing_tolerances)}.  "
+                "Falling back to +inf would silently disable the gate "
+                "for these metrics; declare an explicit tolerance per key."
+            )
+    failures: list[str] = []
+    for key, target_value in target.items():
+        if key not in observed:
+            failures.append(f"  {key}: missing from observed metrics")
+            continue
+        observed_f = float(observed[key])
+        target_f = float(target_value)
+        if not (math.isfinite(observed_f) and math.isfinite(target_f)):
+            failures.append(
+                f"  {key}: non-finite value (observed={observed_f}, "
+                f"target={target_f}) — gate refuses to silently pass NaN/inf"
+            )
+            continue
+        tol = float(per_key[key])
+        diff = abs(observed_f - target_f)
+        if diff > tol:
+            failures.append(
+                f"  {key}: observed={observed_f:.4f} target={target_f:.4f} "
+                f"|diff|={diff:.4f} > tol={tol:.4f}"
+            )
+    if failures:
+        raise AssertionError(f"{label} drifted outside tolerance:\n" + "\n".join(failures))
diff --git a/release/notebooks/_release_targets.json b/release/notebooks/_release_targets.json
new file mode 100644
index 0000000..7ed4cdb
--- /dev/null
+++ b/release/notebooks/_release_targets.json
@@ -0,0 +1,10 @@
+{
+ "_doc": "Cross-seed-median metric values from release/validation/validation_report.json, sliced to the metrics the release notebooks pin via assert_within_tolerance. Audited against the report by tests/release/notebooks/test_release_targets_match_report.py — if you change a value here, the test will fail unless the corresponding median in the validation report changes to match.",
+ "intermediate": {
+  "brier_score": 0.10963449613199748,
+  "gbm_auc": 0.875461913160326,
+  "lr_auc": 0.8858759553203998,
+  "lr_average_precision": 0.5752148545119874,
+  "top_decile_rate": 0.5866666666666667
+ }
+}
diff --git a/scripts/_release_notebook_common.py b/scripts/_release_notebook_common.py
new file mode 100644
index 0000000..ef25263
--- /dev/null
+++ b/scripts/_release_notebook_common.py
@@ -0,0 +1,101 @@
+"""Shared scaffolding for the ``scripts/build_release_notebook_*.py`` builders.
+
+The release-notebook builders share ~80% of their plumbing: imports,
+``md`` / ``code`` cell wrappers, the metadata block, and the
+``write_notebook`` step that serializes JSON and shells out to
+``ruff format`` so builder output matches the project's pre-commit hook
+byte-for-byte.  Putting them here keeps each per-notebook builder
+focused on its cell list.
+
+Cell IDs are assigned deterministically (``cell_000``, ``cell_001``,
+...) so re-running a builder produces a byte-identical ``.ipynb`` —
+audit-artifact-sync, the same invariant PR 4.1 / 5.1 / 5.2 use for
+``release/`` artifacts.  Nondeterministic IDs are the default in
+``nbformat.v4.new_*_cell``; without an explicit override every build
+diverges and the byte-equality test in
+``tests/scripts/test_release_notebook_builders.py`` fails.
+
+Each builder exposes an ``--out PATH`` flag (see ``builder_arg_parser``)
+so the byte-stability test can build into ``tmp_path`` rather than
+mutating the committed ``release/notebooks/<name>.ipynb`` in place.
+That removes a pytest-xdist race and the worktree-dirtying failure
+mode under interrupted runs.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import subprocess
+from pathlib import Path
+from textwrap import dedent
+
+import nbformat as nbf
+
+_KERNEL_METADATA = {
+    "kernelspec": {
+        "display_name": "Python 3",
+        "language": "python",
+        "name": "python3",
+    },
+    "language_info": {
+        "name": "python",
+        "version": "3.11",
+    },
+}
+
+
+def md(source: str) -> nbf.NotebookNode:
+    """Markdown cell with ``source`` dedented and surrounding blank lines stripped."""
+    return nbf.v4.new_markdown_cell(dedent(source).strip("\n"))
+
+
+def code(source: str) -> nbf.NotebookNode:
+    """Cleared code cell (no execution_count, no outputs)."""
+    cell = nbf.v4.new_code_cell(dedent(source).strip("\n"))
+    cell["execution_count"] = None
+    cell["outputs"] = []
+    return cell
+
+
+def assemble_notebook(cells: list[nbf.NotebookNode]) -> nbf.NotebookNode:
+    """Build a notebook from ``cells`` with deterministic cell IDs and stable metadata."""
+    nb = nbf.v4.new_notebook()
+    for index, cell in enumerate(cells):
+        cell["id"] = f"cell_{index:03d}"
+    nb.cells = cells
+    nb.metadata = _KERNEL_METADATA
+    return nb
+
+
+def write_notebook(out_path: Path, nb: nbf.NotebookNode) -> None:
+    """Write ``nb`` to ``out_path`` then run ``ruff format`` for hook-parity.
+
+    Without the post-write format step a contributor running the builder
+    would see pre-commit reformat their notebook on commit, defeating the
+    audit-artifact-sync invariant.  ``json.dumps`` parameters are pinned
+    so the pre-format bytes are deterministic.
+    """
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+    text = json.dumps(nb, indent=1, sort_keys=True, ensure_ascii=False)
+    out_path.write_text(text + "\n", encoding="utf-8")
+    subprocess.run(["ruff", "format", str(out_path)], check=True)  # noqa: S603, S607
+    print(f"wrote {out_path}")
+
+
+def builder_arg_parser(*, default_out: Path, description: str) -> argparse.ArgumentParser:
+    """Return the argparse parser shared by every notebook builder.
+
+    Exposes ``--out PATH`` so callers (notably the byte-stability test)
+    can redirect the build into a temp directory.  Defaults to the
+    canonical committed ``release/notebooks/<name>.ipynb`` path so the
+    no-arg invocation contributors run today keeps working.
+    """
+    parser = argparse.ArgumentParser(description=description)
+    parser.add_argument(
+        "--out",
+        type=Path,
+        default=default_out,
+        help=f"Path to write the notebook to (default: {default_out})",
+    )
+    return parser
diff --git a/scripts/build_public_release.py b/scripts/build_public_release.py
index f597d7b..5ccb152 100644
--- a/scripts/build_public_release.py
+++ b/scripts/build_public_release.py
@@ -125,6 +125,16 @@ def main() -> None:
             "Default: wall-clock now. Use this for reproducible bundles."
         ),
     )
+    parser.add_argument(
+        "--tier",
+        choices=[name for name, _, _ in BUNDLES] + ["all"],
+        default="all",
+        help=(
+            "Build only the named tier (default: all four). "
+            "Useful for CI jobs that need a single bundle — e.g. the "
+            "release-notebooks job only needs ``intermediate``."
+        ),
+    )
     args = parser.parse_args()
 
     output_root = Path(args.output_dir)
@@ -135,7 +145,8 @@ def main() -> None:
     if license_src.exists():
         shutil.copy2(license_src, output_root / "LICENSE")
 
-    for dir_name, exposure_mode, difficulty in BUNDLES:
+    selected = BUNDLES if args.tier == "all" else [b for b in BUNDLES if b[0] == args.tier]
+    for dir_name, exposure_mode, difficulty in selected:
         bundle_dir = output_root / dir_name
         print(f"Generating {dir_name} ({exposure_mode}, {difficulty})...", file=sys.stderr)
         generate_and_save(
@@ -162,7 +173,7 @@ def main() -> None:
 
     # Summary
     print("\n=== Release Summary ===")
-    for dir_name, _, _ in BUNDLES:
+    for dir_name, _, _ in selected:
         bundle_dir = output_root / dir_name
         print_summary(bundle_dir, dir_name)
 
diff --git a/scripts/build_release_notebook_01.py b/scripts/build_release_notebook_01.py
new file mode 100644
index 0000000..fc6b14f
--- /dev/null
+++ b/scripts/build_release_notebook_01.py
@@ -0,0 +1,437 @@
+"""One-shot builder for ``release/notebooks/01_baseline_lead_scoring.ipynb``.
+
+Run from the repository root::
+
+    python scripts/build_release_notebook_01.py
+
+Cells are assigned deterministic IDs by ``_release_notebook_common`` so
+re-running yields a byte-identical file — same audit-artifact-sync
+pattern PR 4.1 / 5.1 / 5.2 use for ``release/`` artifacts.  The byte
+equality is enforced in CI by ``tests/scripts/test_release_notebook_builders.py``.
+"""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+# Make ``scripts/`` importable regardless of how this file is loaded
+# (mirrors ``scripts/package_hf_release.py``'s sys.path dance).
+sys.path.insert(0, str(Path(__file__).resolve().parent))
+
+import nbformat as nbf  # noqa: E402 — must follow sys.path insert
+from _release_notebook_common import (  # noqa: E402 — must follow sys.path insert
+    assemble_notebook,
+    builder_arg_parser,
+    code,
+    md,
+    write_notebook,
+)
+
+DEFAULT_OUT = (
+    Path(__file__).resolve().parents[1] / "release" / "notebooks" / "01_baseline_lead_scoring.ipynb"
+)
+
+
+def cells() -> list[nbf.NotebookNode]:
+    return [
+        md(
+            """
+            # Notebook 01 — Baseline Lead Scoring
+
+            **Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier (the
+            release default).
+
+            **Goal:** train Logistic Regression and Histogram Gradient Boosting
+            baselines on the snapshot-safe public bundle, and verify they
+            reproduce the cross-seed-median metrics in
+            [`release/validation/validation_report.md`](../validation/validation_report.md)
+            within the per-metric tolerances fixed by acceptance gate **G13.2**.
+
+            **Public path discipline (G13.3).** This notebook reads only from
+            the public bundle at `release/intermediate/`. The instructor
+            companion (`release/intermediate_instructor/`, with full-horizon
+            event tables, the latent registry, the hidden DAG, and the
+            mechanism summary) is **not** loaded — public modelling work must
+            never depend on instructor-only artefacts.
+            """
+        ),
+        md("## 1. Setup"),
+        code(
+            """
+            from __future__ import annotations
+
+            import json
+            import sys
+            from pathlib import Path
+
+            import numpy as np
+            import pandas as pd
+            from sklearn.compose import ColumnTransformer
+            from sklearn.ensemble import HistGradientBoostingClassifier
+            from sklearn.impute import SimpleImputer
+            from sklearn.linear_model import LogisticRegression
+            from sklearn.metrics import (
+                average_precision_score,
+                brier_score_loss,
+                roc_auc_score,
+            )
+            from sklearn.pipeline import Pipeline
+            from sklearn.preprocessing import OneHotEncoder, StandardScaler
+
+            sys.path.insert(0, str(Path.cwd()))
+            from _notebook_utils import (
+                assert_within_tolerance,
+                precision_at_k,
+                top_decile_rate,
+            )
+
+            SEED = 42
+            BUNDLE = Path("../intermediate")          # public student bundle
+            TASK = "converted_within_90_days"
+            """
+        ),
+        md(
+            """
+            ## 2. Reproduction targets
+
+            We pin the cross-seed-median metrics for the *intermediate* tier
+            (seeds 42–46) from `release/validation/validation_report.json`.
+            The targets live in a sibling file
+            (`release/notebooks/_release_targets.json`) so they can't drift
+            from the validation report without an audit-sync test failure
+            in CI.
+
+            **Per-metric tolerances** are tighter than a flat 5 % band: the
+            cross-seed standard deviation in the report is well under 0.02
+            on AUC and Brier, and a flat ±0.05 would let a regression slip
+            through. Average-precision and the small-`k` `top_decile_rate`
+            stay at ±0.05 because their seed-to-seed variance is larger.
+            """
+        ),
+        code(
+            """
+            with (Path.cwd() / "_release_targets.json").open() as fh:
+                targets = json.load(fh)["intermediate"]
+
+            # Re-key the validation report's metric names into the metric
+            # names this notebook prints below, so the gate compares apples
+            # to apples.
+            VALIDATION_REPORT_TARGETS = {
+                "lr_auc": targets["lr_auc"],
+                "gbm_auc": targets["gbm_auc"],
+                "lr_average_precision": targets["lr_average_precision"],
+                "lr_brier": targets["brier_score"],
+                "lr_top_decile_rate": targets["top_decile_rate"],
+            }
+            TOLERANCES = {
+                "lr_auc": 0.02,                  # G13.2 — tighter than a flat 5%
+                "gbm_auc": 0.02,
+                "lr_average_precision": 0.05,    # higher seed variance
+                "lr_brier": 0.02,
+                "lr_top_decile_rate": 0.05,      # small-k variance
+            }
+            for k, v in VALIDATION_REPORT_TARGETS.items():
+                print(f"  target  {k:<24s} {v:.4f}  (tol ±{TOLERANCES[k]:.2f})")
+            """
+        ),
+        md(
+            """
+            ## 3. Load the bundle
+
+            We load the parquet task splits — the canonical format the
+            release ships in. The accompanying `lead_scoring.csv` is a
+            convenience export with the same rows but coerced dtypes;
+            sticking with parquet preserves nullable `Int64` / `Float64` /
+            `boolean` columns the way the validator sees them.
+            """
+        ),
+        code(
+            """
+            train = pd.read_parquet(BUNDLE / "tasks" / TASK / "train.parquet")
+            valid = pd.read_parquet(BUNDLE / "tasks" / TASK / "valid.parquet")
+            test = pd.read_parquet(BUNDLE / "tasks" / TASK / "test.parquet")
+
+            with (BUNDLE / "manifest.json").open() as fh:
+                manifest = json.load(fh)
+
+            assert manifest["exposure_mode"] == "student_public", (
+                "this notebook expects the public bundle"
+            )
+            assert manifest["relational_snapshot_safe"] is True
+
+            print(f"Train: {len(train):,} rows")
+            print(f"Valid: {len(valid):,} rows  (held out — not used here)")
+            print(f"Test:  {len(test):,} rows")
+            print()
+            print(f"Bundle exposure_mode: {manifest['exposure_mode']}")
+            print(f"Bundle snapshot_day:  {manifest['snapshot_day']}")
+            print(f"Bundle horizon_days:  {manifest['horizon_days']}")
+            print()
+            print("Conversion rates:")
+            for name, df in [("train", train), ("valid", valid), ("test", test)]:
+                print(f"  {name}: {df[TASK].mean():.1%}")
+            """
+        ),
+        md(
+            """
+            ## 4. Feature selection
+
+            We use the **same feature set as `release/validation/validation_report.json`**
+            so the gate in section 7 is a real reproduction check rather
+            than a related-but-different number. That means we drop only
+            the IDs and the label — every other column in `train` (including
+            `total_touches_all`, the documented leakage trap) goes into the
+            pipeline.
+
+            **About `total_touches_all`.** The feature dictionary flags it
+            with `leakage_risk = True`: it counts touches over the full
+            90-day horizon, which is post-snapshot data. The validation
+            report keeps it in the panel anyway because (a) its standalone
+            AUC is barely above 0.55 (see the *post_snapshot_aggregates*
+            baseline column in the report) and (b) the report exists to
+            measure the v1 dataset's *as-shipped* difficulty, leakage trap
+            included. **Notebook 03** *(coming in PR 6.2)* walks through
+            what dropping the trap does to performance and how to detect
+            similar traps from feature audits alone.
+            """
+        ),
+        code(
+            """
+            feat_dict = pd.read_csv(BUNDLE / "feature_dictionary.csv")
+            trap_cols = feat_dict.loc[
+                feat_dict["leakage_risk"].astype(bool), "name"
+            ].tolist()
+            ID_COLS = ["account_id", "contact_id", "lead_id", "lead_created_at"]
+            # Mirrors ``release_quality._partition_columns`` — IDs + label only.
+            EXCLUDE = set(ID_COLS + [TASK])
+
+            feature_cols = [c for c in train.columns if c not in EXCLUDE]
+            cat_cols = [
+                c
+                for c in feature_cols
+                if not (
+                    pd.api.types.is_bool_dtype(train[c])
+                    or pd.api.types.is_numeric_dtype(train[c])
+                )
+            ]
+            num_cols = [c for c in feature_cols if c not in cat_cols]
+
+            print(f"Leakage-trap columns kept (see narrative above): {trap_cols}")
+            print(f"Categorical features ({len(cat_cols)}): {cat_cols}")
+            print(f"Numeric features    ({len(num_cols)}): {num_cols}")
+            """
+        ),
+        md(
+            """
+            ## 5. Preprocessing pipeline
+
+            Mirrors `leadforge.validation.release_quality._build_pipeline`
+            so the notebook's metric panel and the validation report's
+            metric panel agree by construction:
+
+            - numeric: median-impute, then `StandardScaler`
+            - categorical: most-frequent-impute, then dense `OneHotEncoder`
+              with `handle_unknown="ignore"`
+            """
+        ),
+        code(
+            """
+            def _sanitize_categoricals(df: pd.DataFrame) -> pd.DataFrame:
+                out = df.copy()
+                for c in cat_cols:
+                    out[c] = out[c].astype(object).where(out[c].notna(), None)
+                return out
+
+            x_train = _sanitize_categoricals(train[feature_cols])
+            x_test = _sanitize_categoricals(test[feature_cols])
+            y_train = train[TASK].astype("boolean").fillna(False).astype(int).to_numpy()
+            y_test = test[TASK].astype("boolean").fillna(False).astype(int).to_numpy()
+
+            numeric_t = Pipeline(
+                [("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
+            )
+            categorical_t = Pipeline(
+                [
+                    ("imputer", SimpleImputer(strategy="most_frequent")),
+                    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
+                ]
+            )
+            preprocessor = ColumnTransformer(
+                transformers=[
+                    ("num", numeric_t, num_cols),
+                    ("cat", categorical_t, cat_cols),
+                ],
+                remainder="drop",
+            )
+            """
+        ),
+        md("## 6. Train baselines and score the test split"),
+        code(
+            """
+            lr_pipe = Pipeline(
+                [
+                    ("preprocessor", preprocessor),
+                    (
+                        "classifier",
+                        LogisticRegression(
+                            max_iter=1000, solver="lbfgs", random_state=SEED
+                        ),
+                    ),
+                ]
+            )
+            gbm_pipe = Pipeline(
+                [
+                    ("preprocessor", preprocessor),
+                    ("classifier", HistGradientBoostingClassifier(random_state=SEED)),
+                ]
+            )
+
+            lr_pipe.fit(x_train, y_train)
+            gbm_pipe.fit(x_train, y_train)
+
+            lr_probs = lr_pipe.predict_proba(x_test)[:, 1]
+            gbm_probs = gbm_pipe.predict_proba(x_test)[:, 1]
+
+            metrics = {
+                "lr_auc": float(roc_auc_score(y_test, lr_probs)),
+                "gbm_auc": float(roc_auc_score(y_test, gbm_probs)),
+                "lr_average_precision": float(average_precision_score(y_test, lr_probs)),
+                "lr_brier": float(brier_score_loss(y_test, lr_probs)),
+                "lr_top_decile_rate": top_decile_rate(lr_probs, y_test),
+                # Print-only; not pinned (the validation report tracks
+                # ``top_decile_rate`` instead, which we gate above).
+                "lr_precision_at_50": precision_at_k(lr_probs, y_test, 50),
+                "lr_precision_at_100": precision_at_k(lr_probs, y_test, 100),
+                "lr_precision_at_200": precision_at_k(lr_probs, y_test, 200),
+            }
+            for k, v in metrics.items():
+                print(f"  {k:<24s} {v:.4f}")
+            """
+        ),
+        md(
+            """
+            ## 7. Tolerance check (G13.2)
+
+            The notebook's printed metrics must match the cross-seed medians
+            in `validation_report.json` to within the per-metric tolerances
+            declared in section 2. If a future change breaks this, the
+            assertion below fails — and CI catches it, because the same
+            cell runs under `nbclient` in the `notebooks` job.
+            """
+        ),
+        code(
+            """
+            assert_within_tolerance(
+                observed=metrics,
+                target=VALIDATION_REPORT_TARGETS,
+                tolerances=TOLERANCES,
+                label="notebook 01 vs validation_report.json (intermediate tier)",
+            )
+            print("OK — all gated metrics are within their per-metric tolerance.")
+            """
+        ),
+        md(
+            """
+            ## 8. Decile lift chart
+
+            Standard sanity-check for ranking quality: sort the test set by
+            score, bucket into deciles, plot the per-decile conversion rate
+            vs the base rate.
+            """
+        ),
+        code(
+            """
+            import matplotlib.pyplot as plt
+
+            order = np.argsort(-lr_probs, kind="stable")
+            y_sorted = y_test[order]
+            n = len(y_test)
+            edges = np.linspace(0, n, 11, dtype=int)
+            decile_rate = np.array(
+                [y_sorted[edges[i] : edges[i + 1]].mean() for i in range(10)]
+            )
+            base_rate = y_test.mean()
+
+            fig, ax = plt.subplots(figsize=(7, 4))
+            ax.bar(range(1, 11), decile_rate, color="#3b82f6")
+            ax.axhline(
+                base_rate,
+                color="#ef4444",
+                linestyle="--",
+                label=f"base rate ({base_rate:.1%})",
+            )
+            ax.set_xticks(range(1, 11))
+            ax.set_xlabel("Score decile (1 = highest)")
+            ax.set_ylabel("Conversion rate")
+            ax.set_title("LR decile lift — intermediate tier (seed 42)")
+            ax.legend(loc="upper right")
+            plt.tight_layout()
+            plt.show()
+            """
+        ),
+        md(
+            """
+            ## 9. Calibration plot
+
+            Reliability diagram: bin predicted probabilities into 10 equal-
+            width buckets, plot mean predicted vs mean observed. The
+            validation report's reference reliability plot for the
+            intermediate tier lives at
+            `release/validation/figures/calibration_intermediate.png`.
+            """
+        ),
+        code(
+            """
+            edges = np.linspace(0.0, 1.0, 11)
+            mean_pred = []
+            mean_actual = []
+            for i in range(10):
+                lo, hi = edges[i], edges[i + 1]
+                mask = (lr_probs >= lo) & ((lr_probs <= hi) if i == 9 else (lr_probs < hi))
+                if mask.sum() == 0:
+                    continue
+                mean_pred.append(lr_probs[mask].mean())
+                mean_actual.append(y_test[mask].mean())
+
+            fig, ax = plt.subplots(figsize=(5, 5))
+            ax.plot([0, 1], [0, 1], color="#9ca3af", linestyle="--", label="perfect calibration")
+            ax.plot(mean_pred, mean_actual, marker="o", color="#3b82f6", label="LR")
+            ax.set_xlim(0, 1)
+            ax.set_ylim(0, 1)
+            ax.set_xlabel("Mean predicted probability")
+            ax.set_ylabel("Observed conversion rate")
+            ax.set_title("Calibration — LR, intermediate tier (seed 42)")
+            ax.legend(loc="upper left")
+            plt.tight_layout()
+            plt.show()
+            """
+        ),
+        md(
+            """
+            ## 10. Next
+
+            - **Notebook 02** — engineer features by joining the snapshot-
+              safe relational tables under `release/intermediate/tables/`,
+              then measure the lift over the flat-CSV LR baseline above.
+            - **Notebook 03** *(coming in PR 6.2)* — leakage and time-window
+              walkthrough; works through what `total_touches_all` does to
+              your AUC if you forget to drop it.
+            - **Notebook 04** *(coming in PR 6.2)* — value-aware ranking
+              (`expected_acv` × P(convert)), threshold selection, and the
+              cohort-shift stress test.
+            """
+        ),
+    ]
+
+
+def main() -> None:
+    args = builder_arg_parser(
+        default_out=DEFAULT_OUT,
+        description="Build release/notebooks/01_baseline_lead_scoring.ipynb",
+    ).parse_args()
+    write_notebook(args.out, assemble_notebook(cells()))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/build_release_notebook_02.py b/scripts/build_release_notebook_02.py
new file mode 100644
index 0000000..ea2cf67
--- /dev/null
+++ b/scripts/build_release_notebook_02.py
@@ -0,0 +1,663 @@
+"""One-shot builder for ``release/notebooks/02_relational_feature_engineering.ipynb``.
+
+Run from the repository root::
+
+    python scripts/build_release_notebook_02.py
+
+Cells are assigned deterministic IDs by ``_release_notebook_common`` so
+re-running yields a byte-identical file — same audit-artifact-sync
+pattern PR 4.1 / 5.1 / 5.2 use for ``release/`` artifacts.
+"""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent))
+
+import nbformat as nbf  # noqa: E402 — must follow sys.path insert
+from _release_notebook_common import (  # noqa: E402 — must follow sys.path insert
+    assemble_notebook,
+    builder_arg_parser,
+    code,
+    md,
+    write_notebook,
+)
+
+DEFAULT_OUT = (
+    Path(__file__).resolve().parents[1]
+    / "release"
+    / "notebooks"
+    / "02_relational_feature_engineering.ipynb"
+)
+
+
+def cells() -> list[nbf.NotebookNode]:
+    return [
+        md(
+            """
+            # Notebook 02 — Relational Feature Engineering
+
+            **Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier.
+
+            The flat task split in notebook 01 is one snapshot view of a
+            richer relational world. The public bundle ships seven
+            **snapshot-safe** tables under `release/intermediate/tables/`:
+            every event row is filtered to `timestamp <= lead_created_at +
+            snapshot_day`, so any join you can write is **leakage-safe by
+            construction**.
+
+            This notebook walks through:
+
+            1. Loading the seven public tables.
+            2. Verifying the snapshot-safe contract inline (the teachable
+               moment — see the contract pass, don't just read about it).
+            3. Engineering four relational features, with train-only
+               discipline for any aggregation that crosses splits.
+            4. Training a HistGBM on `flat ∪ engineered` columns.
+            5. Reporting the AUC / AP / Brier / P@K delta vs the flat-CSV
+               baseline, with a tolerance gate that fails CI if the
+               headline lift regresses.
+
+            **Public path discipline (G13.3).** This notebook reads only
+            from `release/intermediate/`. The instructor companion
+            (`release/intermediate_instructor/`) is **not** loaded —
+            relational feature engineering must work from the public
+            artefact alone. Tables omitted from the public bundle on
+            purpose (`customers`, `subscriptions`) live only in the
+            instructor companion because their mere existence reconstructs
+            the label.
+
+            **Leakage-trap discipline.** Unlike notebook 01 (which
+            reproduces the validation report's panel verbatim and
+            therefore keeps `total_touches_all`), notebook 02 **drops**
+            the trap from the flat baseline. We're teaching feature
+            engineering here; mixing a known-leaky column into the
+            "before" panel would muddy the relational lift attribution.
+            """
+        ),
+        md("## 1. Setup"),
+        code(
+            """
+            from __future__ import annotations
+
+            import json
+            import sys
+            from pathlib import Path
+
+            import numpy as np
+            import pandas as pd
+            from sklearn.compose import ColumnTransformer
+            from sklearn.ensemble import HistGradientBoostingClassifier
+            from sklearn.impute import SimpleImputer
+            from sklearn.linear_model import LogisticRegression
+            from sklearn.metrics import (
+                average_precision_score,
+                brier_score_loss,
+                roc_auc_score,
+            )
+            from sklearn.pipeline import Pipeline
+            from sklearn.preprocessing import OneHotEncoder, StandardScaler
+
+            sys.path.insert(0, str(Path.cwd()))
+            from _notebook_utils import assert_within_tolerance, precision_at_k
+
+            SEED = 42
+            BUNDLE = Path("../intermediate")          # public student bundle
+            TASK = "converted_within_90_days"
+
+            with (BUNDLE / "manifest.json").open() as fh:
+                manifest = json.load(fh)
+            assert manifest["exposure_mode"] == "student_public"
+            assert manifest["relational_snapshot_safe"] is True
+            SNAPSHOT_DAY = int(manifest["snapshot_day"])
+            print(f"snapshot_day = {SNAPSHOT_DAY}")
+            """
+        ),
+        md(
+            """
+            ## 2. Load the seven public tables
+
+            These are the only tables present under `release/intermediate/
+            tables/`. `customers` and `subscriptions` are deliberately
+            absent — they only exist for *converted* leads, so their
+            presence in a public bundle would reconstruct the label.
+            """
+        ),
+        code(
+            """
+            PUBLIC_TABLES = (
+                "accounts",
+                "contacts",
+                "leads",
+                "touches",
+                "sessions",
+                "sales_activities",
+                "opportunities",
+            )
+            tables = {
+                name: pd.read_parquet(BUNDLE / "tables" / f"{name}.parquet")
+                for name in PUBLIC_TABLES
+            }
+            for name in PUBLIC_TABLES:
+                print(f"  {name:<20s} {tables[name].shape}")
+
+            train = pd.read_parquet(BUNDLE / "tasks" / TASK / "train.parquet")
+            test = pd.read_parquet(BUNDLE / "tasks" / TASK / "test.parquet")
+            """
+        ),
+        md(
+            """
+            ## 3. Verify the snapshot-safe contract
+
+            For every event-table row joined back to its lead, assert
+            `timestamp <= lead_created_at + snapshot_day`. The reported
+            **headroom** is the *minimum* gap any event row leaves between
+            its timestamp and the snapshot cutoff — a non-negative number
+            when the contract holds. Showing the actual minimum (rather
+            than just "all <= cutoff") makes the receipt honest: if a
+            future regeneration ever shaves the contract close, you'll see
+            the headroom shrink before the assertion fires.
+            """
+        ),
+        code(
+            """
+            leads_index = tables["leads"][["lead_id", "lead_created_at"]].copy()
+            leads_index["lead_created_at"] = pd.to_datetime(leads_index["lead_created_at"])
+            cutoff = leads_index.assign(
+                cutoff=leads_index["lead_created_at"] + pd.Timedelta(days=SNAPSHOT_DAY)
+            )
+
+            EVENT_TABLES = [
+                ("touches", "touch_timestamp"),
+                ("sessions", "session_timestamp"),
+                ("sales_activities", "activity_timestamp"),
+                ("opportunities", "created_at"),
+            ]
+
+            for tbl, ts_col in EVENT_TABLES:
+                df = tables[tbl][["lead_id", ts_col]].merge(
+                    cutoff[["lead_id", "cutoff"]], on="lead_id", how="left"
+                )
+                df[ts_col] = pd.to_datetime(df[ts_col])
+                violations = df[df[ts_col] > df["cutoff"]]
+                assert len(violations) == 0, (
+                    f"{tbl}.{ts_col}: {len(violations)} rows past snapshot cutoff"
+                )
+                min_headroom = (df["cutoff"] - df[ts_col]).min()
+                min_headroom_days = float(min_headroom.total_seconds()) / 86400.0
+                print(
+                    f"  {tbl}.{ts_col}: {len(df):>6,} rows; "
+                    f"min headroom under cutoff: {min_headroom_days:6.2f} days"
+                )
+
+            print()
+            print("OK — every event row in every public event table satisfies")
+            print(f"     timestamp <= lead_created_at + {SNAPSHOT_DAY} days.")
+            """
+        ),
+        md(
+            """
+            ## 4. Engineered features
+
+            We build four relational features. Each one starts as a
+            per-lead aggregate over a single snapshot-safe table, then
+            joins back into the per-lead snapshot. Aggregations that pool
+            across leads (account-level density, target encoding) are
+            **fit on the train split only** and applied to test via
+            join — same train-only discipline that prevents target leakage
+            in mean encoding.
+
+            | # | Feature | Source table(s) | Aggregation |
+            |---|---|---|---|
+            | 1 | `touches_ch_*` (3 cols) | `touches` | per-lead × `touch_channel` count |
+            | 2 | `account_avg_touches_per_lead` | `touches`, `leads`, train lead set | account-level rollup over **train leads only**, then merge back |
+            | 3 | `days_since_last_activity` | `sales_activities`, `leads` | per-lead recency at snapshot cutoff |
+            | 4 | `industry_target_encoding_train` | `accounts`, train | mean-target encoding **fit on train only** |
+            """
+        ),
+        md("### 4.1 Touch-channel breakdown"),
+        code(
+            """
+            touches = tables["touches"]
+            channel_counts = (
+                touches.groupby(["lead_id", "touch_channel"]).size().unstack(fill_value=0)
+            )
+            channel_counts.columns = [f"touches_ch_{c}" for c in channel_counts.columns]
+            print(f"channel feature columns: {list(channel_counts.columns)}")
+            channel_counts.head()
+            """
+        ),
+        md(
+            """
+            ### 4.2 Account-level touch density
+
+            How active is the account this lead belongs to, on average? An
+            account with many leads and many touches per lead is a
+            structurally different prospect than a one-touch account.
+
+            **Train-only.** We compute the account-level rollup using only
+            train leads' touches. Test leads in train-only-empty accounts
+            fall back to 0 via `fillna`. This avoids letting test rows
+            influence the train feature distribution — same discipline
+            applied to mean-target encoding in 4.4.
+            """
+        ),
+        code(
+            """
+            train_lead_ids = set(train["lead_id"].tolist())
+            tch_with_acct = touches.merge(
+                tables["leads"][["lead_id", "account_id"]], on="lead_id", how="left"
+            )
+            tch_train = tch_with_acct[tch_with_acct["lead_id"].isin(train_lead_ids)]
+            account_density = (
+                tch_train.groupby("account_id")
+                .agg(
+                    account_total_touches=("touch_id", "count"),
+                    account_lead_count=("lead_id", "nunique"),
+                )
+                .assign(
+                    account_avg_touches_per_lead=lambda d: (
+                        d["account_total_touches"] / d["account_lead_count"]
+                    )
+                )
+                .reset_index()
+            )
+            print(
+                f"account_density rows: {len(account_density):,}  "
+                f"(accounts represented in train touches)"
+            )
+            account_density.head()
+            """
+        ),
+        md(
+            """
+            ### 4.3 Sales-activity recency at snapshot
+
+            Days between the lead's most recent sales activity and the
+            snapshot cutoff (`lead_created_at + snapshot_day`). Recency
+            is a classic engagement signal that's surprisingly hard to
+            recover from the flat snapshot directly.
+            """
+        ),
+        code(
+            """
+            sa = tables["sales_activities"][["lead_id", "activity_timestamp"]].copy()
+            sa["activity_timestamp"] = pd.to_datetime(sa["activity_timestamp"])
+            last_activity = (
+                sa.groupby("lead_id")["activity_timestamp"]
+                .max()
+                .reset_index()
+                .rename(columns={"activity_timestamp": "last_activity_at"})
+            )
+            last_activity = last_activity.merge(cutoff[["lead_id", "cutoff"]], on="lead_id")
+            last_activity["days_since_last_activity"] = (
+                last_activity["cutoff"] - last_activity["last_activity_at"]
+            ).dt.total_seconds() / 86400
+            last_activity[["lead_id", "days_since_last_activity"]].head()
+            """
+        ),
+        md(
+            """
+            ### 4.4 Industry target encoding (train-only, leakage-safe)
+
+            Replace the `industry` string with the conversion rate observed
+            for that industry **on the training split only**. Computing
+            the encoding on test leaks the test labels into the features —
+            a textbook mistake; we avoid it explicitly.
+            """
+        ),
+        code(
+            """
+            tgt_enc = train.groupby("industry")[TASK].mean().to_dict()
+            tgt_enc_global_mean = float(train[TASK].mean())
+            print(f"per-industry train conversion rates ({len(tgt_enc)} industries):")
+            for k, v in sorted(tgt_enc.items()):
+                print(f"  {k:<20s} {v:.3f}")
+            print(f"  fallback global mean: {tgt_enc_global_mean:.3f}")
+            """
+        ),
+        md("### 4.5 Stitch features onto train and test"),
+        code(
+            """
+            ENGINEERED_NUMERIC = (
+                list(channel_counts.columns)
+                + [
+                    "account_avg_touches_per_lead",
+                    "days_since_last_activity",
+                    "industry_target_encoding_train",
+                ]
+            )
+
+            def attach_engineered(df: pd.DataFrame) -> pd.DataFrame:
+                out = df.copy()
+                out = out.merge(channel_counts, on="lead_id", how="left")
+                for col in channel_counts.columns:
+                    out[col] = out[col].fillna(0).astype(int)
+                out = out.merge(
+                    account_density[["account_id", "account_avg_touches_per_lead"]],
+                    on="account_id",
+                    how="left",
+                )
+                out["account_avg_touches_per_lead"] = (
+                    out["account_avg_touches_per_lead"].fillna(0).astype(float)
+                )
+                out = out.merge(
+                    last_activity[["lead_id", "days_since_last_activity"]],
+                    on="lead_id",
+                    how="left",
+                )
+                out["industry_target_encoding_train"] = (
+                    out["industry"].map(tgt_enc).fillna(tgt_enc_global_mean)
+                )
+                return out
+
+            train_eng = attach_engineered(train)
+            test_eng = attach_engineered(test)
+            print(
+                f"train_eng shape: {train_eng.shape}  "
+                f"({train.shape[1]} -> {train_eng.shape[1]} cols)"
+            )
+            print(f"new columns: {ENGINEERED_NUMERIC}")
+            """
+        ),
+        md(
+            """
+            ## 5. Baseline + engineered models
+
+            Same pipeline as notebook 01 (mirrors
+            `leadforge.validation.release_quality._build_pipeline`). We
+            train four models so the comparison is fair:
+
+            | Model | Features | Compares against |
+            |---|---|---|
+            | LR-flat   | flat snapshot, no trap            | (validation report baseline) |
+            | GBM-flat  | flat snapshot, no trap            | LR-flat |
+            | LR-eng    | flat ∪ engineered, no trap, no raw `industry` | LR-flat |
+            | GBM-eng   | flat ∪ engineered, no trap, no raw `industry` | GBM-flat — the headline lift |
+
+            The `+rel` pipelines drop the raw `industry` categorical
+            because the train-only target encoding already represents it
+            as a numeric column — feeding both would feed the same column
+            twice.
+            """
+        ),
+        code(
+            """
+            feat_dict = pd.read_csv(BUNDLE / "feature_dictionary.csv")
+            trap_cols = feat_dict.loc[
+                feat_dict["leakage_risk"].astype(bool), "name"
+            ].tolist()
+            ID_COLS = ["account_id", "contact_id", "lead_id", "lead_created_at"]
+            EXCLUDE = set(ID_COLS + trap_cols + [TASK])
+
+            base_cols = [c for c in train.columns if c not in EXCLUDE]
+            cat_base = [
+                c
+                for c in base_cols
+                if not (
+                    pd.api.types.is_bool_dtype(train[c])
+                    or pd.api.types.is_numeric_dtype(train[c])
+                )
+            ]
+            num_base = [c for c in base_cols if c not in cat_base]
+            # Drop raw `industry` from the +rel categorical list so the LR
+            # pipeline doesn't see it twice (one-hot + target-encoded).
+            cat_eng = [c for c in cat_base if c != "industry"]
+            print(
+                f"flat features: {len(base_cols)}  "
+                f"(numeric={len(num_base)}, categorical={len(cat_base)})"
+            )
+            print(f"engineered (numeric only): {len(ENGINEERED_NUMERIC)}")
+            print(f"+rel categorical list drops: {set(cat_base) - set(cat_eng)}")
+            """
+        ),
+        code(
+            """
+            def _sanitize(df: pd.DataFrame, cat_cols: list[str]) -> pd.DataFrame:
+                out = df.copy()
+                for c in cat_cols:
+                    out[c] = out[c].astype(object).where(out[c].notna(), None)
+                return out
+
+            def build_pipeline(num_cols: list[str], cat_cols: list[str], *, model: str) -> Pipeline:
+                numeric_t = Pipeline(
+                    [
+                        ("imputer", SimpleImputer(strategy="median")),
+                        ("scaler", StandardScaler()),
+                    ]
+                )
+                cat_t = Pipeline(
+                    [
+                        ("imputer", SimpleImputer(strategy="most_frequent")),
+                        (
+                            "encoder",
+                            OneHotEncoder(handle_unknown="ignore", sparse_output=False),
+                        ),
+                    ]
+                )
+                pre = ColumnTransformer(
+                    [("num", numeric_t, num_cols), ("cat", cat_t, cat_cols)],
+                    remainder="drop",
+                )
+                if model == "lr":
+                    clf = LogisticRegression(max_iter=1000, solver="lbfgs", random_state=SEED)
+                else:
+                    clf = HistGradientBoostingClassifier(random_state=SEED)
+                return Pipeline([("preprocessor", pre), ("classifier", clf)])
+
+            def fit_and_score(
+                x_train_df: pd.DataFrame,
+                x_test_df: pd.DataFrame,
+                num_cols: list[str],
+                cat_cols: list[str],
+                *,
+                model: str,
+            ) -> np.ndarray:
+                pipe = build_pipeline(num_cols, cat_cols, model=model)
+                pipe.fit(_sanitize(x_train_df, cat_cols), y_train)
+                return pipe.predict_proba(_sanitize(x_test_df, cat_cols))[:, 1]
+
+            y_train = train[TASK].astype("boolean").fillna(False).astype(int).to_numpy()
+            y_test = test[TASK].astype("boolean").fillna(False).astype(int).to_numpy()
+            base_rate = float(y_test.mean())
+            """
+        ),
+        code(
+            """
+            num_eng = num_base + ENGINEERED_NUMERIC
+
+            probs_lr_flat = fit_and_score(
+                train[base_cols], test[base_cols], num_base, cat_base, model="lr"
+            )
+            probs_gbm_flat = fit_and_score(
+                train[base_cols], test[base_cols], num_base, cat_base, model="gbm"
+            )
+            probs_lr_eng = fit_and_score(
+                train_eng[base_cols + ENGINEERED_NUMERIC],
+                test_eng[base_cols + ENGINEERED_NUMERIC],
+                num_eng, cat_eng, model="lr",
+            )
+            probs_gbm_eng = fit_and_score(
+                train_eng[base_cols + ENGINEERED_NUMERIC],
+                test_eng[base_cols + ENGINEERED_NUMERIC],
+                num_eng, cat_eng, model="gbm",
+            )
+            """
+        ),
+        code(
+            """
+            def panel(probs: np.ndarray, label: str) -> dict[str, float]:
+                return {
+                    "label": label,
+                    "auc": float(roc_auc_score(y_test, probs)),
+                    "ap": float(average_precision_score(y_test, probs)),
+                    "brier": float(brier_score_loss(y_test, probs)),
+                    "p@100": precision_at_k(probs, y_test, 100),
+                }
+
+            rows = [
+                panel(probs_lr_flat, "LR  flat        "),
+                panel(probs_gbm_flat, "GBM flat        "),
+                panel(probs_lr_eng, "LR  flat+rel    "),
+                panel(probs_gbm_eng, "GBM flat+rel    "),
+            ]
+            print(f"base rate: {base_rate:.3f}")
+            print()
+            print(f"{'model':<18s}  {'AUC':>7s}  {'AP':>7s}  {'Brier':>7s}  {'P@100':>7s}")
+            for r in rows:
+                print(
+                    f"{r['label']:<18s}  {r['auc']:.4f}  {r['ap']:.4f}  "
+                    f"{r['brier']:.4f}  {r['p@100']:.4f}"
+                )
+            """
+        ),
+        md(
+            """
+            ## 6. Lift over the flat baseline
+
+            *Sign convention.* `ΔAUC`, `ΔAP`, `ΔP@100`: **higher is
+            better** (positive = engineered features helped). `ΔBrier`:
+            **lower is better** (Brier is a loss, so negative ΔBrier =
+            improved calibration).
+            """
+        ),
+        code(
+            """
+            def delta(eng: np.ndarray, base: np.ndarray, name: str) -> dict[str, float]:
+                return {
+                    "label": name,
+                    "auc": float(roc_auc_score(y_test, eng) - roc_auc_score(y_test, base)),
+                    "ap": float(
+                        average_precision_score(y_test, eng) - average_precision_score(y_test, base)
+                    ),
+                    "brier": float(
+                        brier_score_loss(y_test, eng) - brier_score_loss(y_test, base)
+                    ),
+                    "p@100": float(
+                        precision_at_k(eng, y_test, 100) - precision_at_k(base, y_test, 100)
+                    ),
+                }
+
+            deltas = [
+                delta(probs_gbm_eng, probs_gbm_flat, "GBM(eng) - GBM(flat)"),
+                delta(probs_gbm_eng, probs_lr_flat,  "GBM(eng) - LR(flat) "),
+                delta(probs_lr_eng,  probs_lr_flat,  "LR(eng)  - LR(flat) "),
+            ]
+            print(f"{'comparison':<22s}  {'ΔAUC':>8s}  {'ΔAP':>8s}  {'ΔBrier':>8s}  {'ΔP@100':>8s}")
+            for d in deltas:
+                print(
+                    f"{d['label']:<22s}  {d['auc']:+8.4f}  {d['ap']:+8.4f}  "
+                    f"{d['brier']:+8.4f}  {d['p@100']:+8.4f}"
+                )
+            """
+        ),
+        md(
+            """
+            ## 7. Tolerance gate
+
+            Pin the four model AUCs and the headline lift to per-metric
+            tolerances so a regression in any pipeline component (feature
+            engineering, leakage discipline, sklearn upgrade) breaks CI.
+            Targets and tolerances are seed-42-specific by design — this
+            notebook is reproducible-by-seed, not a cross-seed median.
+            """
+        ),
+        code(
+            """
+            # Single-seed (seed=42) AUCs observed on the as-shipped
+            # intermediate bundle.  Tolerances allow ±0.02 around each
+            # baseline (well outside numerical jitter, well inside the
+            # band that would let GBM(eng) silently drop below GBM(flat)).
+            NB02_TARGETS = {
+                "lr_flat_auc":  0.8737,
+                "gbm_flat_auc": 0.8432,
+                "lr_eng_auc":   0.8763,
+                "gbm_eng_auc":  0.8579,
+                "headline_lift_auc": 0.0147,  # GBM(eng) - GBM(flat)
+            }
+            NB02_TOLERANCES = {
+                "lr_flat_auc":  0.02,
+                "gbm_flat_auc": 0.02,
+                "lr_eng_auc":   0.02,
+                "gbm_eng_auc":  0.02,
+                "headline_lift_auc": 0.015,   # tighter — sign-aware below
+            }
+
+            observed = {
+                "lr_flat_auc":  rows[0]["auc"],
+                "gbm_flat_auc": rows[1]["auc"],
+                "lr_eng_auc":   rows[2]["auc"],
+                "gbm_eng_auc":  rows[3]["auc"],
+                "headline_lift_auc": deltas[0]["auc"],
+            }
+            assert_within_tolerance(
+                observed=observed,
+                target=NB02_TARGETS,
+                tolerances=NB02_TOLERANCES,
+                label="notebook 02 metric panel (seed 42, intermediate)",
+            )
+            assert observed["headline_lift_auc"] > 0.0, (
+                "GBM(eng) − GBM(flat) AUC went non-positive — relational "
+                "lift disappeared; investigate before merging."
+            )
+            print("OK — all panel metrics within tolerance and headline lift is positive.")
+            """
+        ),
+        md(
+            """
+            ## 8. Honest takeaway
+
+            On seed 42 the GBM(eng) − GBM(flat) AUC lift is small
+            (+0.0147). Cross-seed variance for `gbm_auc` on this bundle
+            is ~0.027 (see `release/validation/validation_report.json`,
+            `tiers.intermediate.spreads.gbm_auc`), so a single-seed lift
+            of this size is **suggestive, not conclusive**. Confirming a
+            real signal needs a seed sweep — see the cohort-shift / seed
+            harness coming in PR 6.2's notebook 04.
+
+            The lift also does **not** flip the sign of the GBM-vs-LR
+            comparison: GBM(eng) is still slightly below LR(flat). This
+            is the same v1 finding documented in
+            `release/validation/validation_report.md` (gate **G7.4.4**)
+            and the dataset card: the v1 snapshot is dominated by
+            roughly-linear signal, and HistGBM doesn't consistently beat
+            LR on it. Engineered relational features narrow the gap; on
+            this seed they don't yet erase it.
+
+            Two takeaways for downstream users:
+
+            1. **Joins on the public bundle are leakage-safe by
+               construction.** Section 3 above is the full proof. You can
+               aggregate any of the four event tables without policing the
+               horizon yourself.
+            2. **Bring your own non-linearities.** If a feature
+               engineering choice (cross-table interactions, tree
+               kernels, learned embeddings, bigger seed sweeps) flips the
+               GBM-vs-LR sign reliably, that's a finding worth filing —
+               the *break_me_guide* template lands in PR 6.3.
+
+            ## Next
+
+            - **Notebook 03** *(coming in PR 6.2)* — leakage and
+              time-window walkthrough, including the deliberate
+              `total_touches_all` trap notebook 01 keeps and this notebook
+              drops.
+            - **Notebook 04** *(coming in PR 6.2)* — value-aware ranking,
+              calibration, and cohort-shift evaluation with a seed sweep.
+            """
+        ),
+    ]
+
+
+def main() -> None:
+    args = builder_arg_parser(
+        default_out=DEFAULT_OUT,
+        description="Build release/notebooks/02_relational_feature_engineering.ipynb",
+    ).parse_args()
+    write_notebook(args.out, assemble_notebook(cells()))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/tests/release/__init__.py b/tests/release/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/release/notebooks/__init__.py b/tests/release/notebooks/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/release/notebooks/test_execute_notebooks.py b/tests/release/notebooks/test_execute_notebooks.py
new file mode 100644
index 0000000..24beec3
--- /dev/null
+++ b/tests/release/notebooks/test_execute_notebooks.py
@@ -0,0 +1,67 @@
+"""End-to-end execution gate for the public release notebooks (G13.1).
+
+Each notebook under ``release/notebooks/*.ipynb`` is executed top to
+bottom with ``nbclient`` against the committed public release bundles.
+A clean run is the contract:
+
+* G13.1 — every cell executes from a clean kernel without raising.
+* G13.2 — notebook 01's ``assert_within_tolerance`` cell pins notebook
+  metrics to the cross-seed-median targets in
+  ``release/notebooks/_release_targets.json`` (per-metric tolerances;
+  AUC/Brier ±0.02, AP / top-decile ±0.05). The targets file is itself
+  audit-synced against ``release/validation/validation_report.json``
+  by ``test_release_targets_match_report.py``, so the gate
+  transitively pins to the validation report.
+* G13.3 — neither notebook touches ``release/intermediate_instructor``
+  or any other instructor artefact (enforced by the notebooks' own
+  ``BUNDLE = Path("../intermediate")`` path discipline; an instructor
+  load would fail the public-mode manifest assertion in cell 1).
+
+The test is gated on the public release bundles being on disk (matches
+the HF/Kaggle smoke-test pattern in ``tests/scripts/test_package_*``).
+``nbclient`` lives in the optional ``[notebooks]`` extra; ``importorskip``
+keeps the dev install lean while letting the dedicated CI job run the
+gate against ``pip install -e ".[dev,scripts,notebooks]"``.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+
+_REPO_ROOT = Path(__file__).resolve().parents[3]
+_NOTEBOOKS_DIR = _REPO_ROOT / "release" / "notebooks"
+_RELEASE_BUNDLES_PRESENT = (_REPO_ROOT / "release" / "intermediate" / "manifest.json").exists()
+
+_NOTEBOOKS = [
+    "01_baseline_lead_scoring.ipynb",
+    "02_relational_feature_engineering.ipynb",
+]
+
+
+@pytest.mark.skipif(not _RELEASE_BUNDLES_PRESENT, reason="release bundles not present")
+@pytest.mark.parametrize("notebook_name", _NOTEBOOKS)
+def test_notebook_executes_end_to_end(notebook_name: str) -> None:
+    """Execute the notebook with nbclient and surface any cell error.
+
+    Each notebook hard-codes ``BUNDLE = Path("../intermediate")`` and
+    ``sys.path.insert(0, str(Path.cwd()))`` to import the sibling
+    ``_notebook_utils`` module — both work iff the kernel cwd is the
+    notebook directory, so we set ``resources={"metadata": {"path": ...}}``
+    accordingly.
+    """
+    nbformat = pytest.importorskip("nbformat", reason="nbformat not installed")
+    nbclient = pytest.importorskip("nbclient", reason="nbclient not installed")
+    pytest.importorskip("sklearn", reason="scikit-learn not installed")
+    pytest.importorskip("matplotlib", reason="matplotlib not installed")
+
+    notebook_path = _NOTEBOOKS_DIR / notebook_name
+    nb = nbformat.read(notebook_path, as_version=4)
+    client = nbclient.NotebookClient(
+        nb,
+        timeout=180,
+        kernel_name="python3",
+        resources={"metadata": {"path": str(_NOTEBOOKS_DIR)}},
+    )
+    client.execute()
diff --git a/tests/release/notebooks/test_notebook_utils.py b/tests/release/notebooks/test_notebook_utils.py
new file mode 100644
index 0000000..80e2f59
--- /dev/null
+++ b/tests/release/notebooks/test_notebook_utils.py
@@ -0,0 +1,247 @@
+"""Unit tests for ``release/notebooks/_notebook_utils.py``.
+
+The notebook helpers ship inside the public bundle (consumers download
+them alongside the parquet tables), so they cannot live inside the
+``leadforge`` package import tree.  These tests load the module through
+``importlib`` and exercise it the way a notebook cell would.
+"""
+
+from __future__ import annotations
+
+import importlib.util
+import sys
+from pathlib import Path
+
+import numpy as np
+import pytest
+
+_MODULE_PATH = Path(__file__).resolve().parents[3] / "release" / "notebooks" / "_notebook_utils.py"
+_spec = importlib.util.spec_from_file_location("_notebook_utils", _MODULE_PATH)
+assert _spec is not None
+assert _spec.loader is not None
+nbu = importlib.util.module_from_spec(_spec)
+sys.modules["_notebook_utils"] = nbu
+_spec.loader.exec_module(nbu)
+
+
+# ---------------------------------------------------------------------------
+# precision_at_k
+# ---------------------------------------------------------------------------
+
+
+def test_precision_at_k_simple() -> None:
+    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
+    y = np.array([1, 1, 0, 0, 0])
+    assert nbu.precision_at_k(scores, y, 2) == pytest.approx(1.0)
+    assert nbu.precision_at_k(scores, y, 5) == pytest.approx(0.4)
+
+
+def test_precision_at_k_handles_ties_via_stable_sort() -> None:
+    """Ties resolved by original order — same convention as
+    ``release_quality._precision_at_k`` so notebook and validation
+    report agree on tied-score rows.
+    """
+    scores = np.array([0.5, 0.5, 0.5, 0.5])
+    y = np.array([1, 0, 1, 0])
+    # Stable argsort of -scores preserves [0,1,2,3] order, so top-2 = y[:2]
+    assert nbu.precision_at_k(scores, y, 2) == pytest.approx(0.5)
+
+
+def test_precision_at_k_invalid_k_returns_nan() -> None:
+    scores = np.array([0.9, 0.8])
+    y = np.array([1, 0])
+    assert np.isnan(nbu.precision_at_k(scores, y, 0))
+    assert np.isnan(nbu.precision_at_k(scores, y, 3))
+
+
+def test_top_decile_rate_uses_10_percent_cut() -> None:
+    rng = np.random.default_rng(0)
+    scores = rng.random(100)
+    y = (scores > 0.9).astype(int)
+    # Top-10 by score = exactly the 10 positives → top decile rate = 1.0
+    assert nbu.top_decile_rate(scores, y) == pytest.approx(1.0)
+
+
+def test_top_decile_rate_empty_returns_nan() -> None:
+    assert np.isnan(nbu.top_decile_rate(np.array([]), np.array([])))
+
+
+# ---------------------------------------------------------------------------
+# assert_within_tolerance
+# ---------------------------------------------------------------------------
+
+
+def test_assert_within_tolerance_passes_when_inside_band() -> None:
+    nbu.assert_within_tolerance(
+        observed={"auc": 0.88, "ap": 0.57},
+        target={"auc": 0.886, "ap": 0.575},
+        tolerances=0.05,
+    )
+
+
+def test_assert_within_tolerance_fails_when_outside_band() -> None:
+    with pytest.raises(AssertionError) as exc:
+        nbu.assert_within_tolerance(
+            observed={"auc": 0.50},
+            target={"auc": 0.886},
+            tolerances=0.05,
+        )
+    msg = str(exc.value)
+    assert "auc" in msg
+    assert "observed=0.5000" in msg
+    assert "target=0.8860" in msg
+
+
+def test_assert_within_tolerance_per_metric_tolerances() -> None:
+    nbu.assert_within_tolerance(
+        observed={"auc": 0.83, "brier": 0.105},
+        target={"auc": 0.886, "brier": 0.110},
+        tolerances={"auc": 0.10, "brier": 0.05},
+    )
+
+
+def test_assert_within_tolerance_reports_missing_key() -> None:
+    with pytest.raises(AssertionError, match="missing from observed metrics"):
+        nbu.assert_within_tolerance(
+            observed={"auc": 0.88},
+            target={"auc": 0.886, "brier": 0.110},
+            tolerances=0.05,
+        )
+
+
+def test_assert_within_tolerance_label_appears_in_error() -> None:
+    with pytest.raises(AssertionError, match="notebook 01"):
+        nbu.assert_within_tolerance(
+            observed={"auc": 0.5},
+            target={"auc": 0.886},
+            tolerances=0.05,
+            label="notebook 01",
+        )
+
+
+def test_assert_within_tolerance_aggregates_multiple_failures() -> None:
+    with pytest.raises(AssertionError) as exc:
+        nbu.assert_within_tolerance(
+            observed={"auc": 0.50, "ap": 0.10},
+            target={"auc": 0.886, "ap": 0.575},
+            tolerances=0.05,
+        )
+    msg = str(exc.value)
+    assert "auc" in msg
+    assert "ap" in msg
+
+
+def test_assert_within_tolerance_ignores_extra_observed_keys() -> None:
+    """Observed metrics may carry extras (e.g. GBM AUC); the gate only
+    enforces the keys present in ``target``.
+    """
+    nbu.assert_within_tolerance(
+        observed={"auc": 0.88, "ap": 0.57, "extra": 999.0},
+        target={"auc": 0.886, "ap": 0.575},
+        tolerances=0.05,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Silent-pass paths: NaN / inf observed and target, missing tolerance keys
+# ---------------------------------------------------------------------------
+
+
+def test_assert_within_tolerance_fails_on_nan_observed() -> None:
+    """``NaN > tol`` is ``False`` in IEEE 754; without an explicit
+    finiteness check a NaN-valued observed metric would slip through
+    the gate.
+    """
+    with pytest.raises(AssertionError, match="non-finite"):
+        nbu.assert_within_tolerance(
+            observed={"auc": float("nan")},
+            target={"auc": 0.886},
+            tolerances=0.05,
+        )
+
+
+def test_assert_within_tolerance_fails_on_inf_observed() -> None:
+    with pytest.raises(AssertionError, match="non-finite"):
+        nbu.assert_within_tolerance(
+            observed={"auc": float("inf")},
+            target={"auc": 0.886},
+            tolerances=0.05,
+        )
+
+
+def test_assert_within_tolerance_fails_on_nan_target() -> None:
+    """A NaN target would also produce a NaN diff and bypass the gate."""
+    with pytest.raises(AssertionError, match="non-finite"):
+        nbu.assert_within_tolerance(
+            observed={"auc": 0.85},
+            target={"auc": float("nan")},
+            tolerances=0.05,
+        )
+
+
+def test_assert_within_tolerance_rejects_incomplete_tolerance_mapping() -> None:
+    """A per-metric tolerances dict missing keys present in ``target``
+    used to default each missing key to ``+inf``, silently disabling the
+    gate for those metrics.  The fix is to fail up front with the list
+    of missing keys, treating it as a configuration error.
+    """
+    with pytest.raises(AssertionError, match="missing entries for target metrics"):
+        nbu.assert_within_tolerance(
+            observed={"auc": 0.88, "brier": 0.10},
+            target={"auc": 0.886, "brier": 0.110},
+            tolerances={"auc": 0.05},  # ``brier`` deliberately missing
+        )
+
+
+# ---------------------------------------------------------------------------
+# precision_at_k must mirror release_quality._precision_at_k
+# ---------------------------------------------------------------------------
+
+
+def test_precision_at_k_mirrors_release_quality() -> None:
+    """The notebook helper's docstring claims byte-equivalence with
+    ``leadforge.validation.release_quality._precision_at_k`` (same
+    stable argsort, same tie-breaking).  This test pins that claim:
+    if either implementation drifts, the notebook's reproduction gate
+    silently drifts with it.
+    """
+    from leadforge.validation import release_quality
+
+    rng = np.random.default_rng(0)
+    scores = rng.random(1000)
+    y = (rng.random(1000) > 0.7).astype(int)
+
+    for k in (1, 10, 50, 100, 250, 500, 999):
+        nbu_value = nbu.precision_at_k(scores, y, k)
+        rq_value = release_quality._precision_at_k(scores, y, k)
+        assert nbu_value == pytest.approx(rq_value), (
+            f"divergence at k={k}: notebook helper {nbu_value}, release_quality {rq_value}"
+        )
+
+    # Tied scores — the convention drift this is most likely to surface.
+    tied_scores = np.array([0.5, 0.5, 0.5, 0.5, 0.4, 0.4, 0.3])
+    tied_y = np.array([1, 0, 1, 0, 1, 1, 0])
+    for k in (1, 2, 4, 6, 7):
+        assert nbu.precision_at_k(tied_scores, tied_y, k) == pytest.approx(
+            release_quality._precision_at_k(tied_scores, tied_y, k)
+        )
+
+
+def test_top_decile_rate_mirrors_release_quality() -> None:
+    """``release_quality._top_decile_rate`` and the notebook helper share
+    the same ``max(1, int(round(n * 0.1)))`` k-selection rule today.
+    Lock that in: if either side ever changes (e.g. switches to
+    ``ceil`` or ``floor`` on edge cases), the gate would silently
+    diverge from the validation report.  Includes the exact `n` the
+    intermediate tier ships (``n_test = 750``) and a few small `n`
+    where banker's rounding bites.
+    """
+    from leadforge.validation import release_quality
+
+    rng = np.random.default_rng(0)
+    for n in (5, 10, 25, 99, 100, 750, 1234):
+        scores = rng.random(n)
+        y = (rng.random(n) > 0.7).astype(int)
+        assert nbu.top_decile_rate(scores, y) == pytest.approx(
+            release_quality._top_decile_rate(scores, y)
+        ), f"divergence at n={n}"
diff --git a/tests/release/notebooks/test_release_targets_match_report.py b/tests/release/notebooks/test_release_targets_match_report.py
new file mode 100644
index 0000000..70f4edf
--- /dev/null
+++ b/tests/release/notebooks/test_release_targets_match_report.py
@@ -0,0 +1,44 @@
+"""Audit-sync gate: ``release/notebooks/_release_targets.json`` must
+mirror the cross-seed-median values in
+``release/validation/validation_report.json``.
+
+Notebook 01 loads the targets file at runtime and pins its computed
+metrics against it via ``assert_within_tolerance`` (G13.2).  Without
+this audit, the targets file could silently drift from the validation
+report — at which point the notebook's "reproduction" gate stops
+reproducing anything in particular.  This test fails as soon as either
+file moves out of step.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+_REPO_ROOT = Path(__file__).resolve().parents[3]
+_TARGETS_PATH = _REPO_ROOT / "release" / "notebooks" / "_release_targets.json"
+_REPORT_PATH = _REPO_ROOT / "release" / "validation" / "validation_report.json"
+
+
+def test_release_targets_match_validation_report() -> None:
+    targets = json.loads(_TARGETS_PATH.read_text())
+    report = json.loads(_REPORT_PATH.read_text())
+
+    for tier_name, tier_targets in targets.items():
+        if tier_name.startswith("_"):
+            continue  # ``_doc`` and any other meta keys
+        assert tier_name in report["tiers"], (
+            f"targets file mentions tier {tier_name!r} which is absent from "
+            f"validation_report.json (known tiers: {list(report['tiers'])})"
+        )
+        report_medians = report["tiers"][tier_name]["medians"]
+        for metric_name, target_value in tier_targets.items():
+            assert metric_name in report_medians, (
+                f"{tier_name}.{metric_name}: pinned in targets file but absent "
+                f"from validation_report medians"
+            )
+            assert target_value == report_medians[metric_name], (
+                f"{tier_name}.{metric_name}: targets file has {target_value} "
+                f"but validation_report median is {report_medians[metric_name]} — "
+                "regenerate the report or update _release_targets.json"
+            )
diff --git a/tests/scripts/test_release_notebook_builders.py b/tests/scripts/test_release_notebook_builders.py
new file mode 100644
index 0000000..c539a77
--- /dev/null
+++ b/tests/scripts/test_release_notebook_builders.py
@@ -0,0 +1,82 @@
+"""Byte-stability gate for the release-notebook builders.
+
+The builders advertise an audit-artifact-sync invariant (PR 4.1 / 5.1 /
+5.2 pattern): re-running the builder must produce a byte-identical
+``.ipynb``, and the committed file under ``release/notebooks/`` must
+equal a fresh build.  Without this test the invariant is wishful
+thinking — ``nbformat.v4.new_*_cell`` randomises cell IDs by default,
+so an unguarded builder silently diverges on every run.
+
+The builders accept ``--out PATH`` so this test can build into
+``tmp_path`` instead of mutating the committed ``release/notebooks/``
+file.  That keeps the test parallel-safe (pytest-xdist running both
+parametrised cases at once won't race), interrupt-safe (an interrupted
+run can't leave the working tree dirty), and side-effect-free (the
+committed notebook is never touched).
+"""
+
+from __future__ import annotations
+
+import subprocess
+import sys
+from pathlib import Path
+
+import pytest
+
+_REPO_ROOT = Path(__file__).resolve().parents[2]
+_SCRIPTS_DIR = _REPO_ROOT / "scripts"
+_NOTEBOOKS_DIR = _REPO_ROOT / "release" / "notebooks"
+
+_BUILDERS: list[tuple[str, str]] = [
+    ("build_release_notebook_01.py", "01_baseline_lead_scoring.ipynb"),
+    ("build_release_notebook_02.py", "02_relational_feature_engineering.ipynb"),
+]
+
+
+@pytest.mark.parametrize(("builder_name", "notebook_name"), _BUILDERS)
+def test_builder_is_byte_stable_and_matches_committed(
+    tmp_path: Path,
+    builder_name: str,
+    notebook_name: str,
+) -> None:
+    """Build twice into ``tmp_path`` via ``--out``; assert the two runs
+    produce byte-identical output and that the committed notebook
+    matches them.
+    """
+    # ``nbformat`` lives in the optional ``[notebooks]`` extra; the
+    # main ``test`` CI job installs only ``[dev]`` and would otherwise
+    # see the subprocess-invoked builders crash with
+    # ``ModuleNotFoundError: nbformat``.  The dedicated ``notebooks``
+    # CI job installs ``[dev,scripts,notebooks]`` and runs this test
+    # alongside ``test_execute_notebooks.py``.
+    pytest.importorskip("nbformat", reason="nbformat not installed (use [notebooks] extra)")
+
+    builder_path = _SCRIPTS_DIR / builder_name
+    committed_path = _NOTEBOOKS_DIR / notebook_name
+    assert builder_path.exists(), f"missing builder: {builder_path}"
+    assert committed_path.exists(), f"missing committed notebook: {committed_path}"
+
+    run_a = tmp_path / "run_a.ipynb"
+    run_b = tmp_path / "run_b.ipynb"
+
+    subprocess.run(  # noqa: S603 — sys.executable + repo-internal builder path
+        [sys.executable, str(builder_path), "--out", str(run_a)],
+        check=True,
+        cwd=_REPO_ROOT,
+    )
+    subprocess.run(  # noqa: S603 — sys.executable + repo-internal builder path
+        [sys.executable, str(builder_path), "--out", str(run_b)],
+        check=True,
+        cwd=_REPO_ROOT,
+    )
+
+    assert run_a.read_bytes() == run_b.read_bytes(), (
+        f"{builder_name}: two runs produced different bytes — cell IDs are "
+        "non-deterministic; pass an explicit ``id=`` to nbformat cell "
+        "constructors (see scripts/_release_notebook_common.py)"
+    )
+    assert committed_path.read_bytes() == run_a.read_bytes(), (
+        f"{notebook_name}: committed file does not match a fresh build of "
+        f"{builder_name} — re-run the builder and commit the result "
+        "(audit-artifact-sync, same pattern as PR 4.1 / 5.1 / 5.2)"
+    )