leadforge-dev · shaypal5 · May 7, 2026 · May 7, 2026 · May 7, 2026 · May 7, 2026
diff --git a/.agent-plan.md b/.agent-plan.md
@@ -58,8 +58,8 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
 - [x] PR 5.2 (self-review pass): brutal review of the first revision caught real bugs the reviewer would otherwise have to call out. Fixes folded into the PR before review:  (#1) `run_packager` validate→write order — both packagers were writing the README/metadata even when validation failed; the validation gate now early-returns with `errors` populated and zero artifacts on disk; new tests exercise the no-write path on both sides.  (#2) Instructor README was inlining the public 3-tier README for a 1-tier dataset; replaced with a dedicated `INSTRUCTOR_BODY` constant that opens by linking to the public dataset, describes only the instructor-specific additions (full-horizon tables, hidden DAG, latent registry, mechanism summary), and uses the single-tier tree block.  (#3) `validate_upload_dir_safe` now also rejects strict descendants of `release_dir` (e.g. `--huggingface-dir release/intro` would otherwise rmtree the intro bundle); allow-list keeps the canonical `release/{kaggle,huggingface,huggingface-instructor}` direct-children.  (#4) `[publish]` extra in `pyproject.toml` (`datasets>=2.14`, `kaggle>=1.6`) makes the gated `load_dataset()` / Kaggle-CLI tests installable in a single command — closes the "G12.3/G12.4 untested in CI" gap to a one-line install.  (#5) Shared-primitives extraction finished: `SOURCE_TREE_BLOCK`, `validate_readme_substitution`, `replace_file`, `replace_dir`, `load_manifest` all moved to `scripts/_release_common.py`; both packagers reduced to imports.  (#6) Hand-rolled YAML renderer (60 lines + brittle quoting heuristic) replaced with `yaml.safe_dump` + a 4-line `_IndentedDumper` subclass that forces indent-2 on top-level sequences.  (#7) Dead `--owner` / `--dataset-slug` CLI flags removed (PR 7.2 will add them when actually needed).  (#8) `assemble_upload_dir` now takes `rendered_readme` as a parameter and writes it itself; the public name no longer lies about producing a complete tree.  (#9) `build_config_for_tier` made pure (no I/O); `_assert_tier_dir_exists` does the cheap manifest-stat preflight.  (#10) `--default-config` with `--variant=instructor` now errors instead of silently ignoring.  (#11) Instructor tree-diagram drops the hardcoded "9 tables" claim.  (#13–#16) Visual cleanups (duplicate divider, ruff-split imports, `COVER_IMAGE_FILENAME`-vs-`Path.name` redundancy, speculative comment about HF split rename).  (#17) Test cruft removed (unused `tmp_path`, dead `tag_lines`); em-dash YAML round-trip parametrised for the instructor `pretty_name`.  Net: 1223/1223 tests pass + 5 gated skips (4 `datasets`-SDK round-trip + 1 Kaggle `kaggle`-SDK from PR 5.1); ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR doesn't touch the bundle shape).
 
 ### Phase 6 — Notebook sequence + adversarial framing
-- [ ] `release/notebooks/{02_relational_feature_engineering,03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb`
-- [ ] Update `01_baseline_lead_scoring.ipynb` to reproduce validation report metrics
+- [x] PR 6.1: `release/notebooks/01_baseline_lead_scoring.ipynb` refreshed and `release/notebooks/02_relational_feature_engineering.ipynb` added.  Notebook 01 trains LR + HistGBM on the public `intermediate` bundle using the **same feature set as the validation report** (drops only IDs and the label, mirrors `release_quality._partition_columns`), so the G13.2 reproduction gate compares apples to apples.  This means notebook 01 **keeps** `total_touches_all` (the documented leakage trap) — narrative cell calls it out explicitly and forward-points to notebook 03 (PR 6.2) which dissects what dropping the trap does to performance.  Notebook 02 by contrast **drops** the trap from the flat baseline so the relational lift attribution stays clean (its goal is teaching feature engineering, not reproducing the report).  Targets are loaded at runtime from `release/notebooks/_release_targets.json` (audit-synced against `release/validation/validation_report.json` by `tests/release/notebooks/test_release_targets_match_report.py`); per-metric tolerances replace the original flat ±0.05 (AUC/Brier ±0.02, AP / top-decile ±0.05).  Notebook 02 loads the seven snapshot-safe public tables, asserts every event-table `timestamp <= lead_created_at + snapshot_day` inline (with real min-headroom-under-cutoff readings, not a hardcoded literal), demonstrates four legal joins (touch-channel breakdown, account-level density fit on **train leads only**, sales-activity recency, train-only industry target encoding), trains LR + GBM on flat-baseline-only and flat+relational features, prints a 4-row metric panel + delta panel, and pins the four model AUCs and the headline `GBM(eng) − GBM(flat)` lift via `assert_within_tolerance` (sign-aware `assert lift > 0` on top of the absolute tolerance).  Honest takeaway cell frames the +0.0147 AUC lift as suggestive, not conclusive (the cross-seed `gbm_auc` spread on this bundle is ~0.027); seed-sweep harness lands in PR 6.2's notebook 04.  Both notebooks ship inside the public release bundle alongside the parquet tables (Kaggle/HF consumers download them together) so they import a sibling `release/notebooks/_notebook_utils.py` rather than rely on the `leadforge` package — `precision_at_k` and `top_decile_rate` mirror `release_quality._precision_at_k` / `_top_decile_rate` (locked in by mirror tests), and `assert_within_tolerance` is hardened against silent passes on non-finite metrics or incomplete per-metric tolerance maps.  G13.1 acceptance gate wired: new `[notebooks]` extra (`nbclient`, `nbformat`, `scikit-learn`, `matplotlib`) and a dedicated `notebooks` CI job that regenerates the intermediate bundle via `python scripts/build_public_release.py release --tier intermediate` (only tier the notebooks need) then nbclient-executes both notebooks end-to-end (`tests/release/notebooks/test_execute_notebooks.py`, parametrised, gated on bundles-present).  G13.3 path discipline enforced inline: notebook 01 hard-codes `BUNDLE = Path("../intermediate")` and asserts `manifest.exposure_mode == "student_public"`; notebook 02 explicitly excludes `customers`/`subscriptions` per `BANNED_TABLES`.  Builders (`scripts/build_release_notebook_{01,02}.py`, sharing `scripts/_release_notebook_common.py`) emit deterministic byte-for-byte notebook JSON via explicit `cell_NNN` IDs (audit-artifact-sync pattern from PR 4.1 / 5.1 / 5.2, locked in by `tests/scripts/test_release_notebook_builders.py` which builds twice into `tmp_path` via the new `--out PATH` flag and diffs against the committed file without ever touching the working tree) and shell out to `ruff format` on the emitted file so builder output and pre-commit hook agree.  Net: 1250/1250 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
+- [ ] `release/notebooks/{03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb`
 - [ ] `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml`
 - [ ] `docs/release/{break_me_guide,v2_decision_log}.md`
 

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -18,7 +18,11 @@ jobs:
       - uses: actions/setup-python@v5
         with:
           python-version: "3.12"
-      - run: pip install ruff
+      # Pin ruff so notebook .ipynb formatting (which the lint job is
+      # strict about) doesn't drift between contributor laptops and CI
+      # the moment a new ruff release ships.  Bump in lock-step with
+      # any local re-runs of ``scripts/build_release_notebook_*.py``.
+      - run: pip install 'ruff==0.15.12'
       - run: ruff check .
       - run: ruff format --check .
 
@@ -138,3 +142,22 @@ jobs:
       - name: Skip v7 (no dataset)
         if: steps.check-v7.outputs.found != 'true'
         run: echo "No v7 datasets found — skipping v7 validation"
+
+  notebooks:
+    name: Execute release notebooks (G13.1)
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - run: pip install -e ".[dev,scripts,notebooks]"
+      - name: Register python3 kernelspec for nbclient
+        run: python -m ipykernel install --user --name python3
+      - name: Build the intermediate public bundle (only tier the notebooks need)
+        run: python scripts/build_public_release.py release --tier intermediate
+      - name: Execute release notebooks end-to-end + builder byte-stability
+        run: |
+          pytest tests/release/notebooks/test_execute_notebooks.py \
+                 tests/scripts/test_release_notebook_builders.py \
+                 -v
diff --git a/pyproject.toml b/pyproject.toml
@@ -54,6 +54,25 @@ publish = [
     "datasets>=2.14",
     "kaggle>=1.6",
 ]
+# Optional dependencies for executing the public release notebooks.
+# Installing this extra (``pip install -e ".[notebooks]"``) enables the
+# G13.1 acceptance gate: the CI ``notebooks`` job nbclient-executes
+# ``release/notebooks/*.ipynb`` end-to-end against a freshly built
+# public bundle, asserting the notebooks reproduce validation_report.md
+# metrics within ±0.05 (G13.2) and never load instructor artefacts
+# (G13.3).
+notebooks = [
+    "nbclient>=0.10",
+    "nbformat>=5.10",
+    # ``ipykernel`` provides the ``python3`` kernelspec that
+    # ``nbclient.NotebookClient(..., kernel_name="python3")`` looks up.
+    # Without it CI fails with ``NoSuchKernel: No such kernel named
+    # python3`` because the GitHub-hosted runner has no kernelspecs
+    # registered out of the box (local dev environments usually do).
+    "ipykernel>=6.0",
+    "scikit-learn>=1.3",
+    "matplotlib>=3.7",
+]
 
 [project.scripts]
 leadforge = "leadforge.cli.main:app"
@@ -74,6 +93,16 @@ select = ["E", "F", "I", "N", "W", "UP", "B", "C4", "PT", "S"]
 
 [tool.ruff.lint.per-file-ignores]
 "tests/**/*" = ["S101", "S108"]
+# Release notebooks deliberately use ``assert`` for contract checks
+# (path discipline, snapshot-safe joins, G13.2 tolerance gate).  ``assert``
+# is the conventional notebook idiom and these are exactly the cells we
+# want to fail loud on regression.
+"release/notebooks/**/*.ipynb" = ["S101"]
+# Notebook builders emit dedented heredocs whose lines render as
+# markdown tables and print-statement output inside the notebook.
+# Line length is a property of the rendered cell, not the .py source,
+# so 100c is the wrong yardstick here.
+"scripts/build_release_notebook_*.py" = ["E501"]
 
 [tool.mypy]
 python_version = "3.11"