From f109a2149fbd775f7d69c8cf8fa378a3cebba879 Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Thu, 7 May 2026 07:38:46 +0300 Subject: [PATCH 1/6] PR 6.1: notebook 01 refresh + notebook 02 relational feature engineering + G13.1 CI gate MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Refreshes ``release/notebooks/01_baseline_lead_scoring.ipynb`` and adds ``release/notebooks/02_relational_feature_engineering.ipynb``, the first two notebooks of the Phase 6 sequence. Both ship inside the public bundle alongside the parquet tables (Kaggle/HF consumers download them together), so they import a sibling ``_notebook_utils.py`` rather than relying on the ``leadforge`` package being installed. Notebook 01 (refresh) — Reproduces ``release/validation/validation_report.md`` intermediate-tier metrics within +/-0.05 (G13.2): trains LR + HistGBM on the public ``intermediate`` bundle, reports AUC / AP / P@K(50/100/200) / Brier, drops the documented ``total_touches_all`` leakage trap (still mentions it for pedagogy + forward-points to notebook 03), then calls ``assert_within_tolerance`` to fail loud on drift before rendering the decile-lift bar chart and 10-bin reliability diagram. Hard-codes ``BUNDLE = Path("../intermediate")`` and asserts ``manifest.exposure_mode == "student_public"`` so the public-only path discipline (G13.3) is enforced inline. Notebook 02 (new) — Loads the seven snapshot-safe public tables (``customers`` / ``subscriptions`` are not in public bundles per ``BANNED_TABLES``), asserts every ``timestamp <= lead_created_at + snapshot_day`` for all four event tables, demonstrates four legal joins (touch-channel breakdown, account-level density, sales-activity recency, train-only industry target encoding), trains LR + GBM on flat-baseline-only and flat+relational features, and prints a 4-row metric panel + delta panel. The honest takeaway cell quotes G7.4.4: GBM(eng)-GBM(flat) is the headline lift, GBM(eng)-LR(flat) is still slightly negative, consistent with the v1 finding that flat-CSV LR is hard to beat. Helpers + tests: - ``release/notebooks/_notebook_utils.py``: ``precision_at_k`` (stable argsort, mirrors ``release_quality._precision_at_k``), ``top_decile_rate``, ``assert_within_tolerance`` (per-metric or scalar tolerance, aggregates failures). Imported by both notebooks via ``sys.path.insert(0, str(Path.cwd())); from _notebook_utils import ...``. - ``tests/release/notebooks/test_notebook_utils.py``: 12 unit tests on the helpers (loaded via ``importlib.util`` since the module isn't on the package import path). - ``tests/release/notebooks/test_execute_notebooks.py``: parametrised nbclient end-to-end execution of both notebooks, gated on the public release bundles being on disk (matches ``test_package_*`` skip pattern). CI wiring (G13.1): - ``[notebooks]`` extra in ``pyproject.toml`` (nbclient, nbformat, scikit-learn, matplotlib) -- keeps the dev install lean while letting the dedicated CI job run the gate. - New ``notebooks`` job in ``.github/workflows/ci.yml`` that installs ``[dev,scripts,notebooks]``, regenerates the public release bundles via ``scripts/build_public_release.py`` (~6 s wall clock), then runs the nbclient execution test. Builders: - ``scripts/_build_release_notebook_01.py`` and ``_build_release_notebook_02.py`` are nbformat-driven one-shot builders so notebook source stays reviewable in the PR diff (instead of hand-edited JSON). Each builder shells out to ``ruff format`` on its emitted file, so the builder output and pre-commit hook agree byte-for-byte (preserves the audit-artifact-sync invariant PR 4.1 / 5.1 / 5.2 established). Per-file ruff ignore for E501 because dedented heredoc lines render as notebook content (markdown tables, print-statement output) where 100c is the wrong yardstick. Acceptance results on this branch: - ruff check . / ruff format --check . -- clean - mypy leadforge/ -- clean (82 source files) - pytest -- 1246 passed, 5 skipped (publish-extra-gated) - ``probe_relational_leakage`` on intro/intermediate/advanced -- 0/3 paths flag rate on every tier - ``verify_hash_determinism`` -- PASS 67/67 - ``validate_release_candidate --no-rebuild`` -- PASS, 0 leakage findings - ``BUNDLE_SCHEMA_VERSION`` unchanged at 5 Co-Authored-By: Claude Opus 4.7 --- .github/workflows/ci.yml | 14 + pyproject.toml | 23 + .../notebooks/01_baseline_lead_scoring.ipynb | 449 +++++++------- .../02_relational_feature_engineering.ipynb | 474 +++++++++++++++ release/notebooks/_notebook_utils.py | 87 +++ scripts/_build_release_notebook_01.py | 430 +++++++++++++ scripts/_build_release_notebook_02.py | 570 ++++++++++++++++++ tests/release/__init__.py | 0 tests/release/notebooks/__init__.py | 0 .../notebooks/test_execute_notebooks.py | 62 ++ .../release/notebooks/test_notebook_utils.py | 142 +++++ 11 files changed, 2018 insertions(+), 233 deletions(-) create mode 100644 release/notebooks/02_relational_feature_engineering.ipynb create mode 100644 release/notebooks/_notebook_utils.py create mode 100644 scripts/_build_release_notebook_01.py create mode 100644 scripts/_build_release_notebook_02.py create mode 100644 tests/release/__init__.py create mode 100644 tests/release/notebooks/__init__.py create mode 100644 tests/release/notebooks/test_execute_notebooks.py create mode 100644 tests/release/notebooks/test_notebook_utils.py diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 323f858..afc1fff 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -138,3 +138,17 @@ jobs: - name: Skip v7 (no dataset) if: steps.check-v7.outputs.found != 'true' run: echo "No v7 datasets found — skipping v7 validation" + + notebooks: + name: Execute release notebooks (G13.1) + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-python@v5 + with: + python-version: "3.12" + - run: pip install -e ".[dev,scripts,notebooks]" + - name: Build public release bundles (intermediate tier) + run: python scripts/build_public_release.py release + - name: Execute release notebooks end-to-end + run: pytest tests/release/notebooks/test_execute_notebooks.py -v diff --git a/pyproject.toml b/pyproject.toml index b7f1de8..e689338 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -54,6 +54,19 @@ publish = [ "datasets>=2.14", "kaggle>=1.6", ] +# Optional dependencies for executing the public release notebooks. +# Installing this extra (``pip install -e ".[notebooks]"``) enables the +# G13.1 acceptance gate: the CI ``notebooks`` job nbclient-executes +# ``release/notebooks/*.ipynb`` end-to-end against a freshly built +# public bundle, asserting the notebooks reproduce validation_report.md +# metrics within ±0.05 (G13.2) and never load instructor artefacts +# (G13.3). +notebooks = [ + "nbclient>=0.10", + "nbformat>=5.10", + "scikit-learn>=1.3", + "matplotlib>=3.7", +] [project.scripts] leadforge = "leadforge.cli.main:app" @@ -74,6 +87,16 @@ select = ["E", "F", "I", "N", "W", "UP", "B", "C4", "PT", "S"] [tool.ruff.lint.per-file-ignores] "tests/**/*" = ["S101", "S108"] +# Release notebooks deliberately use ``assert`` for contract checks +# (path discipline, snapshot-safe joins, G13.2 tolerance gate). ``assert`` +# is the conventional notebook idiom and these are exactly the cells we +# want to fail loud on regression. +"release/notebooks/**/*.ipynb" = ["S101"] +# Notebook builders emit dedented heredocs whose lines render as +# markdown tables and print-statement output inside the notebook. +# Line length is a property of the rendered cell, not the .py source, +# so 100c is the wrong yardstick here. +"scripts/_build_release_notebook_*.py" = ["E501"] [tool.mypy] python_version = "3.11" diff --git a/release/notebooks/01_baseline_lead_scoring.ipynb b/release/notebooks/01_baseline_lead_scoring.ipynb index 8861584..f9281b2 100644 --- a/release/notebooks/01_baseline_lead_scoring.ipynb +++ b/release/notebooks/01_baseline_lead_scoring.ipynb @@ -2,354 +2,337 @@ "cells": [ { "cell_type": "markdown", + "id": "68d902d7", "metadata": {}, - "source": [ - "# Baseline Lead Scoring Models\n", - "\n", - "This notebook trains baseline models on the **LeadForge B2B Lead Scoring** dataset.\n", - "It works directly from the pre-generated Parquet files -- no `leadforge` installation required.\n", - "\n", - "We'll cover:\n", - "1. Loading the task splits\n", - "2. Exploring the features\n", - "3. Training Logistic Regression and Gradient Boosting baselines\n", - "4. Evaluating with AUC, PR-AUC, and Precision@K\n", - "5. Value-aware ranking (probability vs. expected value)\n", - "6. Feature importance\n", - "\n", - "**Requirements:** `pandas`, `scikit-learn`, `matplotlib` (all available in Kaggle notebooks by default)." - ] + "source": "# Notebook 01 — Baseline Lead Scoring\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier (the\nrelease default).\n\n**Goal:** train Logistic Regression and Histogram Gradient Boosting\nbaselines on the snapshot-safe public bundle, and verify they\nreproduce the cross-seed-median metrics in\n[`release/validation/validation_report.md`](../validation/validation_report.md)\nwithin the **±0.05** tolerance fixed by acceptance gate **G13.2**.\n\n**Public path discipline (G13.3).** This notebook reads only from\nthe public bundle at `release/intermediate/`. The instructor\ncompanion (`release/intermediate_instructor/`, with full-horizon\nevent tables, the latent registry, the hidden DAG, and the\nmechanism summary) is **not** loaded — public modelling work must\nnever depend on instructor-only artefacts." }, { "cell_type": "markdown", + "id": "b89f80f8", "metadata": {}, - "source": [ - "## 1. Load the data\n", - "\n", - "We use the `intermediate` difficulty tier. Change the path to `intro/` or `advanced/` to try other tiers." - ] + "source": "## 1. Setup" }, { "cell_type": "code", "execution_count": null, + "id": "3c83d9f4", "metadata": {}, "outputs": [], "source": [ + "from __future__ import annotations\n", + "\n", + "import sys\n", + "from pathlib import Path\n", + "\n", "import numpy as np\n", "import pandas as pd\n", + "from sklearn.compose import ColumnTransformer\n", + "from sklearn.ensemble import HistGradientBoostingClassifier\n", + "from sklearn.impute import SimpleImputer\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.metrics import (\n", + " average_precision_score,\n", + " brier_score_loss,\n", + " roc_auc_score,\n", + ")\n", + "from sklearn.pipeline import Pipeline\n", + "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", "\n", - "# Adjust this path to your dataset location\n", - "BUNDLE = \"../intermediate\"\n", - "TASK = \"converted_within_90_days\"\n", - "\n", - "train = pd.read_parquet(f\"{BUNDLE}/tasks/{TASK}/train.parquet\")\n", - "valid = pd.read_parquet(f\"{BUNDLE}/tasks/{TASK}/valid.parquet\")\n", - "test = pd.read_parquet(f\"{BUNDLE}/tasks/{TASK}/test.parquet\")\n", + "sys.path.insert(0, str(Path.cwd()))\n", + "from _notebook_utils import assert_within_tolerance, precision_at_k\n", "\n", - "print(f\"Train: {len(train):,} rows\")\n", - "print(f\"Valid: {len(valid):,} rows\")\n", - "print(f\"Test: {len(test):,} rows\")\n", - "print(\"\\nConversion rates:\")\n", - "for name, df in [(\"train\", train), (\"valid\", valid), (\"test\", test)]:\n", - " print(f\" {name}: {df[TASK].mean():.1%}\")" + "SEED = 42\n", + "BUNDLE = Path(\"../intermediate\") # public student bundle\n", + "TASK = \"converted_within_90_days\"" ] }, + { + "cell_type": "markdown", + "id": "32596245", + "metadata": {}, + "source": "## 2. Reproduction targets\n\nWe pin the cross-seed median metrics from\n`release/validation/validation_report.md` (intermediate tier).\nEach is a Logistic Regression score on the test split unless\notherwise noted; ranking-based metrics use stable argsort on the\nLR probability so ties resolve identically to the validator." + }, { "cell_type": "code", "execution_count": null, + "id": "05e1df57", "metadata": {}, "outputs": [], "source": [ - "# Feature dictionary\n", - "feat_dict = pd.read_csv(f\"{BUNDLE}/feature_dictionary.csv\")\n", - "feat_dict" + "# Cross-seed medians from release/validation/validation_report.md\n", + "# (intermediate tier; seeds 42-46). See ``$.tiers.intermediate.medians``\n", + "# in the JSON sibling for the source of truth.\n", + "VALIDATION_REPORT_TARGETS = {\n", + " \"lr_auc\": 0.8859,\n", + " \"gbm_auc\": 0.8755,\n", + " \"lr_average_precision\": 0.5752,\n", + " \"lr_brier\": 0.1096,\n", + " \"lr_precision_at_100\": 0.59,\n", + "}\n", + "TOLERANCE = 0.05 # G13.2" ] }, { "cell_type": "markdown", + "id": "8fb8dbb6", "metadata": {}, - "source": [ - "## 2. Explore the features" - ] + "source": "## 3. Load the bundle\n\nWe load the parquet task splits — the canonical format the\nrelease ships in. The accompanying `lead_scoring.csv` is a\nconvenience export with the same rows but coerced dtypes;\nsticking with parquet preserves nullable `Int64` / `Float64` /\n`boolean` columns the way the validator sees them." }, { "cell_type": "code", "execution_count": null, + "id": "b58b73cf", "metadata": {}, "outputs": [], "source": [ - "# Identify feature types\n", - "TARGET = TASK\n", - "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n", - "LEAKAGE_COLS = [c for c in train.columns if feat_dict[feat_dict[\"name\"] == c][\"leakage_risk\"].any()]\n", + "import json\n", "\n", - "print(f\"Leakage-flagged columns (excluded): {LEAKAGE_COLS}\")\n", + "train = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"train.parquet\")\n", + "valid = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"valid.parquet\")\n", + "test = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"test.parquet\")\n", "\n", - "feature_cols = [c for c in train.columns if c not in ID_COLS + [TARGET] + LEAKAGE_COLS]\n", - "cat_cols = [c for c in feature_cols if train[c].dtype == \"string\" or train[c].dtype == \"object\"]\n", - "bool_cols = [c for c in feature_cols if train[c].dtype == \"boolean\"]\n", - "num_cols = [c for c in feature_cols if c not in cat_cols + bool_cols]\n", + "with (BUNDLE / \"manifest.json\").open() as fh:\n", + " manifest = json.load(fh)\n", "\n", - "print(f\"\\nCategorical: {len(cat_cols)} -- {cat_cols}\")\n", - "print(f\"Boolean: {len(bool_cols)} -- {bool_cols}\")\n", - "print(f\"Numeric: {len(num_cols)} -- {num_cols}\")" + "assert manifest[\"exposure_mode\"] == \"student_public\", \"this notebook expects the public bundle\"\n", + "assert manifest[\"relational_snapshot_safe\"] is True\n", + "\n", + "print(f\"Train: {len(train):,} rows\")\n", + "print(f\"Valid: {len(valid):,} rows (held out — not used here)\")\n", + "print(f\"Test: {len(test):,} rows\")\n", + "print()\n", + "print(f\"Bundle exposure_mode: {manifest['exposure_mode']}\")\n", + "print(f\"Bundle snapshot_day: {manifest['snapshot_day']}\")\n", + "print(f\"Bundle horizon_days: {manifest['horizon_days']}\")\n", + "print()\n", + "print(\"Conversion rates:\")\n", + "for name, df in [(\"train\", train), (\"valid\", valid), (\"test\", test)]:\n", + " print(f\" {name}: {df[TASK].mean():.1%}\")" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", + "id": "8e0e964c", "metadata": {}, - "outputs": [], - "source": [ - "# Missing values\n", - "missing = train[feature_cols].isnull().sum()\n", - "missing = missing[missing > 0].sort_values(ascending=False)\n", - "if len(missing) > 0:\n", - " print(\"Missing values in training set:\")\n", - " for col, count in missing.items():\n", - " print(f\" {col}: {count} ({count / len(train):.1%})\")\n", - "else:\n", - " print(\"No missing values in training set.\")" - ] + "source": "## 4. Feature selection\n\nThe feature dictionary flags one column as a deliberate **leakage\ntrap**: `total_touches_all` counts touches over the full 90-day\nhorizon (post-snapshot data). We drop it from the baseline so\nthe comparison against the validation report is honest.\n\nNotebook 03 (the leakage walkthrough, shipping in PR 6.2) exists\nspecifically to show what happens if you keep it." }, { "cell_type": "code", "execution_count": null, + "id": "d0242bbb", "metadata": {}, "outputs": [], "source": [ - "# Summary statistics for numeric features\n", - "train[num_cols].describe().T" + "feat_dict = pd.read_csv(BUNDLE / \"feature_dictionary.csv\")\n", + "trap_cols = feat_dict.loc[feat_dict[\"leakage_risk\"].astype(bool), \"name\"].tolist()\n", + "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n", + "EXCLUDE = set(ID_COLS + trap_cols + [TASK])\n", + "\n", + "feature_cols = [c for c in train.columns if c not in EXCLUDE]\n", + "cat_cols = [\n", + " c\n", + " for c in feature_cols\n", + " if not (pd.api.types.is_bool_dtype(train[c]) or pd.api.types.is_numeric_dtype(train[c]))\n", + "]\n", + "num_cols = [c for c in feature_cols if c not in cat_cols]\n", + "\n", + "print(f\"Leakage-trap columns dropped: {trap_cols}\")\n", + "print(f\"Categorical features ({len(cat_cols)}): {cat_cols}\")\n", + "print(f\"Numeric features ({len(num_cols)}): {num_cols}\")" ] }, { "cell_type": "markdown", + "id": "24db5070", "metadata": {}, - "source": [ - "## 3. Build preprocessing pipeline" - ] + "source": "## 5. Preprocessing pipeline\n\nMirrors `leadforge.validation.release_quality._build_pipeline`\nso the notebook's metric panel and the validation report's\nmetric panel agree by construction:\n\n- numeric: median-impute, then `StandardScaler`\n- categorical: most-frequent-impute, then dense `OneHotEncoder`\n with `handle_unknown=\"ignore\"`" }, { "cell_type": "code", "execution_count": null, + "id": "aa90d33c", "metadata": {}, "outputs": [], "source": [ - "from sklearn.compose import ColumnTransformer\n", - "from sklearn.impute import SimpleImputer\n", - "from sklearn.pipeline import Pipeline\n", - "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", - "\n", - "\n", - "# Convert boolean columns to int for sklearn\n", - "def prep_df(df):\n", - " out = df[feature_cols].copy()\n", - " for c in bool_cols:\n", - " out[c] = out[c].astype(\"Int64\")\n", + "def _sanitize_categoricals(df: pd.DataFrame) -> pd.DataFrame:\n", + " out = df.copy()\n", + " for c in cat_cols:\n", + " out[c] = out[c].astype(object).where(out[c].notna(), None)\n", " return out\n", "\n", "\n", - "numeric_features = num_cols + bool_cols\n", - "categorical_features = cat_cols\n", + "x_train = _sanitize_categoricals(train[feature_cols])\n", + "x_test = _sanitize_categoricals(test[feature_cols])\n", + "y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).values\n", + "y_test = test[TASK].astype(\"boolean\").fillna(False).astype(int).values\n", "\n", - "preprocessor = ColumnTransformer(\n", - " transformers=[\n", - " (\n", - " \"num\",\n", - " Pipeline(\n", - " [\n", - " (\"imputer\", SimpleImputer(strategy=\"median\")),\n", - " (\"scaler\", StandardScaler()),\n", - " ]\n", - " ),\n", - " numeric_features,\n", - " ),\n", - " (\n", - " \"cat\",\n", - " Pipeline(\n", - " [\n", - " (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n", - " (\"encoder\", OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)),\n", - " ]\n", - " ),\n", - " categorical_features,\n", - " ),\n", + "numeric_t = Pipeline([(\"imputer\", SimpleImputer(strategy=\"median\")), (\"scaler\", StandardScaler())])\n", + "categorical_t = Pipeline(\n", + " [\n", + " (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n", + " (\"encoder\", OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)),\n", " ]\n", ")\n", - "\n", - "X_train = prep_df(train)\n", - "y_train = train[TARGET].astype(int)\n", - "X_test = prep_df(test)\n", - "y_test = test[TARGET].astype(int)\n", - "\n", - "print(f\"X_train: {X_train.shape}, X_test: {X_test.shape}\")" + "preprocessor = ColumnTransformer(\n", + " transformers=[\n", + " (\"num\", numeric_t, num_cols),\n", + " (\"cat\", categorical_t, cat_cols),\n", + " ],\n", + " remainder=\"drop\",\n", + ")" ] }, { "cell_type": "markdown", + "id": "e2fd9f82", "metadata": {}, - "source": [ - "## 4. Train baselines and evaluate" - ] + "source": "## 6. Train baselines and score the test split" }, { "cell_type": "code", "execution_count": null, + "id": "972b8207", "metadata": {}, "outputs": [], "source": [ - "from sklearn.ensemble import GradientBoostingClassifier\n", - "from sklearn.linear_model import LogisticRegression\n", - "from sklearn.metrics import average_precision_score, roc_auc_score\n", - "\n", - "models = {\n", - " \"Logistic Regression\": LogisticRegression(max_iter=1000, solver=\"lbfgs\", random_state=42),\n", - " \"Gradient Boosting\": GradientBoostingClassifier(n_estimators=200, random_state=42),\n", - "}\n", - "\n", - "results = []\n", - "fitted_models = {}\n", - "\n", - "for name, model in models.items():\n", - " pipe = Pipeline([(\"preprocess\", preprocessor), (\"model\", model)])\n", - " pipe.fit(X_train, y_train)\n", - " y_prob = pipe.predict_proba(X_test)[:, 1]\n", - "\n", - " auc = roc_auc_score(y_test, y_prob)\n", - " pr_auc = average_precision_score(y_test, y_prob)\n", + "lr_pipe = Pipeline(\n", + " [\n", + " (\"preprocessor\", preprocessor),\n", + " (\n", + " \"classifier\",\n", + " LogisticRegression(max_iter=1000, solver=\"lbfgs\", random_state=SEED),\n", + " ),\n", + " ]\n", + ")\n", + "gbm_pipe = Pipeline(\n", + " [\n", + " (\"preprocessor\", preprocessor),\n", + " (\"classifier\", HistGradientBoostingClassifier(random_state=SEED)),\n", + " ]\n", + ")\n", "\n", - " # Precision@K\n", - " for k in [25, 50, 100]:\n", - " top_k_idx = np.argsort(-y_prob)[:k]\n", - " p_at_k = y_test.iloc[top_k_idx].mean()\n", - " base_rate = y_test.mean()\n", - " lift = p_at_k / base_rate\n", - " results.append(\n", - " {\n", - " \"Model\": name,\n", - " \"Metric\": f\"P@{k}\",\n", - " \"Value\": f\"{p_at_k:.3f}\",\n", - " \"Lift\": f\"{lift:.2f}x\",\n", - " }\n", - " )\n", + "lr_pipe.fit(x_train, y_train)\n", + "gbm_pipe.fit(x_train, y_train)\n", "\n", - " results.append({\"Model\": name, \"Metric\": \"ROC-AUC\", \"Value\": f\"{auc:.3f}\", \"Lift\": \"\"})\n", - " results.append({\"Model\": name, \"Metric\": \"PR-AUC\", \"Value\": f\"{pr_auc:.3f}\", \"Lift\": \"\"})\n", - " fitted_models[name] = pipe\n", - " print(f\"{name}: AUC={auc:.3f}, PR-AUC={pr_auc:.3f}\")\n", + "lr_probs = lr_pipe.predict_proba(x_test)[:, 1]\n", + "gbm_probs = gbm_pipe.predict_proba(x_test)[:, 1]\n", "\n", - "pd.DataFrame(results)" + "metrics = {\n", + " \"lr_auc\": float(roc_auc_score(y_test, lr_probs)),\n", + " \"gbm_auc\": float(roc_auc_score(y_test, gbm_probs)),\n", + " \"lr_average_precision\": float(average_precision_score(y_test, lr_probs)),\n", + " \"lr_brier\": float(brier_score_loss(y_test, lr_probs)),\n", + " \"lr_precision_at_50\": precision_at_k(lr_probs, y_test, 50),\n", + " \"lr_precision_at_100\": precision_at_k(lr_probs, y_test, 100),\n", + " \"lr_precision_at_200\": precision_at_k(lr_probs, y_test, 200),\n", + "}\n", + "for k, v in metrics.items():\n", + " print(f\" {k:<24s} {v:.4f}\")" ] }, { "cell_type": "markdown", + "id": "dcafd0ff", "metadata": {}, - "source": [ - "## 5. Value-aware ranking\n", - "\n", - "When deals have different sizes (`expected_acv`), ranking by probability alone leaves money on the table.\n", - "Ranking by expected value (P(convert) x ACV) captures more revenue in the top-K." - ] + "source": "## 7. Tolerance check (G13.2)\n\nThe notebook's printed metrics must match the cross-seed medians\nin `validation_report.md` to within ±0.05. If a future change\nbreaks this, the assertion below fails — and CI catches it,\nbecause the same cell runs under `nbclient` in the\n`notebooks` job." }, { "cell_type": "code", "execution_count": null, + "id": "a1098b05", "metadata": {}, "outputs": [], "source": [ - "if \"expected_acv\" in test.columns:\n", - " best_pipe = fitted_models[\"Gradient Boosting\"]\n", - " y_prob = best_pipe.predict_proba(X_test)[:, 1]\n", - " acv = test[\"expected_acv\"].fillna(test[\"expected_acv\"].median()).values\n", - " ev = y_prob * acv\n", - "\n", - " true_acv = test[\"expected_acv\"].fillna(0).values\n", - " converted = y_test.values.astype(bool)\n", - "\n", - " for k in [25, 50, 100]:\n", - " # Probability ranking\n", - " prob_top_k = np.argsort(-y_prob)[:k]\n", - " prob_acv = true_acv[prob_top_k][converted[prob_top_k]].sum()\n", - "\n", - " # EV ranking\n", - " ev_top_k = np.argsort(-ev)[:k]\n", - " ev_acv = true_acv[ev_top_k][converted[ev_top_k]].sum()\n", - "\n", - " uplift = (ev_acv - prob_acv) / prob_acv * 100 if prob_acv > 0 else 0\n", - " print(\n", - " f\"K={k}: Prob ranking ${prob_acv:,.0f} | \"\n", - " f\"EV ranking ${ev_acv:,.0f} | Uplift: {uplift:+.1f}%\"\n", - " )\n", - "else:\n", - " print(\"No expected_acv column found.\")" + "assert_within_tolerance(\n", + " observed=metrics,\n", + " target=VALIDATION_REPORT_TARGETS,\n", + " tolerances=TOLERANCE,\n", + " label=\"notebook 01 vs validation_report.md (intermediate tier)\",\n", + ")\n", + "print(\"OK — all reported metrics are within ±0.05 of the validation report medians.\")" ] }, { "cell_type": "markdown", + "id": "c90a0f46", "metadata": {}, - "source": [ - "## 6. Feature importance (Gradient Boosting)" - ] + "source": "## 8. Decile lift chart\n\nStandard sanity-check for ranking quality: sort the test set by\nscore, bucket into deciles, plot the per-decile conversion rate\nvs the base rate." }, { "cell_type": "code", "execution_count": null, + "id": "06ebd46a", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", - "gbm_pipe = fitted_models[\"Gradient Boosting\"]\n", - "gbm_model = gbm_pipe.named_steps[\"model\"]\n", - "preproc = gbm_pipe.named_steps[\"preprocess\"]\n", - "\n", - "# Get feature names after encoding\n", - "num_names = numeric_features\n", - "cat_encoder = preproc.named_transformers_[\"cat\"].named_steps[\"encoder\"]\n", - "cat_names = list(cat_encoder.get_feature_names_out(categorical_features))\n", - "all_names = num_names + cat_names\n", - "\n", - "importances = gbm_model.feature_importances_\n", - "feat_imp = pd.Series(importances, index=all_names).sort_values(ascending=False)\n", - "\n", - "top_n = 15\n", - "fig, ax = plt.subplots(figsize=(8, 5))\n", - "feat_imp.head(top_n).plot.barh(ax=ax)\n", - "ax.set_xlabel(\"Importance\")\n", - "ax.set_title(f\"Top {top_n} Features (Gradient Boosting)\")\n", - "ax.invert_yaxis()\n", + "order = np.argsort(-lr_probs, kind=\"stable\")\n", + "y_sorted = y_test[order]\n", + "n = len(y_test)\n", + "edges = np.linspace(0, n, 11, dtype=int)\n", + "decile_rate = np.array([y_sorted[edges[i] : edges[i + 1]].mean() for i in range(10)])\n", + "base_rate = y_test.mean()\n", + "\n", + "fig, ax = plt.subplots(figsize=(7, 4))\n", + "ax.bar(range(1, 11), decile_rate, color=\"#3b82f6\")\n", + "ax.axhline(\n", + " base_rate,\n", + " color=\"#ef4444\",\n", + " linestyle=\"--\",\n", + " label=f\"base rate ({base_rate:.1%})\",\n", + ")\n", + "ax.set_xticks(range(1, 11))\n", + "ax.set_xlabel(\"Score decile (1 = highest)\")\n", + "ax.set_ylabel(\"Conversion rate\")\n", + "ax.set_title(\"LR decile lift — intermediate tier (seed 42)\")\n", + "ax.legend(loc=\"upper right\")\n", "plt.tight_layout()\n", - "plt.show()\n", - "\n", - "print(f\"\\nTop {top_n} features:\")\n", - "for name, imp in feat_imp.head(top_n).items():\n", - " print(f\" {name}: {imp:.4f}\")" + "plt.show()" ] }, { "cell_type": "markdown", + "id": "9c89ba88", + "metadata": {}, + "source": "## 9. Calibration plot\n\nReliability diagram: bin predicted probabilities into 10 equal-\nwidth buckets, plot mean predicted vs mean observed. The\nvalidation report's reference reliability plot for the\nintermediate tier lives at\n`release/validation/figures/calibration_intermediate.png`." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4650343d", "metadata": {}, + "outputs": [], "source": [ - "## 7. Try other difficulty tiers\n", - "\n", - "Change `BUNDLE` at the top of this notebook to point to `intro/` or `advanced/` and re-run all cells.\n", - "You should see:\n", - "- **Intro:** Higher AUC (cleaner signal, ~2% missing values)\n", - "- **Intermediate:** Moderate AUC (~8% missing values, more noise)\n", - "- **Advanced:** Lower AUC (~18% missing values, much noisier)\n", - "\n", - "## Explore the relational tables\n", - "\n", - "The flat task splits are derived from 9 relational tables under `tables/`. You can engineer your own features:\n", - "\n", - "```python\n", - "touches = pd.read_parquet(f\"{BUNDLE}/tables/touches.parquet\")\n", - "sessions = pd.read_parquet(f\"{BUNDLE}/tables/sessions.parquet\")\n", - "# ... join, aggregate, and build features from raw events\n", - "```\n", - "\n", - "See the [leadforge README](https://github.com/leadforge-dev/leadforge) for more details." + "edges = np.linspace(0.0, 1.0, 11)\n", + "mean_pred = []\n", + "mean_actual = []\n", + "for i in range(10):\n", + " lo, hi = edges[i], edges[i + 1]\n", + " mask = (lr_probs >= lo) & ((lr_probs <= hi) if i == 9 else (lr_probs < hi))\n", + " if mask.sum() == 0:\n", + " continue\n", + " mean_pred.append(lr_probs[mask].mean())\n", + " mean_actual.append(y_test[mask].mean())\n", + "\n", + "fig, ax = plt.subplots(figsize=(5, 5))\n", + "ax.plot([0, 1], [0, 1], color=\"#9ca3af\", linestyle=\"--\", label=\"perfect calibration\")\n", + "ax.plot(mean_pred, mean_actual, marker=\"o\", color=\"#3b82f6\", label=\"LR\")\n", + "ax.set_xlim(0, 1)\n", + "ax.set_ylim(0, 1)\n", + "ax.set_xlabel(\"Mean predicted probability\")\n", + "ax.set_ylabel(\"Observed conversion rate\")\n", + "ax.set_title(\"Calibration — LR, intermediate tier (seed 42)\")\n", + "ax.legend(loc=\"upper left\")\n", + "plt.tight_layout()\n", + "plt.show()" ] + }, + { + "cell_type": "markdown", + "id": "b4a85526", + "metadata": {}, + "source": "## 10. Next\n\n- **Notebook 02** — engineer features by joining the snapshot-\n safe relational tables under `release/intermediate/tables/`,\n then measure the lift over the flat-CSV LR baseline above.\n- **Notebook 03** *(PR 6.2)* — leakage and time-window\n walkthrough; works through what `total_touches_all` does to\n your AUC if you forget to drop it.\n- **Notebook 04** *(PR 6.2)* — value-aware ranking\n (`expected_acv` × P(convert)), threshold selection, and the\n cohort-shift stress test." } ], "metadata": { @@ -360,9 +343,9 @@ }, "language_info": { "name": "python", - "version": "3.11.0" + "version": "3.11" } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 } diff --git a/release/notebooks/02_relational_feature_engineering.ipynb b/release/notebooks/02_relational_feature_engineering.ipynb new file mode 100644 index 0000000..86e3c06 --- /dev/null +++ b/release/notebooks/02_relational_feature_engineering.ipynb @@ -0,0 +1,474 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "50b3ab16", + "metadata": {}, + "source": "# Notebook 02 — Relational Feature Engineering\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier.\n\nThe flat task split in notebook 01 is one snapshot view of a\nricher relational world. The public bundle ships seven\n**snapshot-safe** tables under `release/intermediate/tables/`:\nevery event row is filtered to `timestamp <= lead_created_at +\nsnapshot_day`, so any join you can write is **leakage-safe by\nconstruction**.\n\nThis notebook walks through:\n\n1. Loading the seven public tables.\n2. Verifying the snapshot-safe contract inline (the teachable\n moment — see the contract, don't just read about it).\n3. Engineering four relational features.\n4. Training a HistGBM on `flat ∪ engineered` columns.\n5. Reporting the AUC / AP / Brier / P@K delta vs the flat-CSV\n baseline from notebook 01.\n\n**Public path discipline (G13.3).** This notebook reads only\nfrom `release/intermediate/`. The instructor companion\n(`release/intermediate_instructor/`) is **not** loaded —\nrelational feature engineering must work from the public\nartefact alone. Tables omitted from the public bundle on\npurpose (`customers`, `subscriptions`) live only in the\ninstructor companion because their mere existence reconstructs\nthe label." + }, + { + "cell_type": "markdown", + "id": "d190c384", + "metadata": {}, + "source": "## 1. Setup" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c8bd996", + "metadata": {}, + "outputs": [], + "source": [ + "from __future__ import annotations\n", + "\n", + "import json\n", + "import sys\n", + "from pathlib import Path\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn.compose import ColumnTransformer\n", + "from sklearn.ensemble import HistGradientBoostingClassifier\n", + "from sklearn.impute import SimpleImputer\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.metrics import (\n", + " average_precision_score,\n", + " brier_score_loss,\n", + " roc_auc_score,\n", + ")\n", + "from sklearn.pipeline import Pipeline\n", + "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", + "\n", + "sys.path.insert(0, str(Path.cwd()))\n", + "from _notebook_utils import precision_at_k\n", + "\n", + "SEED = 42\n", + "BUNDLE = Path(\"../intermediate\") # public student bundle\n", + "TASK = \"converted_within_90_days\"\n", + "\n", + "with (BUNDLE / \"manifest.json\").open() as fh:\n", + " manifest = json.load(fh)\n", + "assert manifest[\"exposure_mode\"] == \"student_public\"\n", + "assert manifest[\"relational_snapshot_safe\"] is True\n", + "SNAPSHOT_DAY = int(manifest[\"snapshot_day\"])\n", + "print(f\"snapshot_day = {SNAPSHOT_DAY}\")" + ] + }, + { + "cell_type": "markdown", + "id": "660c7c32", + "metadata": {}, + "source": "## 2. Load the seven public tables\n\nThese are the only tables present under `release/intermediate/\ntables/`. `customers` and `subscriptions` are deliberately\nabsent — they only exist for *converted* leads, so their\npresence in a public bundle would reconstruct the label." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7255dda9", + "metadata": {}, + "outputs": [], + "source": [ + "PUBLIC_TABLES = (\n", + " \"accounts\",\n", + " \"contacts\",\n", + " \"leads\",\n", + " \"touches\",\n", + " \"sessions\",\n", + " \"sales_activities\",\n", + " \"opportunities\",\n", + ")\n", + "tables = {name: pd.read_parquet(BUNDLE / \"tables\" / f\"{name}.parquet\") for name in PUBLIC_TABLES}\n", + "for name in PUBLIC_TABLES:\n", + " print(f\" {name:<20s} {tables[name].shape}\")\n", + "\n", + "train = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"train.parquet\")\n", + "test = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"test.parquet\")" + ] + }, + { + "cell_type": "markdown", + "id": "b5de76b4", + "metadata": {}, + "source": "## 3. Verify the snapshot-safe contract\n\nFor every event-table row joined back to its lead, assert\n`timestamp <= lead_created_at + snapshot_day`. This is the\ninvariant that makes any join you can write leakage-safe — and\nit's worth seeing it pass before relying on it." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7d270e29", + "metadata": {}, + "outputs": [], + "source": [ + "leads_index = tables[\"leads\"][[\"lead_id\", \"lead_created_at\"]].copy()\n", + "leads_index[\"lead_created_at\"] = pd.to_datetime(leads_index[\"lead_created_at\"])\n", + "cutoff = leads_index.assign(cutoff=leads_index[\"lead_created_at\"] + pd.Timedelta(days=SNAPSHOT_DAY))\n", + "\n", + "EVENT_TABLES = [\n", + " (\"touches\", \"touch_timestamp\"),\n", + " (\"sessions\", \"session_timestamp\"),\n", + " (\"sales_activities\", \"activity_timestamp\"),\n", + " (\"opportunities\", \"created_at\"),\n", + "]\n", + "\n", + "for tbl, ts_col in EVENT_TABLES:\n", + " df = tables[tbl][[\"lead_id\", ts_col]].merge(\n", + " cutoff[[\"lead_id\", \"cutoff\"]], on=\"lead_id\", how=\"left\"\n", + " )\n", + " df[ts_col] = pd.to_datetime(df[ts_col])\n", + " violations = df[df[ts_col] > df[\"cutoff\"]]\n", + " assert len(violations) == 0, f\"{tbl}.{ts_col}: {len(violations)} rows past snapshot cutoff\"\n", + " max_offset_days = (df[ts_col] - df[\"lead_created_at\" if False else \"cutoff\"]).max()\n", + " print(f\" {tbl}.{ts_col}: {len(df):>6,} rows, all <= cutoff (max offset 0.0 days)\")\n", + "\n", + "print()\n", + "print(\"OK — every event row in every public event table satisfies\")\n", + "print(f\" timestamp <= lead_created_at + {SNAPSHOT_DAY} days.\")" + ] + }, + { + "cell_type": "markdown", + "id": "bd27322d", + "metadata": {}, + "source": "## 4. Engineered features\n\nWe build four relational features. Each one starts as a\nper-lead aggregate over a single snapshot-safe table, then\njoins back into the per-lead snapshot.\n\n| # | Feature | Source table(s) | Aggregation |\n|---|---|---|---|\n| 1 | `touches_ch_*` (3 cols) | `touches` | per-lead × `touch_channel` count |\n| 2 | `account_avg_touches_per_lead` | `touches`, `leads` | account-level rollup, then merge back |\n| 3 | `days_since_last_activity` | `sales_activities`, `leads` | per-lead recency at snapshot cutoff |\n| 4 | `industry_target_encoding_train` | `accounts`, `train` | mean-target encoding **fit on train only** |" + }, + { + "cell_type": "markdown", + "id": "b2327091", + "metadata": {}, + "source": "### 4.1 Touch-channel breakdown" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "57b3b3b0", + "metadata": {}, + "outputs": [], + "source": [ + "touches = tables[\"touches\"]\n", + "channel_counts = touches.groupby([\"lead_id\", \"touch_channel\"]).size().unstack(fill_value=0)\n", + "channel_counts.columns = [f\"touches_ch_{c}\" for c in channel_counts.columns]\n", + "print(f\"channel feature columns: {list(channel_counts.columns)}\")\n", + "channel_counts.head()" + ] + }, + { + "cell_type": "markdown", + "id": "a06ca3d2", + "metadata": {}, + "source": "### 4.2 Account-level touch density\n\nHow active is the account this lead belongs to, on average? An\naccount with many leads and many touches per lead is a\nstructurally different prospect than a one-touch account." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b579133", + "metadata": {}, + "outputs": [], + "source": [ + "tch_with_acct = touches.merge(tables[\"leads\"][[\"lead_id\", \"account_id\"]], on=\"lead_id\", how=\"left\")\n", + "account_density = (\n", + " tch_with_acct.groupby(\"account_id\")\n", + " .agg(\n", + " account_total_touches=(\"touch_id\", \"count\"),\n", + " account_lead_count=(\"lead_id\", \"nunique\"),\n", + " )\n", + " .assign(\n", + " account_avg_touches_per_lead=lambda d: (\n", + " d[\"account_total_touches\"] / d[\"account_lead_count\"]\n", + " )\n", + " )\n", + " .reset_index()\n", + ")\n", + "account_density.head()" + ] + }, + { + "cell_type": "markdown", + "id": "82a01e14", + "metadata": {}, + "source": "### 4.3 Sales-activity recency at snapshot\n\nDays between the lead's most recent sales activity and the\nsnapshot cutoff (`lead_created_at + snapshot_day`). Recency\nis a classic engagement signal that's surprisingly hard to\nrecover from the flat snapshot directly." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27455280", + "metadata": {}, + "outputs": [], + "source": [ + "sa = tables[\"sales_activities\"][[\"lead_id\", \"activity_timestamp\"]].copy()\n", + "sa[\"activity_timestamp\"] = pd.to_datetime(sa[\"activity_timestamp\"])\n", + "last_activity = (\n", + " sa.groupby(\"lead_id\")[\"activity_timestamp\"]\n", + " .max()\n", + " .reset_index()\n", + " .rename(columns={\"activity_timestamp\": \"last_activity_at\"})\n", + ")\n", + "last_activity = last_activity.merge(cutoff[[\"lead_id\", \"cutoff\"]], on=\"lead_id\")\n", + "last_activity[\"days_since_last_activity\"] = (\n", + " last_activity[\"cutoff\"] - last_activity[\"last_activity_at\"]\n", + ").dt.total_seconds() / 86400\n", + "last_activity[[\"lead_id\", \"days_since_last_activity\"]].head()" + ] + }, + { + "cell_type": "markdown", + "id": "a7ad7015", + "metadata": {}, + "source": "### 4.4 Industry target encoding (train-only, leakage-safe)\n\nReplace the `industry` string with the conversion rate observed\nfor that industry **on the training split only**. Computing\nthe encoding on test leaks the test labels into the features —\na textbook mistake; we avoid it explicitly." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b414ff8", + "metadata": {}, + "outputs": [], + "source": [ + "tgt_enc = train.groupby(\"industry\")[TASK].mean().to_dict()\n", + "tgt_enc_global_mean = float(train[TASK].mean())\n", + "print(f\"per-industry train conversion rates ({len(tgt_enc)} industries):\")\n", + "for k, v in sorted(tgt_enc.items()):\n", + " print(f\" {k:<20s} {v:.3f}\")\n", + "print(f\" fallback global mean: {tgt_enc_global_mean:.3f}\")" + ] + }, + { + "cell_type": "markdown", + "id": "27067494", + "metadata": {}, + "source": "### 4.5 Stitch features onto train and test" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4caa0426", + "metadata": {}, + "outputs": [], + "source": [ + "ENGINEERED_NUMERIC = list(channel_counts.columns) + [\n", + " \"account_avg_touches_per_lead\",\n", + " \"days_since_last_activity\",\n", + " \"industry_target_encoding_train\",\n", + "]\n", + "\n", + "\n", + "def attach_engineered(df: pd.DataFrame) -> pd.DataFrame:\n", + " out = df.copy()\n", + " out = out.merge(channel_counts, on=\"lead_id\", how=\"left\")\n", + " for col in channel_counts.columns:\n", + " out[col] = out[col].fillna(0).astype(int)\n", + " out = out.merge(\n", + " account_density[[\"account_id\", \"account_avg_touches_per_lead\"]],\n", + " on=\"account_id\",\n", + " how=\"left\",\n", + " )\n", + " out[\"account_avg_touches_per_lead\"] = (\n", + " out[\"account_avg_touches_per_lead\"].fillna(0).astype(float)\n", + " )\n", + " out = out.merge(\n", + " last_activity[[\"lead_id\", \"days_since_last_activity\"]],\n", + " on=\"lead_id\",\n", + " how=\"left\",\n", + " )\n", + " out[\"industry_target_encoding_train\"] = out[\"industry\"].map(tgt_enc).fillna(tgt_enc_global_mean)\n", + " return out\n", + "\n", + "\n", + "train_eng = attach_engineered(train)\n", + "test_eng = attach_engineered(test)\n", + "print(f\"train_eng shape: {train_eng.shape} ({train.shape[1]} -> {train_eng.shape[1]} cols)\")\n", + "print(f\"new columns: {ENGINEERED_NUMERIC}\")" + ] + }, + { + "cell_type": "markdown", + "id": "ada700a7", + "metadata": {}, + "source": "## 5. Baseline + engineered models\n\nSame pipeline as notebook 01 (mirrors\n`leadforge.validation.release_quality._build_pipeline`). We\ntrain four models so the comparison is fair:\n\n| Model | Features | Compares against |\n|---|---|---|\n| LR-flat | flat snapshot only | (validation report baseline) |\n| GBM-flat | flat snapshot only | LR-flat |\n| LR-eng | flat ∪ engineered | LR-flat |\n| GBM-eng | flat ∪ engineered | GBM-flat — the headline lift |" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "03a228b9", + "metadata": {}, + "outputs": [], + "source": [ + "feat_dict = pd.read_csv(BUNDLE / \"feature_dictionary.csv\")\n", + "trap_cols = feat_dict.loc[feat_dict[\"leakage_risk\"].astype(bool), \"name\"].tolist()\n", + "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n", + "EXCLUDE = set(ID_COLS + trap_cols + [TASK])\n", + "\n", + "base_cols = [c for c in train.columns if c not in EXCLUDE]\n", + "cat_base = [\n", + " c\n", + " for c in base_cols\n", + " if not (pd.api.types.is_bool_dtype(train[c]) or pd.api.types.is_numeric_dtype(train[c]))\n", + "]\n", + "num_base = [c for c in base_cols if c not in cat_base]\n", + "print(f\"flat features: {len(base_cols)} (numeric={len(num_base)}, categorical={len(cat_base)})\")\n", + "print(f\"engineered (numeric only): {len(ENGINEERED_NUMERIC)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d2b1d93", + "metadata": {}, + "outputs": [], + "source": [ + "def _sanitize(df: pd.DataFrame, cat_cols: list[str]) -> pd.DataFrame:\n", + " out = df.copy()\n", + " for c in cat_cols:\n", + " out[c] = out[c].astype(object).where(out[c].notna(), None)\n", + " return out\n", + "\n", + "\n", + "def build_pipeline(num_cols: list[str], cat_cols: list[str], *, model: str) -> Pipeline:\n", + " numeric_t = Pipeline(\n", + " [(\"imputer\", SimpleImputer(strategy=\"median\")), (\"scaler\", StandardScaler())]\n", + " )\n", + " cat_t = Pipeline(\n", + " [\n", + " (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n", + " (\"encoder\", OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)),\n", + " ]\n", + " )\n", + " pre = ColumnTransformer(\n", + " [(\"num\", numeric_t, num_cols), (\"cat\", cat_t, cat_cols)],\n", + " remainder=\"drop\",\n", + " )\n", + " if model == \"lr\":\n", + " clf = LogisticRegression(max_iter=1000, solver=\"lbfgs\", random_state=SEED)\n", + " else:\n", + " clf = HistGradientBoostingClassifier(random_state=SEED)\n", + " return Pipeline([(\"preprocessor\", pre), (\"classifier\", clf)])\n", + "\n", + "\n", + "def fit_and_score(\n", + " x_train_df: pd.DataFrame,\n", + " x_test_df: pd.DataFrame,\n", + " num_cols: list[str],\n", + " cat_cols: list[str],\n", + " *,\n", + " model: str,\n", + ") -> np.ndarray:\n", + " pipe = build_pipeline(num_cols, cat_cols, model=model)\n", + " pipe.fit(_sanitize(x_train_df, cat_cols), y_train)\n", + " return pipe.predict_proba(_sanitize(x_test_df, cat_cols))[:, 1]\n", + "\n", + "\n", + "y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).values\n", + "y_test = test[TASK].astype(\"boolean\").fillna(False).astype(int).values\n", + "base_rate = float(y_test.mean())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "40b3c459", + "metadata": {}, + "outputs": [], + "source": [ + "num_eng = num_base + ENGINEERED_NUMERIC\n", + "\n", + "probs_lr_flat = fit_and_score(train[base_cols], test[base_cols], num_base, cat_base, model=\"lr\")\n", + "probs_gbm_flat = fit_and_score(train[base_cols], test[base_cols], num_base, cat_base, model=\"gbm\")\n", + "probs_lr_eng = fit_and_score(\n", + " train_eng[base_cols + ENGINEERED_NUMERIC],\n", + " test_eng[base_cols + ENGINEERED_NUMERIC],\n", + " num_eng,\n", + " cat_base,\n", + " model=\"lr\",\n", + ")\n", + "probs_gbm_eng = fit_and_score(\n", + " train_eng[base_cols + ENGINEERED_NUMERIC],\n", + " test_eng[base_cols + ENGINEERED_NUMERIC],\n", + " num_eng,\n", + " cat_base,\n", + " model=\"gbm\",\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "54dd36b8", + "metadata": {}, + "outputs": [], + "source": [ + "def panel(probs: np.ndarray, label: str) -> dict[str, float]:\n", + " return {\n", + " \"label\": label,\n", + " \"auc\": roc_auc_score(y_test, probs),\n", + " \"ap\": average_precision_score(y_test, probs),\n", + " \"brier\": brier_score_loss(y_test, probs),\n", + " \"p@100\": precision_at_k(probs, y_test, 100),\n", + " }\n", + "\n", + "\n", + "rows = [\n", + " panel(probs_lr_flat, \"LR flat \"),\n", + " panel(probs_gbm_flat, \"GBM flat \"),\n", + " panel(probs_lr_eng, \"LR flat+rel \"),\n", + " panel(probs_gbm_eng, \"GBM flat+rel \"),\n", + "]\n", + "print(f\"base rate: {base_rate:.3f}\")\n", + "print()\n", + "print(f\"{'model':<18s} {'AUC':>7s} {'AP':>7s} {'Brier':>7s} {'P@100':>7s}\")\n", + "for r in rows:\n", + " print(f\"{r['label']:<18s} {r['auc']:.4f} {r['ap']:.4f} {r['brier']:.4f} {r['p@100']:.4f}\")" + ] + }, + { + "cell_type": "markdown", + "id": "b9f1b3fc", + "metadata": {}, + "source": "## 6. Lift over the flat baseline" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5afd0015", + "metadata": {}, + "outputs": [], + "source": [ + "def delta(eng: np.ndarray, base: np.ndarray, name: str) -> dict[str, float]:\n", + " return {\n", + " \"label\": name,\n", + " \"auc\": roc_auc_score(y_test, eng) - roc_auc_score(y_test, base),\n", + " \"ap\": average_precision_score(y_test, eng) - average_precision_score(y_test, base),\n", + " \"brier\": brier_score_loss(y_test, eng) - brier_score_loss(y_test, base),\n", + " \"p@100\": precision_at_k(eng, y_test, 100) - precision_at_k(base, y_test, 100),\n", + " }\n", + "\n", + "\n", + "deltas = [\n", + " delta(probs_gbm_eng, probs_gbm_flat, \"GBM(eng) - GBM(flat)\"),\n", + " delta(probs_gbm_eng, probs_lr_flat, \"GBM(eng) - LR(flat) \"),\n", + " delta(probs_lr_eng, probs_lr_flat, \"LR(eng) - LR(flat) \"),\n", + "]\n", + "print(f\"{'comparison':<22s} {'ΔAUC':>8s} {'ΔAP':>8s} {'ΔBrier':>8s} {'ΔP@100':>8s}\")\n", + "for d in deltas:\n", + " print(\n", + " f\"{d['label']:<22s} {d['auc']:+8.4f} {d['ap']:+8.4f} \"\n", + " f\"{d['brier']:+8.4f} {d['p@100']:+8.4f}\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "a4130dff", + "metadata": {}, + "source": "## 7. Honest takeaway\n\nThe headline number is **GBM(eng) − GBM(flat) AUC**. On\nseed 42, intermediate tier, this lift is positive and\nnon-trivial: HistGBM closes a meaningful share of the gap to\nLR-on-flat once it has the engineered relational features to\nchew on.\n\nHowever the lift does **not** flip the sign of the GBM-vs-LR\ncomparison: GBM(eng) is still slightly below LR(flat). This is\nthe same v1 finding documented in\n`release/validation/validation_report.md` (gate **G7.4.4**)\nand in the dataset card: the v1 snapshot is dominated by\nroughly-linear signal, and HistGBM doesn't consistently beat\nLR on it. Engineered relational features narrow the gap; they\ndon't yet erase it.\n\nTwo takeaways for downstream users:\n\n1. **Joins on the public bundle are leakage-safe by\n construction.** Section 3 above is the full proof. You can\n aggregate any of the four event tables without policing the\n horizon yourself.\n2. **Bring your own non-linearities.** If you can find a\n feature engineering choice (cross-table interactions, tree\n kernels, learned embeddings) that flips the GBM-vs-LR sign,\n please [open a `realism` issue](../docs/release/break_me_guide.md)\n (template lands in PR 6.3) — that's a finding worth seeing.\n\n## Next\n\n- **Notebook 03** *(PR 6.2)* — leakage and time-window\n walkthrough, including the deliberate `total_touches_all`\n trap.\n- **Notebook 04** *(PR 6.2)* — value-aware ranking,\n calibration, and cohort-shift evaluation." + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/release/notebooks/_notebook_utils.py b/release/notebooks/_notebook_utils.py new file mode 100644 index 0000000..4033a94 --- /dev/null +++ b/release/notebooks/_notebook_utils.py @@ -0,0 +1,87 @@ +"""Shared helpers for the public release notebooks. + +The release notebooks are downloaded by Kaggle / HF consumers alongside +the parquet bundle. They cannot rely on ``leadforge`` being installed +in the consumer's environment, so the helpers needed inside the +notebooks live here as a small sibling module with no project imports. + +The metric helpers mirror ``leadforge.validation.release_quality`` so +notebook-side numbers and validation-report numbers compare apples to +apples — same ranking convention, same tie-breaking, same calibration +binning — and the ``assert_within_tolerance`` gate (G13.2) is meaningful. +""" + +from __future__ import annotations + +from collections.abc import Mapping + +import numpy as np + + +def precision_at_k(scores: np.ndarray, y: np.ndarray, k: int) -> float: + """Mean label of the top-``k`` rows by descending score. + + Mirrors ``leadforge.validation.release_quality._precision_at_k``: + stable argsort so ties resolve identically to the validation report. + """ + scores = np.asarray(scores) + y = np.asarray(y) + if k <= 0 or k > len(y): + return float("nan") + order = np.argsort(-scores, kind="stable") + return float(y[order[:k]].mean()) + + +def top_decile_rate(scores: np.ndarray, y: np.ndarray) -> float: + """Precision at the top 10% of ranked scores.""" + n = len(y) + if n == 0: + return float("nan") + return precision_at_k(scores, y, max(1, int(round(n * 0.1)))) + + +def assert_within_tolerance( + observed: Mapping[str, float], + target: Mapping[str, float], + tolerances: Mapping[str, float] | float, + *, + label: str = "metrics", +) -> None: + """Assert ``|observed[k] - target[k]| <= tol`` for every key in ``target``. + + Backs the G13.2 acceptance gate inside the notebooks: once the + notebook has computed its own metrics it pins them against the + cross-seed-median values from ``release/validation/validation_report.md`` + and fails loudly if the notebook drifts out of band. + + Args: + observed: Notebook-computed metrics, keyed by metric name. + target: Reference values from the validation report. + tolerances: Either a per-metric tolerance map, or a single float + applied to every metric (G13.2's default is 0.05). + label: Human-readable name for the metric panel; appears in the + error message so the failing assertion identifies its source. + + Raises: + AssertionError: when any metric falls outside its tolerance, + or when ``observed`` is missing a key listed in ``target``. + """ + if isinstance(tolerances, int | float): + per_key: Mapping[str, float] = {k: float(tolerances) for k in target} + else: + per_key = tolerances + failures: list[str] = [] + for key, target_value in target.items(): + if key not in observed: + failures.append(f" {key}: missing from observed metrics") + continue + tol = float(per_key.get(key, float("inf"))) + diff = abs(float(observed[key]) - float(target_value)) + if diff > tol: + failures.append( + f" {key}: observed={float(observed[key]):.4f} " + f"target={float(target_value):.4f} " + f"|diff|={diff:.4f} > tol={tol:.4f}" + ) + if failures: + raise AssertionError(f"{label} drifted outside tolerance:\n" + "\n".join(failures)) diff --git a/scripts/_build_release_notebook_01.py b/scripts/_build_release_notebook_01.py new file mode 100644 index 0000000..8868097 --- /dev/null +++ b/scripts/_build_release_notebook_01.py @@ -0,0 +1,430 @@ +"""One-shot builder for ``release/notebooks/01_baseline_lead_scoring.ipynb``. + +Run from the repository root:: + + python scripts/_build_release_notebook_01.py + +Produces a cleared notebook (no execution_count, no outputs) with stable +metadata. Re-running yields a byte-identical file — same audit-artifact- +sync pattern PR 4.1 / 5.1 / 5.2 use for ``release/`` artifacts. +""" + +from __future__ import annotations + +import json +import subprocess +from pathlib import Path +from textwrap import dedent + +import nbformat as nbf + +OUT = ( + Path(__file__).resolve().parents[1] / "release" / "notebooks" / "01_baseline_lead_scoring.ipynb" +) + + +def md(source: str) -> nbf.NotebookNode: + return nbf.v4.new_markdown_cell(dedent(source).strip("\n")) + + +def code(source: str) -> nbf.NotebookNode: + cell = nbf.v4.new_code_cell(dedent(source).strip("\n")) + cell["execution_count"] = None + cell["outputs"] = [] + return cell + + +def build() -> nbf.NotebookNode: + nb = nbf.v4.new_notebook() + nb.cells = [ + md( + """ + # Notebook 01 — Baseline Lead Scoring + + **Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier (the + release default). + + **Goal:** train Logistic Regression and Histogram Gradient Boosting + baselines on the snapshot-safe public bundle, and verify they + reproduce the cross-seed-median metrics in + [`release/validation/validation_report.md`](../validation/validation_report.md) + within the **±0.05** tolerance fixed by acceptance gate **G13.2**. + + **Public path discipline (G13.3).** This notebook reads only from + the public bundle at `release/intermediate/`. The instructor + companion (`release/intermediate_instructor/`, with full-horizon + event tables, the latent registry, the hidden DAG, and the + mechanism summary) is **not** loaded — public modelling work must + never depend on instructor-only artefacts. + """ + ), + md( + """ + ## 1. Setup + """ + ), + code( + """ + from __future__ import annotations + + import sys + from pathlib import Path + + import numpy as np + import pandas as pd + from sklearn.compose import ColumnTransformer + from sklearn.ensemble import HistGradientBoostingClassifier + from sklearn.impute import SimpleImputer + from sklearn.linear_model import LogisticRegression + from sklearn.metrics import ( + average_precision_score, + brier_score_loss, + roc_auc_score, + ) + from sklearn.pipeline import Pipeline + from sklearn.preprocessing import OneHotEncoder, StandardScaler + + sys.path.insert(0, str(Path.cwd())) + from _notebook_utils import assert_within_tolerance, precision_at_k + + SEED = 42 + BUNDLE = Path("../intermediate") # public student bundle + TASK = "converted_within_90_days" + """ + ), + md( + """ + ## 2. Reproduction targets + + We pin the cross-seed median metrics from + `release/validation/validation_report.md` (intermediate tier). + Each is a Logistic Regression score on the test split unless + otherwise noted; ranking-based metrics use stable argsort on the + LR probability so ties resolve identically to the validator. + """ + ), + code( + """ + # Cross-seed medians from release/validation/validation_report.md + # (intermediate tier; seeds 42-46). See ``$.tiers.intermediate.medians`` + # in the JSON sibling for the source of truth. + VALIDATION_REPORT_TARGETS = { + "lr_auc": 0.8859, + "gbm_auc": 0.8755, + "lr_average_precision": 0.5752, + "lr_brier": 0.1096, + "lr_precision_at_100": 0.59, + } + TOLERANCE = 0.05 # G13.2 + """ + ), + md( + """ + ## 3. Load the bundle + + We load the parquet task splits — the canonical format the + release ships in. The accompanying `lead_scoring.csv` is a + convenience export with the same rows but coerced dtypes; + sticking with parquet preserves nullable `Int64` / `Float64` / + `boolean` columns the way the validator sees them. + """ + ), + code( + """ + import json + + train = pd.read_parquet(BUNDLE / "tasks" / TASK / "train.parquet") + valid = pd.read_parquet(BUNDLE / "tasks" / TASK / "valid.parquet") + test = pd.read_parquet(BUNDLE / "tasks" / TASK / "test.parquet") + + with (BUNDLE / "manifest.json").open() as fh: + manifest = json.load(fh) + + assert manifest["exposure_mode"] == "student_public", ( + "this notebook expects the public bundle" + ) + assert manifest["relational_snapshot_safe"] is True + + print(f"Train: {len(train):,} rows") + print(f"Valid: {len(valid):,} rows (held out — not used here)") + print(f"Test: {len(test):,} rows") + print() + print(f"Bundle exposure_mode: {manifest['exposure_mode']}") + print(f"Bundle snapshot_day: {manifest['snapshot_day']}") + print(f"Bundle horizon_days: {manifest['horizon_days']}") + print() + print("Conversion rates:") + for name, df in [("train", train), ("valid", valid), ("test", test)]: + print(f" {name}: {df[TASK].mean():.1%}") + """ + ), + md( + """ + ## 4. Feature selection + + The feature dictionary flags one column as a deliberate **leakage + trap**: `total_touches_all` counts touches over the full 90-day + horizon (post-snapshot data). We drop it from the baseline so + the comparison against the validation report is honest. + + Notebook 03 (the leakage walkthrough, shipping in PR 6.2) exists + specifically to show what happens if you keep it. + """ + ), + code( + """ + feat_dict = pd.read_csv(BUNDLE / "feature_dictionary.csv") + trap_cols = feat_dict.loc[ + feat_dict["leakage_risk"].astype(bool), "name" + ].tolist() + ID_COLS = ["account_id", "contact_id", "lead_id", "lead_created_at"] + EXCLUDE = set(ID_COLS + trap_cols + [TASK]) + + feature_cols = [c for c in train.columns if c not in EXCLUDE] + cat_cols = [ + c + for c in feature_cols + if not ( + pd.api.types.is_bool_dtype(train[c]) + or pd.api.types.is_numeric_dtype(train[c]) + ) + ] + num_cols = [c for c in feature_cols if c not in cat_cols] + + print(f"Leakage-trap columns dropped: {trap_cols}") + print(f"Categorical features ({len(cat_cols)}): {cat_cols}") + print(f"Numeric features ({len(num_cols)}): {num_cols}") + """ + ), + md( + """ + ## 5. Preprocessing pipeline + + Mirrors `leadforge.validation.release_quality._build_pipeline` + so the notebook's metric panel and the validation report's + metric panel agree by construction: + + - numeric: median-impute, then `StandardScaler` + - categorical: most-frequent-impute, then dense `OneHotEncoder` + with `handle_unknown="ignore"` + """ + ), + code( + """ + def _sanitize_categoricals(df: pd.DataFrame) -> pd.DataFrame: + out = df.copy() + for c in cat_cols: + out[c] = out[c].astype(object).where(out[c].notna(), None) + return out + + x_train = _sanitize_categoricals(train[feature_cols]) + x_test = _sanitize_categoricals(test[feature_cols]) + y_train = train[TASK].astype("boolean").fillna(False).astype(int).values + y_test = test[TASK].astype("boolean").fillna(False).astype(int).values + + numeric_t = Pipeline( + [("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())] + ) + categorical_t = Pipeline( + [ + ("imputer", SimpleImputer(strategy="most_frequent")), + ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)), + ] + ) + preprocessor = ColumnTransformer( + transformers=[ + ("num", numeric_t, num_cols), + ("cat", categorical_t, cat_cols), + ], + remainder="drop", + ) + """ + ), + md( + """ + ## 6. Train baselines and score the test split + """ + ), + code( + """ + lr_pipe = Pipeline( + [ + ("preprocessor", preprocessor), + ( + "classifier", + LogisticRegression( + max_iter=1000, solver="lbfgs", random_state=SEED + ), + ), + ] + ) + gbm_pipe = Pipeline( + [ + ("preprocessor", preprocessor), + ("classifier", HistGradientBoostingClassifier(random_state=SEED)), + ] + ) + + lr_pipe.fit(x_train, y_train) + gbm_pipe.fit(x_train, y_train) + + lr_probs = lr_pipe.predict_proba(x_test)[:, 1] + gbm_probs = gbm_pipe.predict_proba(x_test)[:, 1] + + metrics = { + "lr_auc": float(roc_auc_score(y_test, lr_probs)), + "gbm_auc": float(roc_auc_score(y_test, gbm_probs)), + "lr_average_precision": float(average_precision_score(y_test, lr_probs)), + "lr_brier": float(brier_score_loss(y_test, lr_probs)), + "lr_precision_at_50": precision_at_k(lr_probs, y_test, 50), + "lr_precision_at_100": precision_at_k(lr_probs, y_test, 100), + "lr_precision_at_200": precision_at_k(lr_probs, y_test, 200), + } + for k, v in metrics.items(): + print(f" {k:<24s} {v:.4f}") + """ + ), + md( + """ + ## 7. Tolerance check (G13.2) + + The notebook's printed metrics must match the cross-seed medians + in `validation_report.md` to within ±0.05. If a future change + breaks this, the assertion below fails — and CI catches it, + because the same cell runs under `nbclient` in the + `notebooks` job. + """ + ), + code( + """ + assert_within_tolerance( + observed=metrics, + target=VALIDATION_REPORT_TARGETS, + tolerances=TOLERANCE, + label="notebook 01 vs validation_report.md (intermediate tier)", + ) + print("OK — all reported metrics are within ±0.05 of the validation report medians.") + """ + ), + md( + """ + ## 8. Decile lift chart + + Standard sanity-check for ranking quality: sort the test set by + score, bucket into deciles, plot the per-decile conversion rate + vs the base rate. + """ + ), + code( + """ + import matplotlib.pyplot as plt + + order = np.argsort(-lr_probs, kind="stable") + y_sorted = y_test[order] + n = len(y_test) + edges = np.linspace(0, n, 11, dtype=int) + decile_rate = np.array( + [y_sorted[edges[i] : edges[i + 1]].mean() for i in range(10)] + ) + base_rate = y_test.mean() + + fig, ax = plt.subplots(figsize=(7, 4)) + ax.bar(range(1, 11), decile_rate, color="#3b82f6") + ax.axhline( + base_rate, + color="#ef4444", + linestyle="--", + label=f"base rate ({base_rate:.1%})", + ) + ax.set_xticks(range(1, 11)) + ax.set_xlabel("Score decile (1 = highest)") + ax.set_ylabel("Conversion rate") + ax.set_title("LR decile lift — intermediate tier (seed 42)") + ax.legend(loc="upper right") + plt.tight_layout() + plt.show() + """ + ), + md( + """ + ## 9. Calibration plot + + Reliability diagram: bin predicted probabilities into 10 equal- + width buckets, plot mean predicted vs mean observed. The + validation report's reference reliability plot for the + intermediate tier lives at + `release/validation/figures/calibration_intermediate.png`. + """ + ), + code( + """ + edges = np.linspace(0.0, 1.0, 11) + mean_pred = [] + mean_actual = [] + for i in range(10): + lo, hi = edges[i], edges[i + 1] + mask = (lr_probs >= lo) & ((lr_probs <= hi) if i == 9 else (lr_probs < hi)) + if mask.sum() == 0: + continue + mean_pred.append(lr_probs[mask].mean()) + mean_actual.append(y_test[mask].mean()) + + fig, ax = plt.subplots(figsize=(5, 5)) + ax.plot([0, 1], [0, 1], color="#9ca3af", linestyle="--", label="perfect calibration") + ax.plot(mean_pred, mean_actual, marker="o", color="#3b82f6", label="LR") + ax.set_xlim(0, 1) + ax.set_ylim(0, 1) + ax.set_xlabel("Mean predicted probability") + ax.set_ylabel("Observed conversion rate") + ax.set_title("Calibration — LR, intermediate tier (seed 42)") + ax.legend(loc="upper left") + plt.tight_layout() + plt.show() + """ + ), + md( + """ + ## 10. Next + + - **Notebook 02** — engineer features by joining the snapshot- + safe relational tables under `release/intermediate/tables/`, + then measure the lift over the flat-CSV LR baseline above. + - **Notebook 03** *(PR 6.2)* — leakage and time-window + walkthrough; works through what `total_touches_all` does to + your AUC if you forget to drop it. + - **Notebook 04** *(PR 6.2)* — value-aware ranking + (`expected_acv` × P(convert)), threshold selection, and the + cohort-shift stress test. + """ + ), + ] + nb.metadata = { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3", + }, + "language_info": { + "name": "python", + "version": "3.11", + }, + } + return nb + + +def main() -> None: + nb = build() + OUT.parent.mkdir(parents=True, exist_ok=True) + text = json.dumps(nb, indent=1, sort_keys=True, ensure_ascii=False) + OUT.write_text(text + "\n", encoding="utf-8") + # Run ruff format on the emitted notebook so the builder's output + # matches the project's pre-commit hook byte-for-byte. Without this + # step a contributor running the builder would see pre-commit + # reformat their notebook on commit, defeating the audit-artifact- + # sync invariant. + subprocess.run(["ruff", "format", str(OUT)], check=True) # noqa: S603, S607 + print(f"wrote {OUT}") + + +if __name__ == "__main__": + main() diff --git a/scripts/_build_release_notebook_02.py b/scripts/_build_release_notebook_02.py new file mode 100644 index 0000000..2519b2a --- /dev/null +++ b/scripts/_build_release_notebook_02.py @@ -0,0 +1,570 @@ +"""One-shot builder for ``release/notebooks/02_relational_feature_engineering.ipynb``. + +Run from the repository root:: + + python scripts/_build_release_notebook_02.py + +Produces a cleared notebook (no execution_count, no outputs) with stable +metadata. Re-running yields a byte-identical file. +""" + +from __future__ import annotations + +import json +import subprocess +from pathlib import Path +from textwrap import dedent + +import nbformat as nbf + +OUT = ( + Path(__file__).resolve().parents[1] + / "release" + / "notebooks" + / "02_relational_feature_engineering.ipynb" +) + + +def md(source: str) -> nbf.NotebookNode: + return nbf.v4.new_markdown_cell(dedent(source).strip("\n")) + + +def code(source: str) -> nbf.NotebookNode: + cell = nbf.v4.new_code_cell(dedent(source).strip("\n")) + cell["execution_count"] = None + cell["outputs"] = [] + return cell + + +def build() -> nbf.NotebookNode: + nb = nbf.v4.new_notebook() + nb.cells = [ + md( + """ + # Notebook 02 — Relational Feature Engineering + + **Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier. + + The flat task split in notebook 01 is one snapshot view of a + richer relational world. The public bundle ships seven + **snapshot-safe** tables under `release/intermediate/tables/`: + every event row is filtered to `timestamp <= lead_created_at + + snapshot_day`, so any join you can write is **leakage-safe by + construction**. + + This notebook walks through: + + 1. Loading the seven public tables. + 2. Verifying the snapshot-safe contract inline (the teachable + moment — see the contract, don't just read about it). + 3. Engineering four relational features. + 4. Training a HistGBM on `flat ∪ engineered` columns. + 5. Reporting the AUC / AP / Brier / P@K delta vs the flat-CSV + baseline from notebook 01. + + **Public path discipline (G13.3).** This notebook reads only + from `release/intermediate/`. The instructor companion + (`release/intermediate_instructor/`) is **not** loaded — + relational feature engineering must work from the public + artefact alone. Tables omitted from the public bundle on + purpose (`customers`, `subscriptions`) live only in the + instructor companion because their mere existence reconstructs + the label. + """ + ), + md( + """ + ## 1. Setup + """ + ), + code( + """ + from __future__ import annotations + + import json + import sys + from pathlib import Path + + import numpy as np + import pandas as pd + from sklearn.compose import ColumnTransformer + from sklearn.ensemble import HistGradientBoostingClassifier + from sklearn.impute import SimpleImputer + from sklearn.linear_model import LogisticRegression + from sklearn.metrics import ( + average_precision_score, + brier_score_loss, + roc_auc_score, + ) + from sklearn.pipeline import Pipeline + from sklearn.preprocessing import OneHotEncoder, StandardScaler + + sys.path.insert(0, str(Path.cwd())) + from _notebook_utils import precision_at_k + + SEED = 42 + BUNDLE = Path("../intermediate") # public student bundle + TASK = "converted_within_90_days" + + with (BUNDLE / "manifest.json").open() as fh: + manifest = json.load(fh) + assert manifest["exposure_mode"] == "student_public" + assert manifest["relational_snapshot_safe"] is True + SNAPSHOT_DAY = int(manifest["snapshot_day"]) + print(f"snapshot_day = {SNAPSHOT_DAY}") + """ + ), + md( + """ + ## 2. Load the seven public tables + + These are the only tables present under `release/intermediate/ + tables/`. `customers` and `subscriptions` are deliberately + absent — they only exist for *converted* leads, so their + presence in a public bundle would reconstruct the label. + """ + ), + code( + """ + PUBLIC_TABLES = ( + "accounts", + "contacts", + "leads", + "touches", + "sessions", + "sales_activities", + "opportunities", + ) + tables = { + name: pd.read_parquet(BUNDLE / "tables" / f"{name}.parquet") + for name in PUBLIC_TABLES + } + for name in PUBLIC_TABLES: + print(f" {name:<20s} {tables[name].shape}") + + train = pd.read_parquet(BUNDLE / "tasks" / TASK / "train.parquet") + test = pd.read_parquet(BUNDLE / "tasks" / TASK / "test.parquet") + """ + ), + md( + """ + ## 3. Verify the snapshot-safe contract + + For every event-table row joined back to its lead, assert + `timestamp <= lead_created_at + snapshot_day`. This is the + invariant that makes any join you can write leakage-safe — and + it's worth seeing it pass before relying on it. + """ + ), + code( + """ + leads_index = tables["leads"][["lead_id", "lead_created_at"]].copy() + leads_index["lead_created_at"] = pd.to_datetime(leads_index["lead_created_at"]) + cutoff = leads_index.assign( + cutoff=leads_index["lead_created_at"] + pd.Timedelta(days=SNAPSHOT_DAY) + ) + + EVENT_TABLES = [ + ("touches", "touch_timestamp"), + ("sessions", "session_timestamp"), + ("sales_activities", "activity_timestamp"), + ("opportunities", "created_at"), + ] + + for tbl, ts_col in EVENT_TABLES: + df = tables[tbl][[ "lead_id", ts_col]].merge( + cutoff[["lead_id", "cutoff"]], on="lead_id", how="left" + ) + df[ts_col] = pd.to_datetime(df[ts_col]) + violations = df[df[ts_col] > df["cutoff"]] + assert len(violations) == 0, ( + f"{tbl}.{ts_col}: {len(violations)} rows past snapshot cutoff" + ) + max_offset_days = (df[ts_col] - df["lead_created_at" if False else "cutoff"]).max() + print(f" {tbl}.{ts_col}: {len(df):>6,} rows, all <= cutoff (max offset 0.0 days)") + + print() + print("OK — every event row in every public event table satisfies") + print(f" timestamp <= lead_created_at + {SNAPSHOT_DAY} days.") + """ + ), + md( + """ + ## 4. Engineered features + + We build four relational features. Each one starts as a + per-lead aggregate over a single snapshot-safe table, then + joins back into the per-lead snapshot. + + | # | Feature | Source table(s) | Aggregation | + |---|---|---|---| + | 1 | `touches_ch_*` (3 cols) | `touches` | per-lead × `touch_channel` count | + | 2 | `account_avg_touches_per_lead` | `touches`, `leads` | account-level rollup, then merge back | + | 3 | `days_since_last_activity` | `sales_activities`, `leads` | per-lead recency at snapshot cutoff | + | 4 | `industry_target_encoding_train` | `accounts`, `train` | mean-target encoding **fit on train only** | + """ + ), + md( + """ + ### 4.1 Touch-channel breakdown + """ + ), + code( + """ + touches = tables["touches"] + channel_counts = ( + touches.groupby(["lead_id", "touch_channel"]).size().unstack(fill_value=0) + ) + channel_counts.columns = [f"touches_ch_{c}" for c in channel_counts.columns] + print(f"channel feature columns: {list(channel_counts.columns)}") + channel_counts.head() + """ + ), + md( + """ + ### 4.2 Account-level touch density + + How active is the account this lead belongs to, on average? An + account with many leads and many touches per lead is a + structurally different prospect than a one-touch account. + """ + ), + code( + """ + tch_with_acct = touches.merge( + tables["leads"][["lead_id", "account_id"]], on="lead_id", how="left" + ) + account_density = ( + tch_with_acct.groupby("account_id") + .agg( + account_total_touches=("touch_id", "count"), + account_lead_count=("lead_id", "nunique"), + ) + .assign( + account_avg_touches_per_lead=lambda d: ( + d["account_total_touches"] / d["account_lead_count"] + ) + ) + .reset_index() + ) + account_density.head() + """ + ), + md( + """ + ### 4.3 Sales-activity recency at snapshot + + Days between the lead's most recent sales activity and the + snapshot cutoff (`lead_created_at + snapshot_day`). Recency + is a classic engagement signal that's surprisingly hard to + recover from the flat snapshot directly. + """ + ), + code( + """ + sa = tables["sales_activities"][["lead_id", "activity_timestamp"]].copy() + sa["activity_timestamp"] = pd.to_datetime(sa["activity_timestamp"]) + last_activity = ( + sa.groupby("lead_id")["activity_timestamp"] + .max() + .reset_index() + .rename(columns={"activity_timestamp": "last_activity_at"}) + ) + last_activity = last_activity.merge(cutoff[["lead_id", "cutoff"]], on="lead_id") + last_activity["days_since_last_activity"] = ( + last_activity["cutoff"] - last_activity["last_activity_at"] + ).dt.total_seconds() / 86400 + last_activity[["lead_id", "days_since_last_activity"]].head() + """ + ), + md( + """ + ### 4.4 Industry target encoding (train-only, leakage-safe) + + Replace the `industry` string with the conversion rate observed + for that industry **on the training split only**. Computing + the encoding on test leaks the test labels into the features — + a textbook mistake; we avoid it explicitly. + """ + ), + code( + """ + tgt_enc = train.groupby("industry")[TASK].mean().to_dict() + tgt_enc_global_mean = float(train[TASK].mean()) + print(f"per-industry train conversion rates ({len(tgt_enc)} industries):") + for k, v in sorted(tgt_enc.items()): + print(f" {k:<20s} {v:.3f}") + print(f" fallback global mean: {tgt_enc_global_mean:.3f}") + """ + ), + md( + """ + ### 4.5 Stitch features onto train and test + """ + ), + code( + """ + ENGINEERED_NUMERIC = ( + list(channel_counts.columns) + + ["account_avg_touches_per_lead", "days_since_last_activity", + "industry_target_encoding_train"] + ) + + def attach_engineered(df: pd.DataFrame) -> pd.DataFrame: + out = df.copy() + out = out.merge(channel_counts, on="lead_id", how="left") + for col in channel_counts.columns: + out[col] = out[col].fillna(0).astype(int) + out = out.merge( + account_density[["account_id", "account_avg_touches_per_lead"]], + on="account_id", + how="left", + ) + out["account_avg_touches_per_lead"] = ( + out["account_avg_touches_per_lead"].fillna(0).astype(float) + ) + out = out.merge( + last_activity[["lead_id", "days_since_last_activity"]], + on="lead_id", + how="left", + ) + out["industry_target_encoding_train"] = ( + out["industry"].map(tgt_enc).fillna(tgt_enc_global_mean) + ) + return out + + train_eng = attach_engineered(train) + test_eng = attach_engineered(test) + print(f"train_eng shape: {train_eng.shape} ({train.shape[1]} -> {train_eng.shape[1]} cols)") + print(f"new columns: {ENGINEERED_NUMERIC}") + """ + ), + md( + """ + ## 5. Baseline + engineered models + + Same pipeline as notebook 01 (mirrors + `leadforge.validation.release_quality._build_pipeline`). We + train four models so the comparison is fair: + + | Model | Features | Compares against | + |---|---|---| + | LR-flat | flat snapshot only | (validation report baseline) | + | GBM-flat | flat snapshot only | LR-flat | + | LR-eng | flat ∪ engineered | LR-flat | + | GBM-eng | flat ∪ engineered | GBM-flat — the headline lift | + """ + ), + code( + """ + feat_dict = pd.read_csv(BUNDLE / "feature_dictionary.csv") + trap_cols = feat_dict.loc[ + feat_dict["leakage_risk"].astype(bool), "name" + ].tolist() + ID_COLS = ["account_id", "contact_id", "lead_id", "lead_created_at"] + EXCLUDE = set(ID_COLS + trap_cols + [TASK]) + + base_cols = [c for c in train.columns if c not in EXCLUDE] + cat_base = [ + c + for c in base_cols + if not ( + pd.api.types.is_bool_dtype(train[c]) + or pd.api.types.is_numeric_dtype(train[c]) + ) + ] + num_base = [c for c in base_cols if c not in cat_base] + print(f"flat features: {len(base_cols)} (numeric={len(num_base)}, categorical={len(cat_base)})") + print(f"engineered (numeric only): {len(ENGINEERED_NUMERIC)}") + """ + ), + code( + """ + def _sanitize(df: pd.DataFrame, cat_cols: list[str]) -> pd.DataFrame: + out = df.copy() + for c in cat_cols: + out[c] = out[c].astype(object).where(out[c].notna(), None) + return out + + def build_pipeline(num_cols: list[str], cat_cols: list[str], *, model: str) -> Pipeline: + numeric_t = Pipeline( + [("imputer", SimpleImputer(strategy="median")), + ("scaler", StandardScaler())] + ) + cat_t = Pipeline( + [("imputer", SimpleImputer(strategy="most_frequent")), + ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))] + ) + pre = ColumnTransformer( + [("num", numeric_t, num_cols), ("cat", cat_t, cat_cols)], + remainder="drop", + ) + if model == "lr": + clf = LogisticRegression(max_iter=1000, solver="lbfgs", random_state=SEED) + else: + clf = HistGradientBoostingClassifier(random_state=SEED) + return Pipeline([("preprocessor", pre), ("classifier", clf)]) + + def fit_and_score( + x_train_df: pd.DataFrame, + x_test_df: pd.DataFrame, + num_cols: list[str], + cat_cols: list[str], + *, + model: str, + ) -> np.ndarray: + pipe = build_pipeline(num_cols, cat_cols, model=model) + pipe.fit(_sanitize(x_train_df, cat_cols), y_train) + return pipe.predict_proba(_sanitize(x_test_df, cat_cols))[:, 1] + + y_train = train[TASK].astype("boolean").fillna(False).astype(int).values + y_test = test[TASK].astype("boolean").fillna(False).astype(int).values + base_rate = float(y_test.mean()) + """ + ), + code( + """ + num_eng = num_base + ENGINEERED_NUMERIC + + probs_lr_flat = fit_and_score( + train[base_cols], test[base_cols], num_base, cat_base, model="lr" + ) + probs_gbm_flat = fit_and_score( + train[base_cols], test[base_cols], num_base, cat_base, model="gbm" + ) + probs_lr_eng = fit_and_score( + train_eng[base_cols + ENGINEERED_NUMERIC], + test_eng[base_cols + ENGINEERED_NUMERIC], + num_eng, cat_base, model="lr", + ) + probs_gbm_eng = fit_and_score( + train_eng[base_cols + ENGINEERED_NUMERIC], + test_eng[base_cols + ENGINEERED_NUMERIC], + num_eng, cat_base, model="gbm", + ) + """ + ), + code( + """ + def panel(probs: np.ndarray, label: str) -> dict[str, float]: + return { + "label": label, + "auc": roc_auc_score(y_test, probs), + "ap": average_precision_score(y_test, probs), + "brier": brier_score_loss(y_test, probs), + "p@100": precision_at_k(probs, y_test, 100), + } + + rows = [ + panel(probs_lr_flat, "LR flat "), + panel(probs_gbm_flat, "GBM flat "), + panel(probs_lr_eng, "LR flat+rel "), + panel(probs_gbm_eng, "GBM flat+rel "), + ] + print(f"base rate: {base_rate:.3f}") + print() + print(f"{'model':<18s} {'AUC':>7s} {'AP':>7s} {'Brier':>7s} {'P@100':>7s}") + for r in rows: + print(f"{r['label']:<18s} {r['auc']:.4f} {r['ap']:.4f} {r['brier']:.4f} {r['p@100']:.4f}") + """ + ), + md( + """ + ## 6. Lift over the flat baseline + """ + ), + code( + """ + def delta(eng: np.ndarray, base: np.ndarray, name: str) -> dict[str, float]: + return { + "label": name, + "auc": roc_auc_score(y_test, eng) - roc_auc_score(y_test, base), + "ap": average_precision_score(y_test, eng) - average_precision_score(y_test, base), + "brier": brier_score_loss(y_test, eng) - brier_score_loss(y_test, base), + "p@100": precision_at_k(eng, y_test, 100) - precision_at_k(base, y_test, 100), + } + + deltas = [ + delta(probs_gbm_eng, probs_gbm_flat, "GBM(eng) - GBM(flat)"), + delta(probs_gbm_eng, probs_lr_flat, "GBM(eng) - LR(flat) "), + delta(probs_lr_eng, probs_lr_flat, "LR(eng) - LR(flat) "), + ] + print(f"{'comparison':<22s} {'ΔAUC':>8s} {'ΔAP':>8s} {'ΔBrier':>8s} {'ΔP@100':>8s}") + for d in deltas: + print( + f"{d['label']:<22s} {d['auc']:+8.4f} {d['ap']:+8.4f} " + f"{d['brier']:+8.4f} {d['p@100']:+8.4f}" + ) + """ + ), + md( + """ + ## 7. Honest takeaway + + The headline number is **GBM(eng) − GBM(flat) AUC**. On + seed 42, intermediate tier, this lift is positive and + non-trivial: HistGBM closes a meaningful share of the gap to + LR-on-flat once it has the engineered relational features to + chew on. + + However the lift does **not** flip the sign of the GBM-vs-LR + comparison: GBM(eng) is still slightly below LR(flat). This is + the same v1 finding documented in + `release/validation/validation_report.md` (gate **G7.4.4**) + and in the dataset card: the v1 snapshot is dominated by + roughly-linear signal, and HistGBM doesn't consistently beat + LR on it. Engineered relational features narrow the gap; they + don't yet erase it. + + Two takeaways for downstream users: + + 1. **Joins on the public bundle are leakage-safe by + construction.** Section 3 above is the full proof. You can + aggregate any of the four event tables without policing the + horizon yourself. + 2. **Bring your own non-linearities.** If you can find a + feature engineering choice (cross-table interactions, tree + kernels, learned embeddings) that flips the GBM-vs-LR sign, + please [open a `realism` issue](../docs/release/break_me_guide.md) + (template lands in PR 6.3) — that's a finding worth seeing. + + ## Next + + - **Notebook 03** *(PR 6.2)* — leakage and time-window + walkthrough, including the deliberate `total_touches_all` + trap. + - **Notebook 04** *(PR 6.2)* — value-aware ranking, + calibration, and cohort-shift evaluation. + """ + ), + ] + nb.metadata = { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3", + }, + "language_info": { + "name": "python", + "version": "3.11", + }, + } + return nb + + +def main() -> None: + nb = build() + OUT.parent.mkdir(parents=True, exist_ok=True) + text = json.dumps(nb, indent=1, sort_keys=True, ensure_ascii=False) + OUT.write_text(text + "\n", encoding="utf-8") + # Run ruff format on the emitted notebook so the builder's output + # matches the project's pre-commit hook byte-for-byte. Without this + # step a contributor running the builder would see pre-commit + # reformat their notebook on commit, defeating the audit-artifact- + # sync invariant. + subprocess.run(["ruff", "format", str(OUT)], check=True) # noqa: S603, S607 + print(f"wrote {OUT}") + + +if __name__ == "__main__": + main() diff --git a/tests/release/__init__.py b/tests/release/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/tests/release/notebooks/__init__.py b/tests/release/notebooks/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/tests/release/notebooks/test_execute_notebooks.py b/tests/release/notebooks/test_execute_notebooks.py new file mode 100644 index 0000000..7b0fcbc --- /dev/null +++ b/tests/release/notebooks/test_execute_notebooks.py @@ -0,0 +1,62 @@ +"""End-to-end execution gate for the public release notebooks (G13.1). + +Each notebook under ``release/notebooks/*.ipynb`` is executed top to +bottom with ``nbclient`` against the committed public release bundles. +A clean run is the contract: + +* G13.1 — every cell executes from a clean kernel without raising +* G13.2 — notebook 01's ``assert_within_tolerance`` cell pins notebook + metrics to ``release/validation/validation_report.md`` within ±0.05 +* G13.3 — neither notebook touches ``release/intermediate_instructor`` + or any other instructor artefact (enforced by the notebooks' own + ``BUNDLE = Path("../intermediate")`` path discipline; an instructor + load would fail the public-mode manifest assertion in cell 1) + +The test is gated on the public release bundles being on disk (matches +the HF/Kaggle smoke-test pattern in ``tests/scripts/test_package_*``). +``nbclient`` lives in the optional ``[notebooks]`` extra; ``importorskip`` +keeps the dev install lean while letting the dedicated CI job run the +gate against ``pip install -e ".[dev,scripts,notebooks]"``. +""" + +from __future__ import annotations + +from pathlib import Path + +import pytest + +_REPO_ROOT = Path(__file__).resolve().parents[3] +_NOTEBOOKS_DIR = _REPO_ROOT / "release" / "notebooks" +_RELEASE_BUNDLES_PRESENT = (_REPO_ROOT / "release" / "intermediate" / "manifest.json").exists() + +_NOTEBOOKS = [ + "01_baseline_lead_scoring.ipynb", + "02_relational_feature_engineering.ipynb", +] + + +@pytest.mark.skipif(not _RELEASE_BUNDLES_PRESENT, reason="release bundles not present") +@pytest.mark.parametrize("notebook_name", _NOTEBOOKS) +def test_notebook_executes_end_to_end(notebook_name: str) -> None: + """Execute the notebook with nbclient and surface any cell error. + + Each notebook hard-codes ``BUNDLE = Path("../intermediate")`` and + ``sys.path.insert(0, str(Path.cwd()))`` to import the sibling + ``_notebook_utils`` module — both work iff the kernel cwd is the + notebook directory, so we set ``resources={"metadata": {"path": ...}}`` + accordingly. + """ + nbformat = pytest.importorskip("nbformat", reason="nbformat not installed") + nbclient = pytest.importorskip("nbclient", reason="nbclient not installed") + pytest.importorskip("sklearn", reason="scikit-learn not installed") + pytest.importorskip("matplotlib", reason="matplotlib not installed") + + notebook_path = _NOTEBOOKS_DIR / notebook_name + nb = nbformat.read(notebook_path, as_version=4) + client = nbclient.NotebookClient( + nb, + timeout=180, + kernel_name="python3", + resources={"metadata": {"path": str(_NOTEBOOKS_DIR)}}, + ) + client.execute() diff --git a/tests/release/notebooks/test_notebook_utils.py b/tests/release/notebooks/test_notebook_utils.py new file mode 100644 index 0000000..ad09d95 --- /dev/null +++ b/tests/release/notebooks/test_notebook_utils.py @@ -0,0 +1,142 @@ +"""Unit tests for ``release/notebooks/_notebook_utils.py``. + +The notebook helpers ship inside the public bundle (consumers download +them alongside the parquet tables), so they cannot live inside the +``leadforge`` package import tree. These tests load the module through +``importlib`` and exercise it the way a notebook cell would. +""" + +from __future__ import annotations + +import importlib.util +import sys +from pathlib import Path + +import numpy as np +import pytest + +_MODULE_PATH = Path(__file__).resolve().parents[3] / "release" / "notebooks" / "_notebook_utils.py" +_spec = importlib.util.spec_from_file_location("_notebook_utils", _MODULE_PATH) +assert _spec is not None +assert _spec.loader is not None +nbu = importlib.util.module_from_spec(_spec) +sys.modules["_notebook_utils"] = nbu +_spec.loader.exec_module(nbu) + + +# --------------------------------------------------------------------------- +# precision_at_k +# --------------------------------------------------------------------------- + + +def test_precision_at_k_simple() -> None: + scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5]) + y = np.array([1, 1, 0, 0, 0]) + assert nbu.precision_at_k(scores, y, 2) == pytest.approx(1.0) + assert nbu.precision_at_k(scores, y, 5) == pytest.approx(0.4) + + +def test_precision_at_k_handles_ties_via_stable_sort() -> None: + """Ties resolved by original order — same convention as + ``release_quality._precision_at_k`` so notebook and validation + report agree on tied-score rows. + """ + scores = np.array([0.5, 0.5, 0.5, 0.5]) + y = np.array([1, 0, 1, 0]) + # Stable argsort of -scores preserves [0,1,2,3] order, so top-2 = y[:2] + assert nbu.precision_at_k(scores, y, 2) == pytest.approx(0.5) + + +def test_precision_at_k_invalid_k_returns_nan() -> None: + scores = np.array([0.9, 0.8]) + y = np.array([1, 0]) + assert np.isnan(nbu.precision_at_k(scores, y, 0)) + assert np.isnan(nbu.precision_at_k(scores, y, 3)) + + +def test_top_decile_rate_uses_10_percent_cut() -> None: + rng = np.random.default_rng(0) + scores = rng.random(100) + y = (scores > 0.9).astype(int) + # Top-10 by score = exactly the 10 positives → top decile rate = 1.0 + assert nbu.top_decile_rate(scores, y) == pytest.approx(1.0) + + +def test_top_decile_rate_empty_returns_nan() -> None: + assert np.isnan(nbu.top_decile_rate(np.array([]), np.array([]))) + + +# --------------------------------------------------------------------------- +# assert_within_tolerance +# --------------------------------------------------------------------------- + + +def test_assert_within_tolerance_passes_when_inside_band() -> None: + nbu.assert_within_tolerance( + observed={"auc": 0.88, "ap": 0.57}, + target={"auc": 0.886, "ap": 0.575}, + tolerances=0.05, + ) + + +def test_assert_within_tolerance_fails_when_outside_band() -> None: + with pytest.raises(AssertionError) as exc: + nbu.assert_within_tolerance( + observed={"auc": 0.50}, + target={"auc": 0.886}, + tolerances=0.05, + ) + msg = str(exc.value) + assert "auc" in msg + assert "observed=0.5000" in msg + assert "target=0.8860" in msg + + +def test_assert_within_tolerance_per_metric_tolerances() -> None: + nbu.assert_within_tolerance( + observed={"auc": 0.83, "brier": 0.105}, + target={"auc": 0.886, "brier": 0.110}, + tolerances={"auc": 0.10, "brier": 0.05}, + ) + + +def test_assert_within_tolerance_reports_missing_key() -> None: + with pytest.raises(AssertionError, match="missing from observed metrics"): + nbu.assert_within_tolerance( + observed={"auc": 0.88}, + target={"auc": 0.886, "brier": 0.110}, + tolerances=0.05, + ) + + +def test_assert_within_tolerance_label_appears_in_error() -> None: + with pytest.raises(AssertionError, match="notebook 01"): + nbu.assert_within_tolerance( + observed={"auc": 0.5}, + target={"auc": 0.886}, + tolerances=0.05, + label="notebook 01", + ) + + +def test_assert_within_tolerance_aggregates_multiple_failures() -> None: + with pytest.raises(AssertionError) as exc: + nbu.assert_within_tolerance( + observed={"auc": 0.50, "ap": 0.10}, + target={"auc": 0.886, "ap": 0.575}, + tolerances=0.05, + ) + msg = str(exc.value) + assert "auc" in msg + assert "ap" in msg + + +def test_assert_within_tolerance_ignores_extra_observed_keys() -> None: + """Observed metrics may carry extras (e.g. GBM AUC); the gate only + enforces the keys present in ``target``. + """ + nbu.assert_within_tolerance( + observed={"auc": 0.88, "ap": 0.57, "extra": 999.0}, + target={"auc": 0.886, "ap": 0.575}, + tolerances=0.05, + ) From 5841d519a61ffbdcf1a279ab784103b4aae442f8 Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Thu, 7 May 2026 07:39:32 +0300 Subject: [PATCH 2/6] .agent-plan.md: PR 6.1 notebook sequence kicks off Phase 6 Marks notebook 01 refresh and notebook 02 (relational feature engineering) done; leaves 03 (leakage + time windows) and 04 (lift / calibration / value ranking) plus the adversarial-framing artefacts (issue templates, break-me guide, v2 decision log) for later PRs. Co-Authored-By: Claude Opus 4.7 --- .agent-plan.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.agent-plan.md b/.agent-plan.md index 6ee4d2d..d9b545a 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -58,8 +58,8 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family - [x] PR 5.2 (self-review pass): brutal review of the first revision caught real bugs the reviewer would otherwise have to call out. Fixes folded into the PR before review: (#1) `run_packager` validate→write order — both packagers were writing the README/metadata even when validation failed; the validation gate now early-returns with `errors` populated and zero artifacts on disk; new tests exercise the no-write path on both sides. (#2) Instructor README was inlining the public 3-tier README for a 1-tier dataset; replaced with a dedicated `INSTRUCTOR_BODY` constant that opens by linking to the public dataset, describes only the instructor-specific additions (full-horizon tables, hidden DAG, latent registry, mechanism summary), and uses the single-tier tree block. (#3) `validate_upload_dir_safe` now also rejects strict descendants of `release_dir` (e.g. `--huggingface-dir release/intro` would otherwise rmtree the intro bundle); allow-list keeps the canonical `release/{kaggle,huggingface,huggingface-instructor}` direct-children. (#4) `[publish]` extra in `pyproject.toml` (`datasets>=2.14`, `kaggle>=1.6`) makes the gated `load_dataset()` / Kaggle-CLI tests installable in a single command — closes the "G12.3/G12.4 untested in CI" gap to a one-line install. (#5) Shared-primitives extraction finished: `SOURCE_TREE_BLOCK`, `validate_readme_substitution`, `replace_file`, `replace_dir`, `load_manifest` all moved to `scripts/_release_common.py`; both packagers reduced to imports. (#6) Hand-rolled YAML renderer (60 lines + brittle quoting heuristic) replaced with `yaml.safe_dump` + a 4-line `_IndentedDumper` subclass that forces indent-2 on top-level sequences. (#7) Dead `--owner` / `--dataset-slug` CLI flags removed (PR 7.2 will add them when actually needed). (#8) `assemble_upload_dir` now takes `rendered_readme` as a parameter and writes it itself; the public name no longer lies about producing a complete tree. (#9) `build_config_for_tier` made pure (no I/O); `_assert_tier_dir_exists` does the cheap manifest-stat preflight. (#10) `--default-config` with `--variant=instructor` now errors instead of silently ignoring. (#11) Instructor tree-diagram drops the hardcoded "9 tables" claim. (#13–#16) Visual cleanups (duplicate divider, ruff-split imports, `COVER_IMAGE_FILENAME`-vs-`Path.name` redundancy, speculative comment about HF split rename). (#17) Test cruft removed (unused `tmp_path`, dead `tag_lines`); em-dash YAML round-trip parametrised for the instructor `pretty_name`. Net: 1223/1223 tests pass + 5 gated skips (4 `datasets`-SDK round-trip + 1 Kaggle `kaggle`-SDK from PR 5.1); ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR doesn't touch the bundle shape). ### Phase 6 — Notebook sequence + adversarial framing -- [ ] `release/notebooks/{02_relational_feature_engineering,03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb` -- [ ] Update `01_baseline_lead_scoring.ipynb` to reproduce validation report metrics +- [x] PR 6.1: `release/notebooks/01_baseline_lead_scoring.ipynb` refreshed and `release/notebooks/02_relational_feature_engineering.ipynb` added. Notebook 01 trains LR + HistGBM on the public `intermediate` bundle, drops the `total_touches_all` leakage trap (referenced for pedagogy + forward-pointed to notebook 03), and pins its metrics to `release/validation/validation_report.md` within ±0.05 via a new `assert_within_tolerance` helper (G13.2 acceptance gate). Notebook 02 loads the seven snapshot-safe public tables, asserts every event-table `timestamp <= lead_created_at + snapshot_day` inline, demonstrates four legal joins (touch-channel breakdown, account-level density, sales-activity recency, train-only industry target encoding), then trains LR + GBM on flat-baseline-only and flat+relational features and prints a 4-row metric panel + delta panel. Honest takeaway cell quotes G7.4.4: GBM(eng)−GBM(flat) is the headline lift, GBM(eng)−LR(flat) stays slightly negative — consistent with the v1 finding that flat-CSV LR is hard to beat. Both notebooks ship inside the public release bundle alongside the parquet tables (Kaggle/HF consumers download them together) so they import a sibling `release/notebooks/_notebook_utils.py` rather than rely on the `leadforge` package — `precision_at_k` mirrors `release_quality._precision_at_k` (stable argsort), `top_decile_rate` and `assert_within_tolerance` round it out. G13.1 acceptance gate wired: new `[notebooks]` extra (`nbclient`, `nbformat`, `scikit-learn`, `matplotlib`) and a dedicated `notebooks` CI job that regenerates the public bundles via `scripts/build_public_release.py` then nbclient-executes both notebooks end-to-end (`tests/release/notebooks/test_execute_notebooks.py`, parametrised over the two notebooks, gated on bundles-present using the same skip pattern as `test_package_*`). G13.3 path discipline enforced inline: notebook 01 hard-codes `BUNDLE = Path("../intermediate")` and asserts `manifest.exposure_mode == "student_public"`; notebook 02 explicitly excludes `customers`/`subscriptions` per `BANNED_TABLES`. Builders (`scripts/_build_release_notebook_{01,02}.py`) emit deterministic byte-for-byte notebook JSON (audit-artifact-sync pattern from PR 4.1 / 5.1 / 5.2) and shell out to `ruff format` on the emitted file so builder output and pre-commit hook agree. Net: 1246/1246 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5. +- [ ] `release/notebooks/{03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb` - [ ] `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml` - [ ] `docs/release/{break_me_guide,v2_decision_log}.md` From 47889727a813fe2693e7fd330c86cd105c5c72e6 Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Thu, 7 May 2026 08:25:47 +0300 Subject: [PATCH 3/6] PR 6.1 self-review fixes: byte-stable builders, real receipts, train-only joins, JSON-loaded targets MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Folds in the brutal self-review I posted on this PR before it goes to human review. Highlights: HIGH severity - **Builders are now byte-stable.** Cell IDs are assigned deterministically in ``scripts/_release_notebook_common.assemble_notebook`` (was ``nbformat.v4.new_*_cell``'s random UUID), and a new ``tests/scripts/test_release_notebook_builders.py`` runs each builder twice + diffs against the committed ``.ipynb``. The audit-artifact-sync invariant the docstring already advertised is now actually enforced — the test would have caught the previous revision's per-run diff. - **Snapshot-safe verification cell prints a real number.** The previous cell computed ``max_offset_days`` and threw it away, then printed a hardcoded ``(max offset 0.0 days)`` literal in every row. Replaced with ``min headroom under cutoff: {min_headroom_days:6.2f} days`` — formatted from ``(cutoff - timestamp).min()``. On the as-shipped intermediate bundle the receipt now reveals 735 events that land exactly on the cutoff (headroom 0.00 days), which is a real and teachable property of the data. - **``account_density`` is fit on train leads only.** Cell 4.2 was rolling up touches across train+test pooled — a leak path two cells above the train-only target encoding that loudly avoided the same mistake. Now filters ``tch_with_acct`` to ``train["lead_id"]`` before the groupby; test rows in train-only-empty accounts fall back to 0 via fillna. - **Notebook 02 has a tolerance gate.** Pinned the four model AUCs (LR-flat / GBM-flat / LR-eng / GBM-eng) and the headline ``GBM(eng) - GBM(flat)`` lift to per-metric ±0.02 (±0.015 for the lift), plus a sign-aware ``assert lift > 0``. Without it, a future regression that flipped the lift negative would have passed the nbclient gate silently. - **Per-metric tolerances replace the flat ±0.05.** Notebook 01 now uses ±0.02 on AUC/Brier (cross-seed std is well under that) and ±0.05 on AP / top-decile (which have higher seed variance). - **Notebook 01 actually reproduces the validation report.** The previous revision dropped ``total_touches_all`` (the leakage trap) while the report's panel keeps it — the metrics differed by ~0.03 on GBM AUC and the ±0.05 tolerance was hiding the discrepancy. Notebook 01 now keeps the trap (matches ``release_quality._partition_columns``) with explicit narrative + forward-pointer to notebook 03; notebook 02 drops the trap and explains why (relational lift attribution stays clean). MED severity - **``release/notebooks/_release_targets.json``** is the new source of truth for cross-seed-median targets. Notebook 01 loads it at runtime (no more hardcoded constants) and a new ``tests/release/notebooks/test_release_targets_match_report.py`` asserts byte equality against ``release/validation/validation_report.json``'s ``tiers..medians``. Drift now fails CI. Discovered along the way: the previous ``lr_precision_at_100: 0.59`` constant wasn't even *in* the validation report — the report's ranking metric is ``top_decile_rate``. Notebook 01 now computes and gates ``top_decile_rate`` (matches the report). - **``precision_at_k`` mirror test.** Asserts ``_notebook_utils.precision_at_k`` agrees with ``release_quality._precision_at_k`` over random + tied-score fixtures, locking the docstring claim in code. - **Builder duplication eliminated.** ``scripts/_release_notebook_common.py`` hosts ``md`` / ``code`` / ``assemble_notebook`` / ``write_notebook`` (incl. the deterministic-ID assignment and the ``ruff format`` subprocess); each per-notebook builder shrinks to imports + cell list + one ``main()``. - **Industry redundancy fixed.** The ``+rel`` pipelines drop the raw ``industry`` categorical so LR no longer sees one-hot industry *and* its train-only target encoding for the same column. - **Honest takeaway softened.** +0.0147 AUC framed as "suggestive, not conclusive" with a pointer to PR 6.2's seed-sweep notebook 04 + the cross-seed gbm_auc spread (0.027) for context. - **ΔBrier sign convention** spelled out under the delta panel. LOW severity - **Builder rename.** ``scripts/_build_release_notebook_*.py`` → ``scripts/build_release_notebook_*.py`` (drops the leading underscore that's inconsistent with every other script in ``scripts/``). - **CI builds only ``intermediate``.** ``build_public_release.py`` got a ``--tier`` filter; the notebooks job uses ``--tier intermediate``, saving 3 redundant bundle generations per CI run. - **Whitespace + idiom cleanups.** Fixed ``[[ "lead_id"`` extra space; replaced ``.values`` (pandas-deprecated) with ``.to_numpy()`` in both notebooks. - **Broken cross-references** to PR 6.2 / 6.3 / break_me_guide.md rewritten as plain-text "(coming in PR 6.x)" caveats — no link syntax to non-existent files. Verification on this branch: - ``ruff check . / ruff format --check .`` — clean - ``mypy leadforge/`` — clean (82 source files) - ``pytest`` — 1250 passed, 5 skipped (publish-extra-gated) - ``probe_relational_leakage`` on intro/intermediate/advanced — 0/3 paths flag rate on every tier - ``verify_hash_determinism`` — PASS 67/67 - ``validate_release_candidate --no-rebuild`` — PASS, 0 leakage findings - ``BUNDLE_SCHEMA_VERSION`` unchanged at 5 Co-Authored-By: Claude Opus 4.7 --- .github/workflows/ci.yml | 4 +- pyproject.toml | 2 +- .../notebooks/01_baseline_lead_scoring.ipynb | 106 +++--- .../02_relational_feature_engineering.ipynb | 171 ++++++--- release/notebooks/_release_targets.json | 10 + scripts/_release_notebook_common.py | 76 ++++ scripts/build_public_release.py | 15 +- ...ook_01.py => build_release_notebook_01.py} | 188 +++++----- ...ook_02.py => build_release_notebook_02.py} | 326 +++++++++++------- .../release/notebooks/test_notebook_utils.py | 34 ++ .../test_release_targets_match_report.py | 44 +++ .../scripts/test_release_notebook_builders.py | 81 +++++ 12 files changed, 745 insertions(+), 312 deletions(-) create mode 100644 release/notebooks/_release_targets.json create mode 100644 scripts/_release_notebook_common.py rename scripts/{_build_release_notebook_01.py => build_release_notebook_01.py} (69%) rename scripts/{_build_release_notebook_02.py => build_release_notebook_02.py} (62%) create mode 100644 tests/release/notebooks/test_release_targets_match_report.py create mode 100644 tests/scripts/test_release_notebook_builders.py diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index afc1fff..ba794cc 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -148,7 +148,7 @@ jobs: with: python-version: "3.12" - run: pip install -e ".[dev,scripts,notebooks]" - - name: Build public release bundles (intermediate tier) - run: python scripts/build_public_release.py release + - name: Build the intermediate public bundle (only tier the notebooks need) + run: python scripts/build_public_release.py release --tier intermediate - name: Execute release notebooks end-to-end run: pytest tests/release/notebooks/test_execute_notebooks.py -v diff --git a/pyproject.toml b/pyproject.toml index e689338..689426d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -96,7 +96,7 @@ select = ["E", "F", "I", "N", "W", "UP", "B", "C4", "PT", "S"] # markdown tables and print-statement output inside the notebook. # Line length is a property of the rendered cell, not the .py source, # so 100c is the wrong yardstick here. -"scripts/_build_release_notebook_*.py" = ["E501"] +"scripts/build_release_notebook_*.py" = ["E501"] [tool.mypy] python_version = "3.11" diff --git a/release/notebooks/01_baseline_lead_scoring.ipynb b/release/notebooks/01_baseline_lead_scoring.ipynb index f9281b2..e6d6b34 100644 --- a/release/notebooks/01_baseline_lead_scoring.ipynb +++ b/release/notebooks/01_baseline_lead_scoring.ipynb @@ -2,25 +2,26 @@ "cells": [ { "cell_type": "markdown", - "id": "68d902d7", + "id": "cell_000", "metadata": {}, - "source": "# Notebook 01 — Baseline Lead Scoring\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier (the\nrelease default).\n\n**Goal:** train Logistic Regression and Histogram Gradient Boosting\nbaselines on the snapshot-safe public bundle, and verify they\nreproduce the cross-seed-median metrics in\n[`release/validation/validation_report.md`](../validation/validation_report.md)\nwithin the **±0.05** tolerance fixed by acceptance gate **G13.2**.\n\n**Public path discipline (G13.3).** This notebook reads only from\nthe public bundle at `release/intermediate/`. The instructor\ncompanion (`release/intermediate_instructor/`, with full-horizon\nevent tables, the latent registry, the hidden DAG, and the\nmechanism summary) is **not** loaded — public modelling work must\nnever depend on instructor-only artefacts." + "source": "# Notebook 01 — Baseline Lead Scoring\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier (the\nrelease default).\n\n**Goal:** train Logistic Regression and Histogram Gradient Boosting\nbaselines on the snapshot-safe public bundle, and verify they\nreproduce the cross-seed-median metrics in\n[`release/validation/validation_report.md`](../validation/validation_report.md)\nwithin the per-metric tolerances fixed by acceptance gate **G13.2**.\n\n**Public path discipline (G13.3).** This notebook reads only from\nthe public bundle at `release/intermediate/`. The instructor\ncompanion (`release/intermediate_instructor/`, with full-horizon\nevent tables, the latent registry, the hidden DAG, and the\nmechanism summary) is **not** loaded — public modelling work must\nnever depend on instructor-only artefacts." }, { "cell_type": "markdown", - "id": "b89f80f8", + "id": "cell_001", "metadata": {}, "source": "## 1. Setup" }, { "cell_type": "code", "execution_count": null, - "id": "3c83d9f4", + "id": "cell_002", "metadata": {}, "outputs": [], "source": [ "from __future__ import annotations\n", "\n", + "import json\n", "import sys\n", "from pathlib import Path\n", "\n", @@ -39,7 +40,11 @@ "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", "\n", "sys.path.insert(0, str(Path.cwd()))\n", - "from _notebook_utils import assert_within_tolerance, precision_at_k\n", + "from _notebook_utils import (\n", + " assert_within_tolerance,\n", + " precision_at_k,\n", + " top_decile_rate,\n", + ")\n", "\n", "SEED = 42\n", "BUNDLE = Path(\"../intermediate\") # public student bundle\n", @@ -48,45 +53,54 @@ }, { "cell_type": "markdown", - "id": "32596245", + "id": "cell_003", "metadata": {}, - "source": "## 2. Reproduction targets\n\nWe pin the cross-seed median metrics from\n`release/validation/validation_report.md` (intermediate tier).\nEach is a Logistic Regression score on the test split unless\notherwise noted; ranking-based metrics use stable argsort on the\nLR probability so ties resolve identically to the validator." + "source": "## 2. Reproduction targets\n\nWe pin the cross-seed-median metrics for the *intermediate* tier\n(seeds 42–46) from `release/validation/validation_report.json`.\nThe targets live in a sibling file\n(`release/notebooks/_release_targets.json`) so they can't drift\nfrom the validation report without an audit-sync test failure\nin CI.\n\n**Per-metric tolerances** are tighter than a flat 5 % band: the\ncross-seed standard deviation in the report is well under 0.02\non AUC and Brier, and a flat ±0.05 would let a regression slip\nthrough. Average-precision and the small-`k` `top_decile_rate`\nstay at ±0.05 because their seed-to-seed variance is larger." }, { "cell_type": "code", "execution_count": null, - "id": "05e1df57", + "id": "cell_004", "metadata": {}, "outputs": [], "source": [ - "# Cross-seed medians from release/validation/validation_report.md\n", - "# (intermediate tier; seeds 42-46). See ``$.tiers.intermediate.medians``\n", - "# in the JSON sibling for the source of truth.\n", + "with (Path.cwd() / \"_release_targets.json\").open() as fh:\n", + " targets = json.load(fh)[\"intermediate\"]\n", + "\n", + "# Re-key the validation report's metric names into the metric\n", + "# names this notebook prints below, so the gate compares apples\n", + "# to apples.\n", "VALIDATION_REPORT_TARGETS = {\n", - " \"lr_auc\": 0.8859,\n", - " \"gbm_auc\": 0.8755,\n", - " \"lr_average_precision\": 0.5752,\n", - " \"lr_brier\": 0.1096,\n", - " \"lr_precision_at_100\": 0.59,\n", + " \"lr_auc\": targets[\"lr_auc\"],\n", + " \"gbm_auc\": targets[\"gbm_auc\"],\n", + " \"lr_average_precision\": targets[\"lr_average_precision\"],\n", + " \"lr_brier\": targets[\"brier_score\"],\n", + " \"lr_top_decile_rate\": targets[\"top_decile_rate\"],\n", "}\n", - "TOLERANCE = 0.05 # G13.2" + "TOLERANCES = {\n", + " \"lr_auc\": 0.02, # G13.2 — tighter than a flat 5%\n", + " \"gbm_auc\": 0.02,\n", + " \"lr_average_precision\": 0.05, # higher seed variance\n", + " \"lr_brier\": 0.02,\n", + " \"lr_top_decile_rate\": 0.05, # small-k variance\n", + "}\n", + "for k, v in VALIDATION_REPORT_TARGETS.items():\n", + " print(f\" target {k:<24s} {v:.4f} (tol ±{TOLERANCES[k]:.2f})\")" ] }, { "cell_type": "markdown", - "id": "8fb8dbb6", + "id": "cell_005", "metadata": {}, "source": "## 3. Load the bundle\n\nWe load the parquet task splits — the canonical format the\nrelease ships in. The accompanying `lead_scoring.csv` is a\nconvenience export with the same rows but coerced dtypes;\nsticking with parquet preserves nullable `Int64` / `Float64` /\n`boolean` columns the way the validator sees them." }, { "cell_type": "code", "execution_count": null, - "id": "b58b73cf", + "id": "cell_006", "metadata": {}, "outputs": [], "source": [ - "import json\n", - "\n", "train = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"train.parquet\")\n", "valid = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"valid.parquet\")\n", "test = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"test.parquet\")\n", @@ -112,21 +126,22 @@ }, { "cell_type": "markdown", - "id": "8e0e964c", + "id": "cell_007", "metadata": {}, - "source": "## 4. Feature selection\n\nThe feature dictionary flags one column as a deliberate **leakage\ntrap**: `total_touches_all` counts touches over the full 90-day\nhorizon (post-snapshot data). We drop it from the baseline so\nthe comparison against the validation report is honest.\n\nNotebook 03 (the leakage walkthrough, shipping in PR 6.2) exists\nspecifically to show what happens if you keep it." + "source": "## 4. Feature selection\n\nWe use the **same feature set as `release/validation/validation_report.json`**\nso the gate in section 7 is a real reproduction check rather\nthan a related-but-different number. That means we drop only\nthe IDs and the label — every other column in `train` (including\n`total_touches_all`, the documented leakage trap) goes into the\npipeline.\n\n**About `total_touches_all`.** The feature dictionary flags it\nwith `leakage_risk = True`: it counts touches over the full\n90-day horizon, which is post-snapshot data. The validation\nreport keeps it in the panel anyway because (a) its standalone\nAUC is barely above 0.55 (see the *post_snapshot_aggregates*\nbaseline column in the report) and (b) the report exists to\nmeasure the v1 dataset's *as-shipped* difficulty, leakage trap\nincluded. **Notebook 03** *(coming in PR 6.2)* walks through\nwhat dropping the trap does to performance and how to detect\nsimilar traps from feature audits alone." }, { "cell_type": "code", "execution_count": null, - "id": "d0242bbb", + "id": "cell_008", "metadata": {}, "outputs": [], "source": [ "feat_dict = pd.read_csv(BUNDLE / \"feature_dictionary.csv\")\n", "trap_cols = feat_dict.loc[feat_dict[\"leakage_risk\"].astype(bool), \"name\"].tolist()\n", "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n", - "EXCLUDE = set(ID_COLS + trap_cols + [TASK])\n", + "# Mirrors ``release_quality._partition_columns`` — IDs + label only.\n", + "EXCLUDE = set(ID_COLS + [TASK])\n", "\n", "feature_cols = [c for c in train.columns if c not in EXCLUDE]\n", "cat_cols = [\n", @@ -136,21 +151,21 @@ "]\n", "num_cols = [c for c in feature_cols if c not in cat_cols]\n", "\n", - "print(f\"Leakage-trap columns dropped: {trap_cols}\")\n", + "print(f\"Leakage-trap columns kept (see narrative above): {trap_cols}\")\n", "print(f\"Categorical features ({len(cat_cols)}): {cat_cols}\")\n", "print(f\"Numeric features ({len(num_cols)}): {num_cols}\")" ] }, { "cell_type": "markdown", - "id": "24db5070", + "id": "cell_009", "metadata": {}, "source": "## 5. Preprocessing pipeline\n\nMirrors `leadforge.validation.release_quality._build_pipeline`\nso the notebook's metric panel and the validation report's\nmetric panel agree by construction:\n\n- numeric: median-impute, then `StandardScaler`\n- categorical: most-frequent-impute, then dense `OneHotEncoder`\n with `handle_unknown=\"ignore\"`" }, { "cell_type": "code", "execution_count": null, - "id": "aa90d33c", + "id": "cell_010", "metadata": {}, "outputs": [], "source": [ @@ -163,8 +178,8 @@ "\n", "x_train = _sanitize_categoricals(train[feature_cols])\n", "x_test = _sanitize_categoricals(test[feature_cols])\n", - "y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).values\n", - "y_test = test[TASK].astype(\"boolean\").fillna(False).astype(int).values\n", + "y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", + "y_test = test[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", "\n", "numeric_t = Pipeline([(\"imputer\", SimpleImputer(strategy=\"median\")), (\"scaler\", StandardScaler())])\n", "categorical_t = Pipeline(\n", @@ -184,14 +199,14 @@ }, { "cell_type": "markdown", - "id": "e2fd9f82", + "id": "cell_011", "metadata": {}, "source": "## 6. Train baselines and score the test split" }, { "cell_type": "code", "execution_count": null, - "id": "972b8207", + "id": "cell_012", "metadata": {}, "outputs": [], "source": [ @@ -222,6 +237,9 @@ " \"gbm_auc\": float(roc_auc_score(y_test, gbm_probs)),\n", " \"lr_average_precision\": float(average_precision_score(y_test, lr_probs)),\n", " \"lr_brier\": float(brier_score_loss(y_test, lr_probs)),\n", + " \"lr_top_decile_rate\": top_decile_rate(lr_probs, y_test),\n", + " # Print-only; not pinned (the validation report tracks\n", + " # ``top_decile_rate`` instead, which we gate above).\n", " \"lr_precision_at_50\": precision_at_k(lr_probs, y_test, 50),\n", " \"lr_precision_at_100\": precision_at_k(lr_probs, y_test, 100),\n", " \"lr_precision_at_200\": precision_at_k(lr_probs, y_test, 200),\n", @@ -232,36 +250,36 @@ }, { "cell_type": "markdown", - "id": "dcafd0ff", + "id": "cell_013", "metadata": {}, - "source": "## 7. Tolerance check (G13.2)\n\nThe notebook's printed metrics must match the cross-seed medians\nin `validation_report.md` to within ±0.05. If a future change\nbreaks this, the assertion below fails — and CI catches it,\nbecause the same cell runs under `nbclient` in the\n`notebooks` job." + "source": "## 7. Tolerance check (G13.2)\n\nThe notebook's printed metrics must match the cross-seed medians\nin `validation_report.json` to within the per-metric tolerances\ndeclared in section 2. If a future change breaks this, the\nassertion below fails — and CI catches it, because the same\ncell runs under `nbclient` in the `notebooks` job." }, { "cell_type": "code", "execution_count": null, - "id": "a1098b05", + "id": "cell_014", "metadata": {}, "outputs": [], "source": [ "assert_within_tolerance(\n", " observed=metrics,\n", " target=VALIDATION_REPORT_TARGETS,\n", - " tolerances=TOLERANCE,\n", - " label=\"notebook 01 vs validation_report.md (intermediate tier)\",\n", + " tolerances=TOLERANCES,\n", + " label=\"notebook 01 vs validation_report.json (intermediate tier)\",\n", ")\n", - "print(\"OK — all reported metrics are within ±0.05 of the validation report medians.\")" + "print(\"OK — all gated metrics are within their per-metric tolerance.\")" ] }, { "cell_type": "markdown", - "id": "c90a0f46", + "id": "cell_015", "metadata": {}, "source": "## 8. Decile lift chart\n\nStandard sanity-check for ranking quality: sort the test set by\nscore, bucket into deciles, plot the per-decile conversion rate\nvs the base rate." }, { "cell_type": "code", "execution_count": null, - "id": "06ebd46a", + "id": "cell_016", "metadata": {}, "outputs": [], "source": [ @@ -293,14 +311,14 @@ }, { "cell_type": "markdown", - "id": "9c89ba88", + "id": "cell_017", "metadata": {}, "source": "## 9. Calibration plot\n\nReliability diagram: bin predicted probabilities into 10 equal-\nwidth buckets, plot mean predicted vs mean observed. The\nvalidation report's reference reliability plot for the\nintermediate tier lives at\n`release/validation/figures/calibration_intermediate.png`." }, { "cell_type": "code", "execution_count": null, - "id": "4650343d", + "id": "cell_018", "metadata": {}, "outputs": [], "source": [ @@ -330,9 +348,9 @@ }, { "cell_type": "markdown", - "id": "b4a85526", + "id": "cell_019", "metadata": {}, - "source": "## 10. Next\n\n- **Notebook 02** — engineer features by joining the snapshot-\n safe relational tables under `release/intermediate/tables/`,\n then measure the lift over the flat-CSV LR baseline above.\n- **Notebook 03** *(PR 6.2)* — leakage and time-window\n walkthrough; works through what `total_touches_all` does to\n your AUC if you forget to drop it.\n- **Notebook 04** *(PR 6.2)* — value-aware ranking\n (`expected_acv` × P(convert)), threshold selection, and the\n cohort-shift stress test." + "source": "## 10. Next\n\n- **Notebook 02** — engineer features by joining the snapshot-\n safe relational tables under `release/intermediate/tables/`,\n then measure the lift over the flat-CSV LR baseline above.\n- **Notebook 03** *(coming in PR 6.2)* — leakage and time-window\n walkthrough; works through what `total_touches_all` does to\n your AUC if you forget to drop it.\n- **Notebook 04** *(coming in PR 6.2)* — value-aware ranking\n (`expected_acv` × P(convert)), threshold selection, and the\n cohort-shift stress test." } ], "metadata": { diff --git a/release/notebooks/02_relational_feature_engineering.ipynb b/release/notebooks/02_relational_feature_engineering.ipynb index 86e3c06..5d8cf46 100644 --- a/release/notebooks/02_relational_feature_engineering.ipynb +++ b/release/notebooks/02_relational_feature_engineering.ipynb @@ -2,20 +2,20 @@ "cells": [ { "cell_type": "markdown", - "id": "50b3ab16", + "id": "cell_000", "metadata": {}, - "source": "# Notebook 02 — Relational Feature Engineering\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier.\n\nThe flat task split in notebook 01 is one snapshot view of a\nricher relational world. The public bundle ships seven\n**snapshot-safe** tables under `release/intermediate/tables/`:\nevery event row is filtered to `timestamp <= lead_created_at +\nsnapshot_day`, so any join you can write is **leakage-safe by\nconstruction**.\n\nThis notebook walks through:\n\n1. Loading the seven public tables.\n2. Verifying the snapshot-safe contract inline (the teachable\n moment — see the contract, don't just read about it).\n3. Engineering four relational features.\n4. Training a HistGBM on `flat ∪ engineered` columns.\n5. Reporting the AUC / AP / Brier / P@K delta vs the flat-CSV\n baseline from notebook 01.\n\n**Public path discipline (G13.3).** This notebook reads only\nfrom `release/intermediate/`. The instructor companion\n(`release/intermediate_instructor/`) is **not** loaded —\nrelational feature engineering must work from the public\nartefact alone. Tables omitted from the public bundle on\npurpose (`customers`, `subscriptions`) live only in the\ninstructor companion because their mere existence reconstructs\nthe label." + "source": "# Notebook 02 — Relational Feature Engineering\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier.\n\nThe flat task split in notebook 01 is one snapshot view of a\nricher relational world. The public bundle ships seven\n**snapshot-safe** tables under `release/intermediate/tables/`:\nevery event row is filtered to `timestamp <= lead_created_at +\nsnapshot_day`, so any join you can write is **leakage-safe by\nconstruction**.\n\nThis notebook walks through:\n\n1. Loading the seven public tables.\n2. Verifying the snapshot-safe contract inline (the teachable\n moment — see the contract pass, don't just read about it).\n3. Engineering four relational features, with train-only\n discipline for any aggregation that crosses splits.\n4. Training a HistGBM on `flat ∪ engineered` columns.\n5. Reporting the AUC / AP / Brier / P@K delta vs the flat-CSV\n baseline, with a tolerance gate that fails CI if the\n headline lift regresses.\n\n**Public path discipline (G13.3).** This notebook reads only\nfrom `release/intermediate/`. The instructor companion\n(`release/intermediate_instructor/`) is **not** loaded —\nrelational feature engineering must work from the public\nartefact alone. Tables omitted from the public bundle on\npurpose (`customers`, `subscriptions`) live only in the\ninstructor companion because their mere existence reconstructs\nthe label.\n\n**Leakage-trap discipline.** Unlike notebook 01 (which\nreproduces the validation report's panel verbatim and\ntherefore keeps `total_touches_all`), notebook 02 **drops**\nthe trap from the flat baseline. We're teaching feature\nengineering here; mixing a known-leaky column into the\n\"before\" panel would muddy the relational lift attribution." }, { "cell_type": "markdown", - "id": "d190c384", + "id": "cell_001", "metadata": {}, "source": "## 1. Setup" }, { "cell_type": "code", "execution_count": null, - "id": "9c8bd996", + "id": "cell_002", "metadata": {}, "outputs": [], "source": [ @@ -40,7 +40,7 @@ "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", "\n", "sys.path.insert(0, str(Path.cwd()))\n", - "from _notebook_utils import precision_at_k\n", + "from _notebook_utils import assert_within_tolerance, precision_at_k\n", "\n", "SEED = 42\n", "BUNDLE = Path(\"../intermediate\") # public student bundle\n", @@ -56,14 +56,14 @@ }, { "cell_type": "markdown", - "id": "660c7c32", + "id": "cell_003", "metadata": {}, "source": "## 2. Load the seven public tables\n\nThese are the only tables present under `release/intermediate/\ntables/`. `customers` and `subscriptions` are deliberately\nabsent — they only exist for *converted* leads, so their\npresence in a public bundle would reconstruct the label." }, { "cell_type": "code", "execution_count": null, - "id": "7255dda9", + "id": "cell_004", "metadata": {}, "outputs": [], "source": [ @@ -86,14 +86,14 @@ }, { "cell_type": "markdown", - "id": "b5de76b4", + "id": "cell_005", "metadata": {}, - "source": "## 3. Verify the snapshot-safe contract\n\nFor every event-table row joined back to its lead, assert\n`timestamp <= lead_created_at + snapshot_day`. This is the\ninvariant that makes any join you can write leakage-safe — and\nit's worth seeing it pass before relying on it." + "source": "## 3. Verify the snapshot-safe contract\n\nFor every event-table row joined back to its lead, assert\n`timestamp <= lead_created_at + snapshot_day`. The reported\n**headroom** is the *minimum* gap any event row leaves between\nits timestamp and the snapshot cutoff — a non-negative number\nwhen the contract holds. Showing the actual minimum (rather\nthan just \"all <= cutoff\") makes the receipt honest: if a\nfuture regeneration ever shaves the contract close, you'll see\nthe headroom shrink before the assertion fires." }, { "cell_type": "code", "execution_count": null, - "id": "7d270e29", + "id": "cell_006", "metadata": {}, "outputs": [], "source": [ @@ -115,8 +115,12 @@ " df[ts_col] = pd.to_datetime(df[ts_col])\n", " violations = df[df[ts_col] > df[\"cutoff\"]]\n", " assert len(violations) == 0, f\"{tbl}.{ts_col}: {len(violations)} rows past snapshot cutoff\"\n", - " max_offset_days = (df[ts_col] - df[\"lead_created_at\" if False else \"cutoff\"]).max()\n", - " print(f\" {tbl}.{ts_col}: {len(df):>6,} rows, all <= cutoff (max offset 0.0 days)\")\n", + " min_headroom = (df[\"cutoff\"] - df[ts_col]).min()\n", + " min_headroom_days = float(min_headroom.total_seconds()) / 86400.0\n", + " print(\n", + " f\" {tbl}.{ts_col}: {len(df):>6,} rows; \"\n", + " f\"min headroom under cutoff: {min_headroom_days:6.2f} days\"\n", + " )\n", "\n", "print()\n", "print(\"OK — every event row in every public event table satisfies\")\n", @@ -125,20 +129,20 @@ }, { "cell_type": "markdown", - "id": "bd27322d", + "id": "cell_007", "metadata": {}, - "source": "## 4. Engineered features\n\nWe build four relational features. Each one starts as a\nper-lead aggregate over a single snapshot-safe table, then\njoins back into the per-lead snapshot.\n\n| # | Feature | Source table(s) | Aggregation |\n|---|---|---|---|\n| 1 | `touches_ch_*` (3 cols) | `touches` | per-lead × `touch_channel` count |\n| 2 | `account_avg_touches_per_lead` | `touches`, `leads` | account-level rollup, then merge back |\n| 3 | `days_since_last_activity` | `sales_activities`, `leads` | per-lead recency at snapshot cutoff |\n| 4 | `industry_target_encoding_train` | `accounts`, `train` | mean-target encoding **fit on train only** |" + "source": "## 4. Engineered features\n\nWe build four relational features. Each one starts as a\nper-lead aggregate over a single snapshot-safe table, then\njoins back into the per-lead snapshot. Aggregations that pool\nacross leads (account-level density, target encoding) are\n**fit on the train split only** and applied to test via\njoin — same train-only discipline that prevents target leakage\nin mean encoding.\n\n| # | Feature | Source table(s) | Aggregation |\n|---|---|---|---|\n| 1 | `touches_ch_*` (3 cols) | `touches` | per-lead × `touch_channel` count |\n| 2 | `account_avg_touches_per_lead` | `touches`, `leads`, train lead set | account-level rollup over **train leads only**, then merge back |\n| 3 | `days_since_last_activity` | `sales_activities`, `leads` | per-lead recency at snapshot cutoff |\n| 4 | `industry_target_encoding_train` | `accounts`, train | mean-target encoding **fit on train only** |" }, { "cell_type": "markdown", - "id": "b2327091", + "id": "cell_008", "metadata": {}, "source": "### 4.1 Touch-channel breakdown" }, { "cell_type": "code", "execution_count": null, - "id": "57b3b3b0", + "id": "cell_009", "metadata": {}, "outputs": [], "source": [ @@ -151,20 +155,22 @@ }, { "cell_type": "markdown", - "id": "a06ca3d2", + "id": "cell_010", "metadata": {}, - "source": "### 4.2 Account-level touch density\n\nHow active is the account this lead belongs to, on average? An\naccount with many leads and many touches per lead is a\nstructurally different prospect than a one-touch account." + "source": "### 4.2 Account-level touch density\n\nHow active is the account this lead belongs to, on average? An\naccount with many leads and many touches per lead is a\nstructurally different prospect than a one-touch account.\n\n**Train-only.** We compute the account-level rollup using only\ntrain leads' touches. Test leads in train-only-empty accounts\nfall back to 0 via `fillna`. This avoids letting test rows\ninfluence the train feature distribution — same discipline\napplied to mean-target encoding in 4.4." }, { "cell_type": "code", "execution_count": null, - "id": "9b579133", + "id": "cell_011", "metadata": {}, "outputs": [], "source": [ + "train_lead_ids = set(train[\"lead_id\"].tolist())\n", "tch_with_acct = touches.merge(tables[\"leads\"][[\"lead_id\", \"account_id\"]], on=\"lead_id\", how=\"left\")\n", + "tch_train = tch_with_acct[tch_with_acct[\"lead_id\"].isin(train_lead_ids)]\n", "account_density = (\n", - " tch_with_acct.groupby(\"account_id\")\n", + " tch_train.groupby(\"account_id\")\n", " .agg(\n", " account_total_touches=(\"touch_id\", \"count\"),\n", " account_lead_count=(\"lead_id\", \"nunique\"),\n", @@ -176,19 +182,20 @@ " )\n", " .reset_index()\n", ")\n", + "print(f\"account_density rows: {len(account_density):,} (accounts represented in train touches)\")\n", "account_density.head()" ] }, { "cell_type": "markdown", - "id": "82a01e14", + "id": "cell_012", "metadata": {}, "source": "### 4.3 Sales-activity recency at snapshot\n\nDays between the lead's most recent sales activity and the\nsnapshot cutoff (`lead_created_at + snapshot_day`). Recency\nis a classic engagement signal that's surprisingly hard to\nrecover from the flat snapshot directly." }, { "cell_type": "code", "execution_count": null, - "id": "27455280", + "id": "cell_013", "metadata": {}, "outputs": [], "source": [ @@ -209,14 +216,14 @@ }, { "cell_type": "markdown", - "id": "a7ad7015", + "id": "cell_014", "metadata": {}, "source": "### 4.4 Industry target encoding (train-only, leakage-safe)\n\nReplace the `industry` string with the conversion rate observed\nfor that industry **on the training split only**. Computing\nthe encoding on test leaks the test labels into the features —\na textbook mistake; we avoid it explicitly." }, { "cell_type": "code", "execution_count": null, - "id": "8b414ff8", + "id": "cell_015", "metadata": {}, "outputs": [], "source": [ @@ -230,14 +237,14 @@ }, { "cell_type": "markdown", - "id": "27067494", + "id": "cell_016", "metadata": {}, "source": "### 4.5 Stitch features onto train and test" }, { "cell_type": "code", "execution_count": null, - "id": "4caa0426", + "id": "cell_017", "metadata": {}, "outputs": [], "source": [ @@ -278,14 +285,14 @@ }, { "cell_type": "markdown", - "id": "ada700a7", + "id": "cell_018", "metadata": {}, - "source": "## 5. Baseline + engineered models\n\nSame pipeline as notebook 01 (mirrors\n`leadforge.validation.release_quality._build_pipeline`). We\ntrain four models so the comparison is fair:\n\n| Model | Features | Compares against |\n|---|---|---|\n| LR-flat | flat snapshot only | (validation report baseline) |\n| GBM-flat | flat snapshot only | LR-flat |\n| LR-eng | flat ∪ engineered | LR-flat |\n| GBM-eng | flat ∪ engineered | GBM-flat — the headline lift |" + "source": "## 5. Baseline + engineered models\n\nSame pipeline as notebook 01 (mirrors\n`leadforge.validation.release_quality._build_pipeline`). We\ntrain four models so the comparison is fair:\n\n| Model | Features | Compares against |\n|---|---|---|\n| LR-flat | flat snapshot, no trap | (validation report baseline) |\n| GBM-flat | flat snapshot, no trap | LR-flat |\n| LR-eng | flat ∪ engineered, no trap, no raw `industry` | LR-flat |\n| GBM-eng | flat ∪ engineered, no trap, no raw `industry` | GBM-flat — the headline lift |\n\nThe `+rel` pipelines drop the raw `industry` categorical\nbecause the train-only target encoding already represents it\nas a numeric column — feeding both would feed the same column\ntwice." }, { "cell_type": "code", "execution_count": null, - "id": "03a228b9", + "id": "cell_019", "metadata": {}, "outputs": [], "source": [ @@ -301,14 +308,18 @@ " if not (pd.api.types.is_bool_dtype(train[c]) or pd.api.types.is_numeric_dtype(train[c]))\n", "]\n", "num_base = [c for c in base_cols if c not in cat_base]\n", + "# Drop raw `industry` from the +rel categorical list so the LR\n", + "# pipeline doesn't see it twice (one-hot + target-encoded).\n", + "cat_eng = [c for c in cat_base if c != \"industry\"]\n", "print(f\"flat features: {len(base_cols)} (numeric={len(num_base)}, categorical={len(cat_base)})\")\n", - "print(f\"engineered (numeric only): {len(ENGINEERED_NUMERIC)}\")" + "print(f\"engineered (numeric only): {len(ENGINEERED_NUMERIC)}\")\n", + "print(f\"+rel categorical list drops: {set(cat_base) - set(cat_eng)}\")" ] }, { "cell_type": "code", "execution_count": null, - "id": "3d2b1d93", + "id": "cell_020", "metadata": {}, "outputs": [], "source": [ @@ -321,12 +332,18 @@ "\n", "def build_pipeline(num_cols: list[str], cat_cols: list[str], *, model: str) -> Pipeline:\n", " numeric_t = Pipeline(\n", - " [(\"imputer\", SimpleImputer(strategy=\"median\")), (\"scaler\", StandardScaler())]\n", + " [\n", + " (\"imputer\", SimpleImputer(strategy=\"median\")),\n", + " (\"scaler\", StandardScaler()),\n", + " ]\n", " )\n", " cat_t = Pipeline(\n", " [\n", " (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n", - " (\"encoder\", OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)),\n", + " (\n", + " \"encoder\",\n", + " OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False),\n", + " ),\n", " ]\n", " )\n", " pre = ColumnTransformer(\n", @@ -353,15 +370,15 @@ " return pipe.predict_proba(_sanitize(x_test_df, cat_cols))[:, 1]\n", "\n", "\n", - "y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).values\n", - "y_test = test[TASK].astype(\"boolean\").fillna(False).astype(int).values\n", + "y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", + "y_test = test[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", "base_rate = float(y_test.mean())" ] }, { "cell_type": "code", "execution_count": null, - "id": "40b3c459", + "id": "cell_021", "metadata": {}, "outputs": [], "source": [ @@ -373,14 +390,14 @@ " train_eng[base_cols + ENGINEERED_NUMERIC],\n", " test_eng[base_cols + ENGINEERED_NUMERIC],\n", " num_eng,\n", - " cat_base,\n", + " cat_eng,\n", " model=\"lr\",\n", ")\n", "probs_gbm_eng = fit_and_score(\n", " train_eng[base_cols + ENGINEERED_NUMERIC],\n", " test_eng[base_cols + ENGINEERED_NUMERIC],\n", " num_eng,\n", - " cat_base,\n", + " cat_eng,\n", " model=\"gbm\",\n", ")" ] @@ -388,16 +405,16 @@ { "cell_type": "code", "execution_count": null, - "id": "54dd36b8", + "id": "cell_022", "metadata": {}, "outputs": [], "source": [ "def panel(probs: np.ndarray, label: str) -> dict[str, float]:\n", " return {\n", " \"label\": label,\n", - " \"auc\": roc_auc_score(y_test, probs),\n", - " \"ap\": average_precision_score(y_test, probs),\n", - " \"brier\": brier_score_loss(y_test, probs),\n", + " \"auc\": float(roc_auc_score(y_test, probs)),\n", + " \"ap\": float(average_precision_score(y_test, probs)),\n", + " \"brier\": float(brier_score_loss(y_test, probs)),\n", " \"p@100\": precision_at_k(probs, y_test, 100),\n", " }\n", "\n", @@ -417,24 +434,24 @@ }, { "cell_type": "markdown", - "id": "b9f1b3fc", + "id": "cell_023", "metadata": {}, - "source": "## 6. Lift over the flat baseline" + "source": "## 6. Lift over the flat baseline\n\n*Sign convention.* `ΔAUC`, `ΔAP`, `ΔP@100`: **higher is\nbetter** (positive = engineered features helped). `ΔBrier`:\n**lower is better** (Brier is a loss, so negative ΔBrier =\nimproved calibration)." }, { "cell_type": "code", "execution_count": null, - "id": "5afd0015", + "id": "cell_024", "metadata": {}, "outputs": [], "source": [ "def delta(eng: np.ndarray, base: np.ndarray, name: str) -> dict[str, float]:\n", " return {\n", " \"label\": name,\n", - " \"auc\": roc_auc_score(y_test, eng) - roc_auc_score(y_test, base),\n", - " \"ap\": average_precision_score(y_test, eng) - average_precision_score(y_test, base),\n", - " \"brier\": brier_score_loss(y_test, eng) - brier_score_loss(y_test, base),\n", - " \"p@100\": precision_at_k(eng, y_test, 100) - precision_at_k(base, y_test, 100),\n", + " \"auc\": float(roc_auc_score(y_test, eng) - roc_auc_score(y_test, base)),\n", + " \"ap\": float(average_precision_score(y_test, eng) - average_precision_score(y_test, base)),\n", + " \"brier\": float(brier_score_loss(y_test, eng) - brier_score_loss(y_test, base)),\n", + " \"p@100\": float(precision_at_k(eng, y_test, 100) - precision_at_k(base, y_test, 100)),\n", " }\n", "\n", "\n", @@ -453,9 +470,61 @@ }, { "cell_type": "markdown", - "id": "a4130dff", + "id": "cell_025", + "metadata": {}, + "source": "## 7. Tolerance gate\n\nPin the four model AUCs and the headline lift to per-metric\ntolerances so a regression in any pipeline component (feature\nengineering, leakage discipline, sklearn upgrade) breaks CI.\nTargets and tolerances are seed-42-specific by design — this\nnotebook is reproducible-by-seed, not a cross-seed median." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_026", + "metadata": {}, + "outputs": [], + "source": [ + "# Single-seed (seed=42) AUCs observed on the as-shipped\n", + "# intermediate bundle. Tolerances allow ±0.02 around each\n", + "# baseline (well outside numerical jitter, well inside the\n", + "# band that would let GBM(eng) silently drop below GBM(flat)).\n", + "NB02_TARGETS = {\n", + " \"lr_flat_auc\": 0.8737,\n", + " \"gbm_flat_auc\": 0.8432,\n", + " \"lr_eng_auc\": 0.8763,\n", + " \"gbm_eng_auc\": 0.8579,\n", + " \"headline_lift_auc\": 0.0147, # GBM(eng) - GBM(flat)\n", + "}\n", + "NB02_TOLERANCES = {\n", + " \"lr_flat_auc\": 0.02,\n", + " \"gbm_flat_auc\": 0.02,\n", + " \"lr_eng_auc\": 0.02,\n", + " \"gbm_eng_auc\": 0.02,\n", + " \"headline_lift_auc\": 0.015, # tighter — sign-aware below\n", + "}\n", + "\n", + "observed = {\n", + " \"lr_flat_auc\": rows[0][\"auc\"],\n", + " \"gbm_flat_auc\": rows[1][\"auc\"],\n", + " \"lr_eng_auc\": rows[2][\"auc\"],\n", + " \"gbm_eng_auc\": rows[3][\"auc\"],\n", + " \"headline_lift_auc\": deltas[0][\"auc\"],\n", + "}\n", + "assert_within_tolerance(\n", + " observed=observed,\n", + " target=NB02_TARGETS,\n", + " tolerances=NB02_TOLERANCES,\n", + " label=\"notebook 02 metric panel (seed 42, intermediate)\",\n", + ")\n", + "assert observed[\"headline_lift_auc\"] > 0.0, (\n", + " \"GBM(eng) − GBM(flat) AUC went non-positive — relational \"\n", + " \"lift disappeared; investigate before merging.\"\n", + ")\n", + "print(\"OK — all panel metrics within tolerance and headline lift is positive.\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_027", "metadata": {}, - "source": "## 7. Honest takeaway\n\nThe headline number is **GBM(eng) − GBM(flat) AUC**. On\nseed 42, intermediate tier, this lift is positive and\nnon-trivial: HistGBM closes a meaningful share of the gap to\nLR-on-flat once it has the engineered relational features to\nchew on.\n\nHowever the lift does **not** flip the sign of the GBM-vs-LR\ncomparison: GBM(eng) is still slightly below LR(flat). This is\nthe same v1 finding documented in\n`release/validation/validation_report.md` (gate **G7.4.4**)\nand in the dataset card: the v1 snapshot is dominated by\nroughly-linear signal, and HistGBM doesn't consistently beat\nLR on it. Engineered relational features narrow the gap; they\ndon't yet erase it.\n\nTwo takeaways for downstream users:\n\n1. **Joins on the public bundle are leakage-safe by\n construction.** Section 3 above is the full proof. You can\n aggregate any of the four event tables without policing the\n horizon yourself.\n2. **Bring your own non-linearities.** If you can find a\n feature engineering choice (cross-table interactions, tree\n kernels, learned embeddings) that flips the GBM-vs-LR sign,\n please [open a `realism` issue](../docs/release/break_me_guide.md)\n (template lands in PR 6.3) — that's a finding worth seeing.\n\n## Next\n\n- **Notebook 03** *(PR 6.2)* — leakage and time-window\n walkthrough, including the deliberate `total_touches_all`\n trap.\n- **Notebook 04** *(PR 6.2)* — value-aware ranking,\n calibration, and cohort-shift evaluation." + "source": "## 8. Honest takeaway\n\nOn seed 42 the GBM(eng) − GBM(flat) AUC lift is small\n(+0.0147). Cross-seed variance for `gbm_auc` on this bundle\nis ~0.027 (see `release/validation/validation_report.json`,\n`tiers.intermediate.spreads.gbm_auc`), so a single-seed lift\nof this size is **suggestive, not conclusive**. Confirming a\nreal signal needs a seed sweep — see the cohort-shift / seed\nharness coming in PR 6.2's notebook 04.\n\nThe lift also does **not** flip the sign of the GBM-vs-LR\ncomparison: GBM(eng) is still slightly below LR(flat). This\nis the same v1 finding documented in\n`release/validation/validation_report.md` (gate **G7.4.4**)\nand the dataset card: the v1 snapshot is dominated by\nroughly-linear signal, and HistGBM doesn't consistently beat\nLR on it. Engineered relational features narrow the gap; on\nthis seed they don't yet erase it.\n\nTwo takeaways for downstream users:\n\n1. **Joins on the public bundle are leakage-safe by\n construction.** Section 3 above is the full proof. You can\n aggregate any of the four event tables without policing the\n horizon yourself.\n2. **Bring your own non-linearities.** If a feature\n engineering choice (cross-table interactions, tree\n kernels, learned embeddings, bigger seed sweeps) flips the\n GBM-vs-LR sign reliably, that's a finding worth filing —\n the *break_me_guide* template lands in PR 6.3.\n\n## Next\n\n- **Notebook 03** *(coming in PR 6.2)* — leakage and\n time-window walkthrough, including the deliberate\n `total_touches_all` trap notebook 01 keeps and this notebook\n drops.\n- **Notebook 04** *(coming in PR 6.2)* — value-aware ranking,\n calibration, and cohort-shift evaluation with a seed sweep." } ], "metadata": { diff --git a/release/notebooks/_release_targets.json b/release/notebooks/_release_targets.json new file mode 100644 index 0000000..7ed4cdb --- /dev/null +++ b/release/notebooks/_release_targets.json @@ -0,0 +1,10 @@ +{ + "_doc": "Cross-seed-median metric values from release/validation/validation_report.json, sliced to the metrics the release notebooks pin via assert_within_tolerance. Audited against the report by tests/release/notebooks/test_release_targets_match_report.py — if you change a value here, the test will fail unless the corresponding median in the validation report changes to match.", + "intermediate": { + "brier_score": 0.10963449613199748, + "gbm_auc": 0.875461913160326, + "lr_auc": 0.8858759553203998, + "lr_average_precision": 0.5752148545119874, + "top_decile_rate": 0.5866666666666667 + } +} diff --git a/scripts/_release_notebook_common.py b/scripts/_release_notebook_common.py new file mode 100644 index 0000000..a3daf8d --- /dev/null +++ b/scripts/_release_notebook_common.py @@ -0,0 +1,76 @@ +"""Shared scaffolding for the ``scripts/build_release_notebook_*.py`` builders. + +The release-notebook builders share ~80% of their plumbing: imports, +``md`` / ``code`` cell wrappers, the metadata block, and the +``write_notebook`` step that serializes JSON and shells out to +``ruff format`` so builder output matches the project's pre-commit hook +byte-for-byte. Putting them here keeps each per-notebook builder +focused on its cell list. + +Cell IDs are assigned deterministically (``cell_000``, ``cell_001``, +...) so re-running a builder produces a byte-identical ``.ipynb`` — +audit-artifact-sync, the same invariant PR 4.1 / 5.1 / 5.2 use for +``release/`` artifacts. Nondeterministic IDs are the default in +``nbformat.v4.new_*_cell``; without an explicit override every build +diverges and the byte-equality test in +``tests/scripts/test_release_notebook_builders.py`` fails. +""" + +from __future__ import annotations + +import json +import subprocess +from pathlib import Path +from textwrap import dedent + +import nbformat as nbf + +_KERNEL_METADATA = { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3", + }, + "language_info": { + "name": "python", + "version": "3.11", + }, +} + + +def md(source: str) -> nbf.NotebookNode: + """Markdown cell with ``source`` dedented and surrounding blank lines stripped.""" + return nbf.v4.new_markdown_cell(dedent(source).strip("\n")) + + +def code(source: str) -> nbf.NotebookNode: + """Cleared code cell (no execution_count, no outputs).""" + cell = nbf.v4.new_code_cell(dedent(source).strip("\n")) + cell["execution_count"] = None + cell["outputs"] = [] + return cell + + +def assemble_notebook(cells: list[nbf.NotebookNode]) -> nbf.NotebookNode: + """Build a notebook from ``cells`` with deterministic cell IDs and stable metadata.""" + nb = nbf.v4.new_notebook() + for index, cell in enumerate(cells): + cell["id"] = f"cell_{index:03d}" + nb.cells = cells + nb.metadata = _KERNEL_METADATA + return nb + + +def write_notebook(out_path: Path, nb: nbf.NotebookNode) -> None: + """Write ``nb`` to ``out_path`` then run ``ruff format`` for hook-parity. + + Without the post-write format step a contributor running the builder + would see pre-commit reformat their notebook on commit, defeating the + audit-artifact-sync invariant. ``json.dumps`` parameters are pinned + so the pre-format bytes are deterministic. + """ + out_path.parent.mkdir(parents=True, exist_ok=True) + text = json.dumps(nb, indent=1, sort_keys=True, ensure_ascii=False) + out_path.write_text(text + "\n", encoding="utf-8") + subprocess.run(["ruff", "format", str(out_path)], check=True) # noqa: S603, S607 + print(f"wrote {out_path}") diff --git a/scripts/build_public_release.py b/scripts/build_public_release.py index f597d7b..5ccb152 100644 --- a/scripts/build_public_release.py +++ b/scripts/build_public_release.py @@ -125,6 +125,16 @@ def main() -> None: "Default: wall-clock now. Use this for reproducible bundles." ), ) + parser.add_argument( + "--tier", + choices=[name for name, _, _ in BUNDLES] + ["all"], + default="all", + help=( + "Build only the named tier (default: all four). " + "Useful for CI jobs that need a single bundle — e.g. the " + "release-notebooks job only needs ``intermediate``." + ), + ) args = parser.parse_args() output_root = Path(args.output_dir) @@ -135,7 +145,8 @@ def main() -> None: if license_src.exists(): shutil.copy2(license_src, output_root / "LICENSE") - for dir_name, exposure_mode, difficulty in BUNDLES: + selected = BUNDLES if args.tier == "all" else [b for b in BUNDLES if b[0] == args.tier] + for dir_name, exposure_mode, difficulty in selected: bundle_dir = output_root / dir_name print(f"Generating {dir_name} ({exposure_mode}, {difficulty})...", file=sys.stderr) generate_and_save( @@ -162,7 +173,7 @@ def main() -> None: # Summary print("\n=== Release Summary ===") - for dir_name, _, _ in BUNDLES: + for dir_name, _, _ in selected: bundle_dir = output_root / dir_name print_summary(bundle_dir, dir_name) diff --git a/scripts/_build_release_notebook_01.py b/scripts/build_release_notebook_01.py similarity index 69% rename from scripts/_build_release_notebook_01.py rename to scripts/build_release_notebook_01.py index 8868097..7388fe1 100644 --- a/scripts/_build_release_notebook_01.py +++ b/scripts/build_release_notebook_01.py @@ -2,41 +2,38 @@ Run from the repository root:: - python scripts/_build_release_notebook_01.py + python scripts/build_release_notebook_01.py -Produces a cleared notebook (no execution_count, no outputs) with stable -metadata. Re-running yields a byte-identical file — same audit-artifact- -sync pattern PR 4.1 / 5.1 / 5.2 use for ``release/`` artifacts. +Cells are assigned deterministic IDs by ``_release_notebook_common`` so +re-running yields a byte-identical file — same audit-artifact-sync +pattern PR 4.1 / 5.1 / 5.2 use for ``release/`` artifacts. The byte +equality is enforced in CI by ``tests/scripts/test_release_notebook_builders.py``. """ from __future__ import annotations -import json -import subprocess +import sys from pathlib import Path -from textwrap import dedent -import nbformat as nbf +# Make ``scripts/`` importable regardless of how this file is loaded +# (mirrors ``scripts/package_hf_release.py``'s sys.path dance). +sys.path.insert(0, str(Path(__file__).resolve().parent)) + +import nbformat as nbf # noqa: E402 — must follow sys.path insert +from _release_notebook_common import ( # noqa: E402 — must follow sys.path insert + assemble_notebook, + code, + md, + write_notebook, +) OUT = ( Path(__file__).resolve().parents[1] / "release" / "notebooks" / "01_baseline_lead_scoring.ipynb" ) -def md(source: str) -> nbf.NotebookNode: - return nbf.v4.new_markdown_cell(dedent(source).strip("\n")) - - -def code(source: str) -> nbf.NotebookNode: - cell = nbf.v4.new_code_cell(dedent(source).strip("\n")) - cell["execution_count"] = None - cell["outputs"] = [] - return cell - - -def build() -> nbf.NotebookNode: - nb = nbf.v4.new_notebook() - nb.cells = [ +def cells() -> list[nbf.NotebookNode]: + return [ md( """ # Notebook 01 — Baseline Lead Scoring @@ -48,7 +45,7 @@ def build() -> nbf.NotebookNode: baselines on the snapshot-safe public bundle, and verify they reproduce the cross-seed-median metrics in [`release/validation/validation_report.md`](../validation/validation_report.md) - within the **±0.05** tolerance fixed by acceptance gate **G13.2**. + within the per-metric tolerances fixed by acceptance gate **G13.2**. **Public path discipline (G13.3).** This notebook reads only from the public bundle at `release/intermediate/`. The instructor @@ -58,15 +55,12 @@ def build() -> nbf.NotebookNode: never depend on instructor-only artefacts. """ ), - md( - """ - ## 1. Setup - """ - ), + md("## 1. Setup"), code( """ from __future__ import annotations + import json import sys from pathlib import Path @@ -85,7 +79,11 @@ def build() -> nbf.NotebookNode: from sklearn.preprocessing import OneHotEncoder, StandardScaler sys.path.insert(0, str(Path.cwd())) - from _notebook_utils import assert_within_tolerance, precision_at_k + from _notebook_utils import ( + assert_within_tolerance, + precision_at_k, + top_decile_rate, + ) SEED = 42 BUNDLE = Path("../intermediate") # public student bundle @@ -96,26 +94,44 @@ def build() -> nbf.NotebookNode: """ ## 2. Reproduction targets - We pin the cross-seed median metrics from - `release/validation/validation_report.md` (intermediate tier). - Each is a Logistic Regression score on the test split unless - otherwise noted; ranking-based metrics use stable argsort on the - LR probability so ties resolve identically to the validator. + We pin the cross-seed-median metrics for the *intermediate* tier + (seeds 42–46) from `release/validation/validation_report.json`. + The targets live in a sibling file + (`release/notebooks/_release_targets.json`) so they can't drift + from the validation report without an audit-sync test failure + in CI. + + **Per-metric tolerances** are tighter than a flat 5 % band: the + cross-seed standard deviation in the report is well under 0.02 + on AUC and Brier, and a flat ±0.05 would let a regression slip + through. Average-precision and the small-`k` `top_decile_rate` + stay at ±0.05 because their seed-to-seed variance is larger. """ ), code( """ - # Cross-seed medians from release/validation/validation_report.md - # (intermediate tier; seeds 42-46). See ``$.tiers.intermediate.medians`` - # in the JSON sibling for the source of truth. + with (Path.cwd() / "_release_targets.json").open() as fh: + targets = json.load(fh)["intermediate"] + + # Re-key the validation report's metric names into the metric + # names this notebook prints below, so the gate compares apples + # to apples. VALIDATION_REPORT_TARGETS = { - "lr_auc": 0.8859, - "gbm_auc": 0.8755, - "lr_average_precision": 0.5752, - "lr_brier": 0.1096, - "lr_precision_at_100": 0.59, + "lr_auc": targets["lr_auc"], + "gbm_auc": targets["gbm_auc"], + "lr_average_precision": targets["lr_average_precision"], + "lr_brier": targets["brier_score"], + "lr_top_decile_rate": targets["top_decile_rate"], } - TOLERANCE = 0.05 # G13.2 + TOLERANCES = { + "lr_auc": 0.02, # G13.2 — tighter than a flat 5% + "gbm_auc": 0.02, + "lr_average_precision": 0.05, # higher seed variance + "lr_brier": 0.02, + "lr_top_decile_rate": 0.05, # small-k variance + } + for k, v in VALIDATION_REPORT_TARGETS.items(): + print(f" target {k:<24s} {v:.4f} (tol ±{TOLERANCES[k]:.2f})") """ ), md( @@ -131,8 +147,6 @@ def build() -> nbf.NotebookNode: ), code( """ - import json - train = pd.read_parquet(BUNDLE / "tasks" / TASK / "train.parquet") valid = pd.read_parquet(BUNDLE / "tasks" / TASK / "valid.parquet") test = pd.read_parquet(BUNDLE / "tasks" / TASK / "test.parquet") @@ -162,13 +176,23 @@ def build() -> nbf.NotebookNode: """ ## 4. Feature selection - The feature dictionary flags one column as a deliberate **leakage - trap**: `total_touches_all` counts touches over the full 90-day - horizon (post-snapshot data). We drop it from the baseline so - the comparison against the validation report is honest. - - Notebook 03 (the leakage walkthrough, shipping in PR 6.2) exists - specifically to show what happens if you keep it. + We use the **same feature set as `release/validation/validation_report.json`** + so the gate in section 7 is a real reproduction check rather + than a related-but-different number. That means we drop only + the IDs and the label — every other column in `train` (including + `total_touches_all`, the documented leakage trap) goes into the + pipeline. + + **About `total_touches_all`.** The feature dictionary flags it + with `leakage_risk = True`: it counts touches over the full + 90-day horizon, which is post-snapshot data. The validation + report keeps it in the panel anyway because (a) its standalone + AUC is barely above 0.55 (see the *post_snapshot_aggregates* + baseline column in the report) and (b) the report exists to + measure the v1 dataset's *as-shipped* difficulty, leakage trap + included. **Notebook 03** *(coming in PR 6.2)* walks through + what dropping the trap does to performance and how to detect + similar traps from feature audits alone. """ ), code( @@ -178,7 +202,8 @@ def build() -> nbf.NotebookNode: feat_dict["leakage_risk"].astype(bool), "name" ].tolist() ID_COLS = ["account_id", "contact_id", "lead_id", "lead_created_at"] - EXCLUDE = set(ID_COLS + trap_cols + [TASK]) + # Mirrors ``release_quality._partition_columns`` — IDs + label only. + EXCLUDE = set(ID_COLS + [TASK]) feature_cols = [c for c in train.columns if c not in EXCLUDE] cat_cols = [ @@ -191,7 +216,7 @@ def build() -> nbf.NotebookNode: ] num_cols = [c for c in feature_cols if c not in cat_cols] - print(f"Leakage-trap columns dropped: {trap_cols}") + print(f"Leakage-trap columns kept (see narrative above): {trap_cols}") print(f"Categorical features ({len(cat_cols)}): {cat_cols}") print(f"Numeric features ({len(num_cols)}): {num_cols}") """ @@ -219,8 +244,8 @@ def _sanitize_categoricals(df: pd.DataFrame) -> pd.DataFrame: x_train = _sanitize_categoricals(train[feature_cols]) x_test = _sanitize_categoricals(test[feature_cols]) - y_train = train[TASK].astype("boolean").fillna(False).astype(int).values - y_test = test[TASK].astype("boolean").fillna(False).astype(int).values + y_train = train[TASK].astype("boolean").fillna(False).astype(int).to_numpy() + y_test = test[TASK].astype("boolean").fillna(False).astype(int).to_numpy() numeric_t = Pipeline( [("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())] @@ -240,11 +265,7 @@ def _sanitize_categoricals(df: pd.DataFrame) -> pd.DataFrame: ) """ ), - md( - """ - ## 6. Train baselines and score the test split - """ - ), + md("## 6. Train baselines and score the test split"), code( """ lr_pipe = Pipeline( @@ -276,6 +297,9 @@ def _sanitize_categoricals(df: pd.DataFrame) -> pd.DataFrame: "gbm_auc": float(roc_auc_score(y_test, gbm_probs)), "lr_average_precision": float(average_precision_score(y_test, lr_probs)), "lr_brier": float(brier_score_loss(y_test, lr_probs)), + "lr_top_decile_rate": top_decile_rate(lr_probs, y_test), + # Print-only; not pinned (the validation report tracks + # ``top_decile_rate`` instead, which we gate above). "lr_precision_at_50": precision_at_k(lr_probs, y_test, 50), "lr_precision_at_100": precision_at_k(lr_probs, y_test, 100), "lr_precision_at_200": precision_at_k(lr_probs, y_test, 200), @@ -289,10 +313,10 @@ def _sanitize_categoricals(df: pd.DataFrame) -> pd.DataFrame: ## 7. Tolerance check (G13.2) The notebook's printed metrics must match the cross-seed medians - in `validation_report.md` to within ±0.05. If a future change - breaks this, the assertion below fails — and CI catches it, - because the same cell runs under `nbclient` in the - `notebooks` job. + in `validation_report.json` to within the per-metric tolerances + declared in section 2. If a future change breaks this, the + assertion below fails — and CI catches it, because the same + cell runs under `nbclient` in the `notebooks` job. """ ), code( @@ -300,10 +324,10 @@ def _sanitize_categoricals(df: pd.DataFrame) -> pd.DataFrame: assert_within_tolerance( observed=metrics, target=VALIDATION_REPORT_TARGETS, - tolerances=TOLERANCE, - label="notebook 01 vs validation_report.md (intermediate tier)", + tolerances=TOLERANCES, + label="notebook 01 vs validation_report.json (intermediate tier)", ) - print("OK — all reported metrics are within ±0.05 of the validation report medians.") + print("OK — all gated metrics are within their per-metric tolerance.") """ ), md( @@ -389,41 +413,19 @@ def _sanitize_categoricals(df: pd.DataFrame) -> pd.DataFrame: - **Notebook 02** — engineer features by joining the snapshot- safe relational tables under `release/intermediate/tables/`, then measure the lift over the flat-CSV LR baseline above. - - **Notebook 03** *(PR 6.2)* — leakage and time-window + - **Notebook 03** *(coming in PR 6.2)* — leakage and time-window walkthrough; works through what `total_touches_all` does to your AUC if you forget to drop it. - - **Notebook 04** *(PR 6.2)* — value-aware ranking + - **Notebook 04** *(coming in PR 6.2)* — value-aware ranking (`expected_acv` × P(convert)), threshold selection, and the cohort-shift stress test. """ ), ] - nb.metadata = { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3", - }, - "language_info": { - "name": "python", - "version": "3.11", - }, - } - return nb def main() -> None: - nb = build() - OUT.parent.mkdir(parents=True, exist_ok=True) - text = json.dumps(nb, indent=1, sort_keys=True, ensure_ascii=False) - OUT.write_text(text + "\n", encoding="utf-8") - # Run ruff format on the emitted notebook so the builder's output - # matches the project's pre-commit hook byte-for-byte. Without this - # step a contributor running the builder would see pre-commit - # reformat their notebook on commit, defeating the audit-artifact- - # sync invariant. - subprocess.run(["ruff", "format", str(OUT)], check=True) # noqa: S603, S607 - print(f"wrote {OUT}") + write_notebook(OUT, assemble_notebook(cells())) if __name__ == "__main__": diff --git a/scripts/_build_release_notebook_02.py b/scripts/build_release_notebook_02.py similarity index 62% rename from scripts/_build_release_notebook_02.py rename to scripts/build_release_notebook_02.py index 2519b2a..049797f 100644 --- a/scripts/_build_release_notebook_02.py +++ b/scripts/build_release_notebook_02.py @@ -2,20 +2,27 @@ Run from the repository root:: - python scripts/_build_release_notebook_02.py + python scripts/build_release_notebook_02.py -Produces a cleared notebook (no execution_count, no outputs) with stable -metadata. Re-running yields a byte-identical file. +Cells are assigned deterministic IDs by ``_release_notebook_common`` so +re-running yields a byte-identical file — same audit-artifact-sync +pattern PR 4.1 / 5.1 / 5.2 use for ``release/`` artifacts. """ from __future__ import annotations -import json -import subprocess +import sys from pathlib import Path -from textwrap import dedent -import nbformat as nbf +sys.path.insert(0, str(Path(__file__).resolve().parent)) + +import nbformat as nbf # noqa: E402 — must follow sys.path insert +from _release_notebook_common import ( # noqa: E402 — must follow sys.path insert + assemble_notebook, + code, + md, + write_notebook, +) OUT = ( Path(__file__).resolve().parents[1] @@ -25,20 +32,8 @@ ) -def md(source: str) -> nbf.NotebookNode: - return nbf.v4.new_markdown_cell(dedent(source).strip("\n")) - - -def code(source: str) -> nbf.NotebookNode: - cell = nbf.v4.new_code_cell(dedent(source).strip("\n")) - cell["execution_count"] = None - cell["outputs"] = [] - return cell - - -def build() -> nbf.NotebookNode: - nb = nbf.v4.new_notebook() - nb.cells = [ +def cells() -> list[nbf.NotebookNode]: + return [ md( """ # Notebook 02 — Relational Feature Engineering @@ -56,11 +51,13 @@ def build() -> nbf.NotebookNode: 1. Loading the seven public tables. 2. Verifying the snapshot-safe contract inline (the teachable - moment — see the contract, don't just read about it). - 3. Engineering four relational features. + moment — see the contract pass, don't just read about it). + 3. Engineering four relational features, with train-only + discipline for any aggregation that crosses splits. 4. Training a HistGBM on `flat ∪ engineered` columns. 5. Reporting the AUC / AP / Brier / P@K delta vs the flat-CSV - baseline from notebook 01. + baseline, with a tolerance gate that fails CI if the + headline lift regresses. **Public path discipline (G13.3).** This notebook reads only from `release/intermediate/`. The instructor companion @@ -70,13 +67,16 @@ def build() -> nbf.NotebookNode: purpose (`customers`, `subscriptions`) live only in the instructor companion because their mere existence reconstructs the label. + + **Leakage-trap discipline.** Unlike notebook 01 (which + reproduces the validation report's panel verbatim and + therefore keeps `total_touches_all`), notebook 02 **drops** + the trap from the flat baseline. We're teaching feature + engineering here; mixing a known-leaky column into the + "before" panel would muddy the relational lift attribution. """ ), - md( - """ - ## 1. Setup - """ - ), + md("## 1. Setup"), code( """ from __future__ import annotations @@ -100,7 +100,7 @@ def build() -> nbf.NotebookNode: from sklearn.preprocessing import OneHotEncoder, StandardScaler sys.path.insert(0, str(Path.cwd())) - from _notebook_utils import precision_at_k + from _notebook_utils import assert_within_tolerance, precision_at_k SEED = 42 BUNDLE = Path("../intermediate") # public student bundle @@ -151,9 +151,13 @@ def build() -> nbf.NotebookNode: ## 3. Verify the snapshot-safe contract For every event-table row joined back to its lead, assert - `timestamp <= lead_created_at + snapshot_day`. This is the - invariant that makes any join you can write leakage-safe — and - it's worth seeing it pass before relying on it. + `timestamp <= lead_created_at + snapshot_day`. The reported + **headroom** is the *minimum* gap any event row leaves between + its timestamp and the snapshot cutoff — a non-negative number + when the contract holds. Showing the actual minimum (rather + than just "all <= cutoff") makes the receipt honest: if a + future regeneration ever shaves the contract close, you'll see + the headroom shrink before the assertion fires. """ ), code( @@ -172,7 +176,7 @@ def build() -> nbf.NotebookNode: ] for tbl, ts_col in EVENT_TABLES: - df = tables[tbl][[ "lead_id", ts_col]].merge( + df = tables[tbl][["lead_id", ts_col]].merge( cutoff[["lead_id", "cutoff"]], on="lead_id", how="left" ) df[ts_col] = pd.to_datetime(df[ts_col]) @@ -180,8 +184,12 @@ def build() -> nbf.NotebookNode: assert len(violations) == 0, ( f"{tbl}.{ts_col}: {len(violations)} rows past snapshot cutoff" ) - max_offset_days = (df[ts_col] - df["lead_created_at" if False else "cutoff"]).max() - print(f" {tbl}.{ts_col}: {len(df):>6,} rows, all <= cutoff (max offset 0.0 days)") + min_headroom = (df["cutoff"] - df[ts_col]).min() + min_headroom_days = float(min_headroom.total_seconds()) / 86400.0 + print( + f" {tbl}.{ts_col}: {len(df):>6,} rows; " + f"min headroom under cutoff: {min_headroom_days:6.2f} days" + ) print() print("OK — every event row in every public event table satisfies") @@ -194,21 +202,21 @@ def build() -> nbf.NotebookNode: We build four relational features. Each one starts as a per-lead aggregate over a single snapshot-safe table, then - joins back into the per-lead snapshot. + joins back into the per-lead snapshot. Aggregations that pool + across leads (account-level density, target encoding) are + **fit on the train split only** and applied to test via + join — same train-only discipline that prevents target leakage + in mean encoding. | # | Feature | Source table(s) | Aggregation | |---|---|---|---| | 1 | `touches_ch_*` (3 cols) | `touches` | per-lead × `touch_channel` count | - | 2 | `account_avg_touches_per_lead` | `touches`, `leads` | account-level rollup, then merge back | + | 2 | `account_avg_touches_per_lead` | `touches`, `leads`, train lead set | account-level rollup over **train leads only**, then merge back | | 3 | `days_since_last_activity` | `sales_activities`, `leads` | per-lead recency at snapshot cutoff | - | 4 | `industry_target_encoding_train` | `accounts`, `train` | mean-target encoding **fit on train only** | - """ - ), - md( - """ - ### 4.1 Touch-channel breakdown + | 4 | `industry_target_encoding_train` | `accounts`, train | mean-target encoding **fit on train only** | """ ), + md("### 4.1 Touch-channel breakdown"), code( """ touches = tables["touches"] @@ -227,15 +235,23 @@ def build() -> nbf.NotebookNode: How active is the account this lead belongs to, on average? An account with many leads and many touches per lead is a structurally different prospect than a one-touch account. + + **Train-only.** We compute the account-level rollup using only + train leads' touches. Test leads in train-only-empty accounts + fall back to 0 via `fillna`. This avoids letting test rows + influence the train feature distribution — same discipline + applied to mean-target encoding in 4.4. """ ), code( """ + train_lead_ids = set(train["lead_id"].tolist()) tch_with_acct = touches.merge( tables["leads"][["lead_id", "account_id"]], on="lead_id", how="left" ) + tch_train = tch_with_acct[tch_with_acct["lead_id"].isin(train_lead_ids)] account_density = ( - tch_with_acct.groupby("account_id") + tch_train.groupby("account_id") .agg( account_total_touches=("touch_id", "count"), account_lead_count=("lead_id", "nunique"), @@ -247,6 +263,10 @@ def build() -> nbf.NotebookNode: ) .reset_index() ) + print( + f"account_density rows: {len(account_density):,} " + f"(accounts represented in train touches)" + ) account_density.head() """ ), @@ -297,17 +317,16 @@ def build() -> nbf.NotebookNode: print(f" fallback global mean: {tgt_enc_global_mean:.3f}") """ ), - md( - """ - ### 4.5 Stitch features onto train and test - """ - ), + md("### 4.5 Stitch features onto train and test"), code( """ ENGINEERED_NUMERIC = ( list(channel_counts.columns) - + ["account_avg_touches_per_lead", "days_since_last_activity", - "industry_target_encoding_train"] + + [ + "account_avg_touches_per_lead", + "days_since_last_activity", + "industry_target_encoding_train", + ] ) def attach_engineered(df: pd.DataFrame) -> pd.DataFrame: @@ -335,7 +354,10 @@ def attach_engineered(df: pd.DataFrame) -> pd.DataFrame: train_eng = attach_engineered(train) test_eng = attach_engineered(test) - print(f"train_eng shape: {train_eng.shape} ({train.shape[1]} -> {train_eng.shape[1]} cols)") + print( + f"train_eng shape: {train_eng.shape} " + f"({train.shape[1]} -> {train_eng.shape[1]} cols)" + ) print(f"new columns: {ENGINEERED_NUMERIC}") """ ), @@ -349,10 +371,15 @@ def attach_engineered(df: pd.DataFrame) -> pd.DataFrame: | Model | Features | Compares against | |---|---|---| - | LR-flat | flat snapshot only | (validation report baseline) | - | GBM-flat | flat snapshot only | LR-flat | - | LR-eng | flat ∪ engineered | LR-flat | - | GBM-eng | flat ∪ engineered | GBM-flat — the headline lift | + | LR-flat | flat snapshot, no trap | (validation report baseline) | + | GBM-flat | flat snapshot, no trap | LR-flat | + | LR-eng | flat ∪ engineered, no trap, no raw `industry` | LR-flat | + | GBM-eng | flat ∪ engineered, no trap, no raw `industry` | GBM-flat — the headline lift | + + The `+rel` pipelines drop the raw `industry` categorical + because the train-only target encoding already represents it + as a numeric column — feeding both would feed the same column + twice. """ ), code( @@ -374,8 +401,15 @@ def attach_engineered(df: pd.DataFrame) -> pd.DataFrame: ) ] num_base = [c for c in base_cols if c not in cat_base] - print(f"flat features: {len(base_cols)} (numeric={len(num_base)}, categorical={len(cat_base)})") + # Drop raw `industry` from the +rel categorical list so the LR + # pipeline doesn't see it twice (one-hot + target-encoded). + cat_eng = [c for c in cat_base if c != "industry"] + print( + f"flat features: {len(base_cols)} " + f"(numeric={len(num_base)}, categorical={len(cat_base)})" + ) print(f"engineered (numeric only): {len(ENGINEERED_NUMERIC)}") + print(f"+rel categorical list drops: {set(cat_base) - set(cat_eng)}") """ ), code( @@ -388,12 +422,19 @@ def _sanitize(df: pd.DataFrame, cat_cols: list[str]) -> pd.DataFrame: def build_pipeline(num_cols: list[str], cat_cols: list[str], *, model: str) -> Pipeline: numeric_t = Pipeline( - [("imputer", SimpleImputer(strategy="median")), - ("scaler", StandardScaler())] + [ + ("imputer", SimpleImputer(strategy="median")), + ("scaler", StandardScaler()), + ] ) cat_t = Pipeline( - [("imputer", SimpleImputer(strategy="most_frequent")), - ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))] + [ + ("imputer", SimpleImputer(strategy="most_frequent")), + ( + "encoder", + OneHotEncoder(handle_unknown="ignore", sparse_output=False), + ), + ] ) pre = ColumnTransformer( [("num", numeric_t, num_cols), ("cat", cat_t, cat_cols)], @@ -417,8 +458,8 @@ def fit_and_score( pipe.fit(_sanitize(x_train_df, cat_cols), y_train) return pipe.predict_proba(_sanitize(x_test_df, cat_cols))[:, 1] - y_train = train[TASK].astype("boolean").fillna(False).astype(int).values - y_test = test[TASK].astype("boolean").fillna(False).astype(int).values + y_train = train[TASK].astype("boolean").fillna(False).astype(int).to_numpy() + y_test = test[TASK].astype("boolean").fillna(False).astype(int).to_numpy() base_rate = float(y_test.mean()) """ ), @@ -435,12 +476,12 @@ def fit_and_score( probs_lr_eng = fit_and_score( train_eng[base_cols + ENGINEERED_NUMERIC], test_eng[base_cols + ENGINEERED_NUMERIC], - num_eng, cat_base, model="lr", + num_eng, cat_eng, model="lr", ) probs_gbm_eng = fit_and_score( train_eng[base_cols + ENGINEERED_NUMERIC], test_eng[base_cols + ENGINEERED_NUMERIC], - num_eng, cat_base, model="gbm", + num_eng, cat_eng, model="gbm", ) """ ), @@ -449,9 +490,9 @@ def fit_and_score( def panel(probs: np.ndarray, label: str) -> dict[str, float]: return { "label": label, - "auc": roc_auc_score(y_test, probs), - "ap": average_precision_score(y_test, probs), - "brier": brier_score_loss(y_test, probs), + "auc": float(roc_auc_score(y_test, probs)), + "ap": float(average_precision_score(y_test, probs)), + "brier": float(brier_score_loss(y_test, probs)), "p@100": precision_at_k(probs, y_test, 100), } @@ -465,12 +506,20 @@ def panel(probs: np.ndarray, label: str) -> dict[str, float]: print() print(f"{'model':<18s} {'AUC':>7s} {'AP':>7s} {'Brier':>7s} {'P@100':>7s}") for r in rows: - print(f"{r['label']:<18s} {r['auc']:.4f} {r['ap']:.4f} {r['brier']:.4f} {r['p@100']:.4f}") + print( + f"{r['label']:<18s} {r['auc']:.4f} {r['ap']:.4f} " + f"{r['brier']:.4f} {r['p@100']:.4f}" + ) """ ), md( """ ## 6. Lift over the flat baseline + + *Sign convention.* `ΔAUC`, `ΔAP`, `ΔP@100`: **higher is + better** (positive = engineered features helped). `ΔBrier`: + **lower is better** (Brier is a loss, so negative ΔBrier = + improved calibration). """ ), code( @@ -478,10 +527,16 @@ def panel(probs: np.ndarray, label: str) -> dict[str, float]: def delta(eng: np.ndarray, base: np.ndarray, name: str) -> dict[str, float]: return { "label": name, - "auc": roc_auc_score(y_test, eng) - roc_auc_score(y_test, base), - "ap": average_precision_score(y_test, eng) - average_precision_score(y_test, base), - "brier": brier_score_loss(y_test, eng) - brier_score_loss(y_test, base), - "p@100": precision_at_k(eng, y_test, 100) - precision_at_k(base, y_test, 100), + "auc": float(roc_auc_score(y_test, eng) - roc_auc_score(y_test, base)), + "ap": float( + average_precision_score(y_test, eng) - average_precision_score(y_test, base) + ), + "brier": float( + brier_score_loss(y_test, eng) - brier_score_loss(y_test, base) + ), + "p@100": float( + precision_at_k(eng, y_test, 100) - precision_at_k(base, y_test, 100) + ), } deltas = [ @@ -499,22 +554,76 @@ def delta(eng: np.ndarray, base: np.ndarray, name: str) -> dict[str, float]: ), md( """ - ## 7. Honest takeaway + ## 7. Tolerance gate + + Pin the four model AUCs and the headline lift to per-metric + tolerances so a regression in any pipeline component (feature + engineering, leakage discipline, sklearn upgrade) breaks CI. + Targets and tolerances are seed-42-specific by design — this + notebook is reproducible-by-seed, not a cross-seed median. + """ + ), + code( + """ + # Single-seed (seed=42) AUCs observed on the as-shipped + # intermediate bundle. Tolerances allow ±0.02 around each + # baseline (well outside numerical jitter, well inside the + # band that would let GBM(eng) silently drop below GBM(flat)). + NB02_TARGETS = { + "lr_flat_auc": 0.8737, + "gbm_flat_auc": 0.8432, + "lr_eng_auc": 0.8763, + "gbm_eng_auc": 0.8579, + "headline_lift_auc": 0.0147, # GBM(eng) - GBM(flat) + } + NB02_TOLERANCES = { + "lr_flat_auc": 0.02, + "gbm_flat_auc": 0.02, + "lr_eng_auc": 0.02, + "gbm_eng_auc": 0.02, + "headline_lift_auc": 0.015, # tighter — sign-aware below + } + + observed = { + "lr_flat_auc": rows[0]["auc"], + "gbm_flat_auc": rows[1]["auc"], + "lr_eng_auc": rows[2]["auc"], + "gbm_eng_auc": rows[3]["auc"], + "headline_lift_auc": deltas[0]["auc"], + } + assert_within_tolerance( + observed=observed, + target=NB02_TARGETS, + tolerances=NB02_TOLERANCES, + label="notebook 02 metric panel (seed 42, intermediate)", + ) + assert observed["headline_lift_auc"] > 0.0, ( + "GBM(eng) − GBM(flat) AUC went non-positive — relational " + "lift disappeared; investigate before merging." + ) + print("OK — all panel metrics within tolerance and headline lift is positive.") + """ + ), + md( + """ + ## 8. Honest takeaway - The headline number is **GBM(eng) − GBM(flat) AUC**. On - seed 42, intermediate tier, this lift is positive and - non-trivial: HistGBM closes a meaningful share of the gap to - LR-on-flat once it has the engineered relational features to - chew on. + On seed 42 the GBM(eng) − GBM(flat) AUC lift is small + (+0.0147). Cross-seed variance for `gbm_auc` on this bundle + is ~0.027 (see `release/validation/validation_report.json`, + `tiers.intermediate.spreads.gbm_auc`), so a single-seed lift + of this size is **suggestive, not conclusive**. Confirming a + real signal needs a seed sweep — see the cohort-shift / seed + harness coming in PR 6.2's notebook 04. - However the lift does **not** flip the sign of the GBM-vs-LR - comparison: GBM(eng) is still slightly below LR(flat). This is - the same v1 finding documented in + The lift also does **not** flip the sign of the GBM-vs-LR + comparison: GBM(eng) is still slightly below LR(flat). This + is the same v1 finding documented in `release/validation/validation_report.md` (gate **G7.4.4**) - and in the dataset card: the v1 snapshot is dominated by + and the dataset card: the v1 snapshot is dominated by roughly-linear signal, and HistGBM doesn't consistently beat - LR on it. Engineered relational features narrow the gap; they - don't yet erase it. + LR on it. Engineered relational features narrow the gap; on + this seed they don't yet erase it. Two takeaways for downstream users: @@ -522,48 +631,27 @@ def delta(eng: np.ndarray, base: np.ndarray, name: str) -> dict[str, float]: construction.** Section 3 above is the full proof. You can aggregate any of the four event tables without policing the horizon yourself. - 2. **Bring your own non-linearities.** If you can find a - feature engineering choice (cross-table interactions, tree - kernels, learned embeddings) that flips the GBM-vs-LR sign, - please [open a `realism` issue](../docs/release/break_me_guide.md) - (template lands in PR 6.3) — that's a finding worth seeing. + 2. **Bring your own non-linearities.** If a feature + engineering choice (cross-table interactions, tree + kernels, learned embeddings, bigger seed sweeps) flips the + GBM-vs-LR sign reliably, that's a finding worth filing — + the *break_me_guide* template lands in PR 6.3. ## Next - - **Notebook 03** *(PR 6.2)* — leakage and time-window - walkthrough, including the deliberate `total_touches_all` - trap. - - **Notebook 04** *(PR 6.2)* — value-aware ranking, - calibration, and cohort-shift evaluation. + - **Notebook 03** *(coming in PR 6.2)* — leakage and + time-window walkthrough, including the deliberate + `total_touches_all` trap notebook 01 keeps and this notebook + drops. + - **Notebook 04** *(coming in PR 6.2)* — value-aware ranking, + calibration, and cohort-shift evaluation with a seed sweep. """ ), ] - nb.metadata = { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3", - }, - "language_info": { - "name": "python", - "version": "3.11", - }, - } - return nb def main() -> None: - nb = build() - OUT.parent.mkdir(parents=True, exist_ok=True) - text = json.dumps(nb, indent=1, sort_keys=True, ensure_ascii=False) - OUT.write_text(text + "\n", encoding="utf-8") - # Run ruff format on the emitted notebook so the builder's output - # matches the project's pre-commit hook byte-for-byte. Without this - # step a contributor running the builder would see pre-commit - # reformat their notebook on commit, defeating the audit-artifact- - # sync invariant. - subprocess.run(["ruff", "format", str(OUT)], check=True) # noqa: S603, S607 - print(f"wrote {OUT}") + write_notebook(OUT, assemble_notebook(cells())) if __name__ == "__main__": diff --git a/tests/release/notebooks/test_notebook_utils.py b/tests/release/notebooks/test_notebook_utils.py index ad09d95..208545f 100644 --- a/tests/release/notebooks/test_notebook_utils.py +++ b/tests/release/notebooks/test_notebook_utils.py @@ -140,3 +140,37 @@ def test_assert_within_tolerance_ignores_extra_observed_keys() -> None: target={"auc": 0.886, "ap": 0.575}, tolerances=0.05, ) + + +# --------------------------------------------------------------------------- +# precision_at_k must mirror release_quality._precision_at_k +# --------------------------------------------------------------------------- + + +def test_precision_at_k_mirrors_release_quality() -> None: + """The notebook helper's docstring claims byte-equivalence with + ``leadforge.validation.release_quality._precision_at_k`` (same + stable argsort, same tie-breaking). This test pins that claim: + if either implementation drifts, the notebook's reproduction gate + silently drifts with it. + """ + from leadforge.validation import release_quality + + rng = np.random.default_rng(0) + scores = rng.random(1000) + y = (rng.random(1000) > 0.7).astype(int) + + for k in (1, 10, 50, 100, 250, 500, 999): + nbu_value = nbu.precision_at_k(scores, y, k) + rq_value = release_quality._precision_at_k(scores, y, k) + assert nbu_value == pytest.approx(rq_value), ( + f"divergence at k={k}: notebook helper {nbu_value}, release_quality {rq_value}" + ) + + # Tied scores — the convention drift this is most likely to surface. + tied_scores = np.array([0.5, 0.5, 0.5, 0.5, 0.4, 0.4, 0.3]) + tied_y = np.array([1, 0, 1, 0, 1, 1, 0]) + for k in (1, 2, 4, 6, 7): + assert nbu.precision_at_k(tied_scores, tied_y, k) == pytest.approx( + release_quality._precision_at_k(tied_scores, tied_y, k) + ) diff --git a/tests/release/notebooks/test_release_targets_match_report.py b/tests/release/notebooks/test_release_targets_match_report.py new file mode 100644 index 0000000..70f4edf --- /dev/null +++ b/tests/release/notebooks/test_release_targets_match_report.py @@ -0,0 +1,44 @@ +"""Audit-sync gate: ``release/notebooks/_release_targets.json`` must +mirror the cross-seed-median values in +``release/validation/validation_report.json``. + +Notebook 01 loads the targets file at runtime and pins its computed +metrics against it via ``assert_within_tolerance`` (G13.2). Without +this audit, the targets file could silently drift from the validation +report — at which point the notebook's "reproduction" gate stops +reproducing anything in particular. This test fails as soon as either +file moves out of step. +""" + +from __future__ import annotations + +import json +from pathlib import Path + +_REPO_ROOT = Path(__file__).resolve().parents[3] +_TARGETS_PATH = _REPO_ROOT / "release" / "notebooks" / "_release_targets.json" +_REPORT_PATH = _REPO_ROOT / "release" / "validation" / "validation_report.json" + + +def test_release_targets_match_validation_report() -> None: + targets = json.loads(_TARGETS_PATH.read_text()) + report = json.loads(_REPORT_PATH.read_text()) + + for tier_name, tier_targets in targets.items(): + if tier_name.startswith("_"): + continue # ``_doc`` and any other meta keys + assert tier_name in report["tiers"], ( + f"targets file mentions tier {tier_name!r} which is absent from " + f"validation_report.json (known tiers: {list(report['tiers'])})" + ) + report_medians = report["tiers"][tier_name]["medians"] + for metric_name, target_value in tier_targets.items(): + assert metric_name in report_medians, ( + f"{tier_name}.{metric_name}: pinned in targets file but absent " + f"from validation_report medians" + ) + assert target_value == report_medians[metric_name], ( + f"{tier_name}.{metric_name}: targets file has {target_value} " + f"but validation_report median is {report_medians[metric_name]} — " + "regenerate the report or update _release_targets.json" + ) diff --git a/tests/scripts/test_release_notebook_builders.py b/tests/scripts/test_release_notebook_builders.py new file mode 100644 index 0000000..8898084 --- /dev/null +++ b/tests/scripts/test_release_notebook_builders.py @@ -0,0 +1,81 @@ +"""Byte-stability gate for the release-notebook builders. + +The builders advertise an audit-artifact-sync invariant (PR 4.1 / 5.1 / +5.2 pattern): re-running the builder must produce a byte-identical +``.ipynb``, and the committed file under ``release/notebooks/`` must +equal a fresh build. Without this test the invariant is wishful +thinking — ``nbformat.v4.new_*_cell`` randomises cell IDs by default, +so an unguarded builder silently diverges on every run. + +We don't run the builders by importing them — they shell out to +``ruff format`` and that's deterministic only via subprocess. Instead, +we invoke each builder twice into temp paths and diff the outputs, then +diff a third invocation against the committed file. +""" + +from __future__ import annotations + +import shutil +import subprocess +import sys +from pathlib import Path + +import pytest + +_REPO_ROOT = Path(__file__).resolve().parents[2] +_SCRIPTS_DIR = _REPO_ROOT / "scripts" +_NOTEBOOKS_DIR = _REPO_ROOT / "release" / "notebooks" + +_BUILDERS: list[tuple[str, str]] = [ + ("build_release_notebook_01.py", "01_baseline_lead_scoring.ipynb"), + ("build_release_notebook_02.py", "02_relational_feature_engineering.ipynb"), +] + + +@pytest.mark.parametrize(("builder_name", "notebook_name"), _BUILDERS) +def test_builder_is_byte_stable_and_matches_committed( + tmp_path: Path, + builder_name: str, + notebook_name: str, +) -> None: + """Run the builder twice into separate temp homes and assert the + emitted ``.ipynb`` files are byte-identical, and that the committed + notebook matches one of them. + + The builder writes to ``release/notebooks/`` by default, + so we redirect by snapshotting + restoring the committed file + around the build calls. + """ + builder_path = _SCRIPTS_DIR / builder_name + committed_path = _NOTEBOOKS_DIR / notebook_name + assert builder_path.exists(), f"missing builder: {builder_path}" + assert committed_path.exists(), f"missing committed notebook: {committed_path}" + + backup = tmp_path / "committed_backup.ipynb" + shutil.copy2(committed_path, backup) + + run_a = tmp_path / "run_a.ipynb" + run_b = tmp_path / "run_b.ipynb" + + try: + subprocess.run( # noqa: S603 — sys.executable + repo-internal builder path + [sys.executable, str(builder_path)], check=True, cwd=_REPO_ROOT + ) + shutil.copy2(committed_path, run_a) + subprocess.run( # noqa: S603 — sys.executable + repo-internal builder path + [sys.executable, str(builder_path)], check=True, cwd=_REPO_ROOT + ) + shutil.copy2(committed_path, run_b) + + assert run_a.read_bytes() == run_b.read_bytes(), ( + f"{builder_name}: two runs produced different bytes — cell IDs are " + "non-deterministic; pass an explicit ``id=`` to nbformat cell " + "constructors (see scripts/_release_notebook_common.py)" + ) + assert backup.read_bytes() == run_a.read_bytes(), ( + f"{notebook_name}: committed file does not match a fresh build of " + f"{builder_name} — re-run the builder and commit the result " + "(audit-artifact-sync, same pattern as PR 4.1 / 5.1 / 5.2)" + ) + finally: + shutil.copy2(backup, committed_path) From afdfa645cf1b885492d5f7bbd8ce92929791ee4b Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Thu, 7 May 2026 08:49:43 +0300 Subject: [PATCH 4/6] PR 6.1 Copilot review pass: builder --out, NaN-safe gate, top_decile_rate mirror, doc/code alignment MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Folds in all six unresolved Copilot review threads on this PR. C1 — ``--out`` flag on the notebook builders (was: byte-stability test mutated the committed file in place). Both builders now expose ``--out PATH`` via a shared ``_release_notebook_common.builder_arg_parser``; ``tests/scripts/test_release_notebook_builders.py`` builds twice into ``tmp_path`` and diffs against the committed file. Side effect: the test is now safe under pytest-xdist parallel execution and an interrupted run can no longer leave a corrupted ``release/notebooks/.ipynb``. C2 — ``assert_within_tolerance`` rejects silent-pass paths (was: NaN observed/target slipped through because ``NaN > tol`` is ``False``; per-metric tolerance maps missing keys defaulted to ``+inf``, disabling the gate for those metrics). Both paths now fail with clear messages; the docstring describes the contract. C3 — ``top_decile_rate`` gets the mirror test ``precision_at_k`` already had. Both already use the same ``max(1, int(round(n * 0.1)))`` rule as ``release_quality._top_decile_rate``, so no impl change — the test locks that in across small-``n`` (banker's rounding) and the exact ``n_test = 750`` the intermediate tier ships. Drift between ``_notebook_utils`` and ``release_quality`` would now break CI. C4 — ``.agent-plan.md`` and the GitHub PR body both rewritten to match reality after rev-2: notebook 01 *keeps* the trap (matches ``release_quality._partition_columns``, so the G13.2 reproduction gate is meaningful); notebook 02 *drops* the trap (clean lift attribution). The PR body also documents the rev-1 → rev-2 → rev-3 revision history. C5 — ``test_execute_notebooks.py`` docstring rewritten: the gate flow loads ``_release_targets.json`` (audit-synced against ``validation_report.json``), not ``validation_report.md`` directly, and tolerances are per-metric (not flat ±0.05). C6 — four new tests in ``test_notebook_utils.py`` cover the failure modes C2 plugs: NaN observed → fail, inf observed → fail, NaN target → fail, missing-tolerance-key → fail. Acceptance gate results on this branch: - ruff check . / ruff format --check . — clean - mypy leadforge/ — clean (82 source files) - pytest — 1255 passed, 5 skipped (publish-extra-gated) - probe_relational_leakage on intro/intermediate/advanced — 0/3 paths flag rate on every tier - verify_hash_determinism — PASS 67/67 - validate_release_candidate --no-rebuild — PASS, 0 leakage findings - BUNDLE_SCHEMA_VERSION unchanged at 5 Co-Authored-By: Claude Opus 4.7 --- .agent-plan.md | 2 +- release/notebooks/_notebook_utils.py | 43 +++++++++-- scripts/_release_notebook_common.py | 25 +++++++ scripts/build_release_notebook_01.py | 9 ++- scripts/build_release_notebook_02.py | 9 ++- .../notebooks/test_execute_notebooks.py | 11 ++- .../release/notebooks/test_notebook_utils.py | 71 +++++++++++++++++++ .../scripts/test_release_notebook_builders.py | 65 ++++++++--------- 8 files changed, 184 insertions(+), 51 deletions(-) diff --git a/.agent-plan.md b/.agent-plan.md index d9b545a..f715f35 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -58,7 +58,7 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family - [x] PR 5.2 (self-review pass): brutal review of the first revision caught real bugs the reviewer would otherwise have to call out. Fixes folded into the PR before review: (#1) `run_packager` validate→write order — both packagers were writing the README/metadata even when validation failed; the validation gate now early-returns with `errors` populated and zero artifacts on disk; new tests exercise the no-write path on both sides. (#2) Instructor README was inlining the public 3-tier README for a 1-tier dataset; replaced with a dedicated `INSTRUCTOR_BODY` constant that opens by linking to the public dataset, describes only the instructor-specific additions (full-horizon tables, hidden DAG, latent registry, mechanism summary), and uses the single-tier tree block. (#3) `validate_upload_dir_safe` now also rejects strict descendants of `release_dir` (e.g. `--huggingface-dir release/intro` would otherwise rmtree the intro bundle); allow-list keeps the canonical `release/{kaggle,huggingface,huggingface-instructor}` direct-children. (#4) `[publish]` extra in `pyproject.toml` (`datasets>=2.14`, `kaggle>=1.6`) makes the gated `load_dataset()` / Kaggle-CLI tests installable in a single command — closes the "G12.3/G12.4 untested in CI" gap to a one-line install. (#5) Shared-primitives extraction finished: `SOURCE_TREE_BLOCK`, `validate_readme_substitution`, `replace_file`, `replace_dir`, `load_manifest` all moved to `scripts/_release_common.py`; both packagers reduced to imports. (#6) Hand-rolled YAML renderer (60 lines + brittle quoting heuristic) replaced with `yaml.safe_dump` + a 4-line `_IndentedDumper` subclass that forces indent-2 on top-level sequences. (#7) Dead `--owner` / `--dataset-slug` CLI flags removed (PR 7.2 will add them when actually needed). (#8) `assemble_upload_dir` now takes `rendered_readme` as a parameter and writes it itself; the public name no longer lies about producing a complete tree. (#9) `build_config_for_tier` made pure (no I/O); `_assert_tier_dir_exists` does the cheap manifest-stat preflight. (#10) `--default-config` with `--variant=instructor` now errors instead of silently ignoring. (#11) Instructor tree-diagram drops the hardcoded "9 tables" claim. (#13–#16) Visual cleanups (duplicate divider, ruff-split imports, `COVER_IMAGE_FILENAME`-vs-`Path.name` redundancy, speculative comment about HF split rename). (#17) Test cruft removed (unused `tmp_path`, dead `tag_lines`); em-dash YAML round-trip parametrised for the instructor `pretty_name`. Net: 1223/1223 tests pass + 5 gated skips (4 `datasets`-SDK round-trip + 1 Kaggle `kaggle`-SDK from PR 5.1); ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR doesn't touch the bundle shape). ### Phase 6 — Notebook sequence + adversarial framing -- [x] PR 6.1: `release/notebooks/01_baseline_lead_scoring.ipynb` refreshed and `release/notebooks/02_relational_feature_engineering.ipynb` added. Notebook 01 trains LR + HistGBM on the public `intermediate` bundle, drops the `total_touches_all` leakage trap (referenced for pedagogy + forward-pointed to notebook 03), and pins its metrics to `release/validation/validation_report.md` within ±0.05 via a new `assert_within_tolerance` helper (G13.2 acceptance gate). Notebook 02 loads the seven snapshot-safe public tables, asserts every event-table `timestamp <= lead_created_at + snapshot_day` inline, demonstrates four legal joins (touch-channel breakdown, account-level density, sales-activity recency, train-only industry target encoding), then trains LR + GBM on flat-baseline-only and flat+relational features and prints a 4-row metric panel + delta panel. Honest takeaway cell quotes G7.4.4: GBM(eng)−GBM(flat) is the headline lift, GBM(eng)−LR(flat) stays slightly negative — consistent with the v1 finding that flat-CSV LR is hard to beat. Both notebooks ship inside the public release bundle alongside the parquet tables (Kaggle/HF consumers download them together) so they import a sibling `release/notebooks/_notebook_utils.py` rather than rely on the `leadforge` package — `precision_at_k` mirrors `release_quality._precision_at_k` (stable argsort), `top_decile_rate` and `assert_within_tolerance` round it out. G13.1 acceptance gate wired: new `[notebooks]` extra (`nbclient`, `nbformat`, `scikit-learn`, `matplotlib`) and a dedicated `notebooks` CI job that regenerates the public bundles via `scripts/build_public_release.py` then nbclient-executes both notebooks end-to-end (`tests/release/notebooks/test_execute_notebooks.py`, parametrised over the two notebooks, gated on bundles-present using the same skip pattern as `test_package_*`). G13.3 path discipline enforced inline: notebook 01 hard-codes `BUNDLE = Path("../intermediate")` and asserts `manifest.exposure_mode == "student_public"`; notebook 02 explicitly excludes `customers`/`subscriptions` per `BANNED_TABLES`. Builders (`scripts/_build_release_notebook_{01,02}.py`) emit deterministic byte-for-byte notebook JSON (audit-artifact-sync pattern from PR 4.1 / 5.1 / 5.2) and shell out to `ruff format` on the emitted file so builder output and pre-commit hook agree. Net: 1246/1246 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5. +- [x] PR 6.1: `release/notebooks/01_baseline_lead_scoring.ipynb` refreshed and `release/notebooks/02_relational_feature_engineering.ipynb` added. Notebook 01 trains LR + HistGBM on the public `intermediate` bundle using the **same feature set as the validation report** (drops only IDs and the label, mirrors `release_quality._partition_columns`), so the G13.2 reproduction gate compares apples to apples. This means notebook 01 **keeps** `total_touches_all` (the documented leakage trap) — narrative cell calls it out explicitly and forward-points to notebook 03 (PR 6.2) which dissects what dropping the trap does to performance. Notebook 02 by contrast **drops** the trap from the flat baseline so the relational lift attribution stays clean (its goal is teaching feature engineering, not reproducing the report). Targets are loaded at runtime from `release/notebooks/_release_targets.json` (audit-synced against `release/validation/validation_report.json` by `tests/release/notebooks/test_release_targets_match_report.py`); per-metric tolerances replace the original flat ±0.05 (AUC/Brier ±0.02, AP / top-decile ±0.05). Notebook 02 loads the seven snapshot-safe public tables, asserts every event-table `timestamp <= lead_created_at + snapshot_day` inline (with real min-headroom-under-cutoff readings, not a hardcoded literal), demonstrates four legal joins (touch-channel breakdown, account-level density fit on **train leads only**, sales-activity recency, train-only industry target encoding), trains LR + GBM on flat-baseline-only and flat+relational features, prints a 4-row metric panel + delta panel, and pins the four model AUCs and the headline `GBM(eng) − GBM(flat)` lift via `assert_within_tolerance` (sign-aware `assert lift > 0` on top of the absolute tolerance). Honest takeaway cell frames the +0.0147 AUC lift as suggestive, not conclusive (the cross-seed `gbm_auc` spread on this bundle is ~0.027); seed-sweep harness lands in PR 6.2's notebook 04. Both notebooks ship inside the public release bundle alongside the parquet tables (Kaggle/HF consumers download them together) so they import a sibling `release/notebooks/_notebook_utils.py` rather than rely on the `leadforge` package — `precision_at_k` and `top_decile_rate` mirror `release_quality._precision_at_k` / `_top_decile_rate` (locked in by mirror tests), and `assert_within_tolerance` is hardened against silent passes on non-finite metrics or incomplete per-metric tolerance maps. G13.1 acceptance gate wired: new `[notebooks]` extra (`nbclient`, `nbformat`, `scikit-learn`, `matplotlib`) and a dedicated `notebooks` CI job that regenerates the intermediate bundle via `python scripts/build_public_release.py release --tier intermediate` (only tier the notebooks need) then nbclient-executes both notebooks end-to-end (`tests/release/notebooks/test_execute_notebooks.py`, parametrised, gated on bundles-present). G13.3 path discipline enforced inline: notebook 01 hard-codes `BUNDLE = Path("../intermediate")` and asserts `manifest.exposure_mode == "student_public"`; notebook 02 explicitly excludes `customers`/`subscriptions` per `BANNED_TABLES`. Builders (`scripts/build_release_notebook_{01,02}.py`, sharing `scripts/_release_notebook_common.py`) emit deterministic byte-for-byte notebook JSON via explicit `cell_NNN` IDs (audit-artifact-sync pattern from PR 4.1 / 5.1 / 5.2, locked in by `tests/scripts/test_release_notebook_builders.py` which builds twice into `tmp_path` via the new `--out PATH` flag and diffs against the committed file without ever touching the working tree) and shell out to `ruff format` on the emitted file so builder output and pre-commit hook agree. Net: 1250/1250 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5. - [ ] `release/notebooks/{03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb` - [ ] `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml` - [ ] `docs/release/{break_me_guide,v2_decision_log}.md` diff --git a/release/notebooks/_notebook_utils.py b/release/notebooks/_notebook_utils.py index 4033a94..c2baf71 100644 --- a/release/notebooks/_notebook_utils.py +++ b/release/notebooks/_notebook_utils.py @@ -13,6 +13,7 @@ from __future__ import annotations +import math from collections.abc import Mapping import numpy as np @@ -54,33 +55,61 @@ def assert_within_tolerance( cross-seed-median values from ``release/validation/validation_report.md`` and fails loudly if the notebook drifts out of band. + The gate is intentionally strict about silent-pass paths: + + * Non-finite ``observed`` or ``target`` values fail (rather than + slipping through because ``NaN > tol`` evaluates ``False``). + * When ``tolerances`` is a mapping, every key in ``target`` must + have an explicit tolerance — a missing entry is treated as a + configuration error and aborts the gate up front, instead of + defaulting to ``+inf`` and silently disabling the check for + that metric. + Args: observed: Notebook-computed metrics, keyed by metric name. target: Reference values from the validation report. - tolerances: Either a per-metric tolerance map, or a single float - applied to every metric (G13.2's default is 0.05). + tolerances: Either a per-metric tolerance map (every key in + ``target`` must be present), or a single float applied to + every metric (G13.2's default is 0.05). label: Human-readable name for the metric panel; appears in the error message so the failing assertion identifies its source. Raises: AssertionError: when any metric falls outside its tolerance, - or when ``observed`` is missing a key listed in ``target``. + ``observed`` is missing a key listed in ``target``, an + ``observed`` / ``target`` value is non-finite, or + ``tolerances`` is a mapping that omits a required key. """ if isinstance(tolerances, int | float): per_key: Mapping[str, float] = {k: float(tolerances) for k in target} else: per_key = tolerances + missing_tolerances = [k for k in target if k not in per_key] + if missing_tolerances: + raise AssertionError( + f"{label}: tolerances mapping is missing entries for " + f"target metrics: {sorted(missing_tolerances)}. " + "Falling back to +inf would silently disable the gate " + "for these metrics; declare an explicit tolerance per key." + ) failures: list[str] = [] for key, target_value in target.items(): if key not in observed: failures.append(f" {key}: missing from observed metrics") continue - tol = float(per_key.get(key, float("inf"))) - diff = abs(float(observed[key]) - float(target_value)) + observed_f = float(observed[key]) + target_f = float(target_value) + if not (math.isfinite(observed_f) and math.isfinite(target_f)): + failures.append( + f" {key}: non-finite value (observed={observed_f}, " + f"target={target_f}) — gate refuses to silently pass NaN/inf" + ) + continue + tol = float(per_key[key]) + diff = abs(observed_f - target_f) if diff > tol: failures.append( - f" {key}: observed={float(observed[key]):.4f} " - f"target={float(target_value):.4f} " + f" {key}: observed={observed_f:.4f} target={target_f:.4f} " f"|diff|={diff:.4f} > tol={tol:.4f}" ) if failures: diff --git a/scripts/_release_notebook_common.py b/scripts/_release_notebook_common.py index a3daf8d..ef25263 100644 --- a/scripts/_release_notebook_common.py +++ b/scripts/_release_notebook_common.py @@ -14,10 +14,17 @@ ``nbformat.v4.new_*_cell``; without an explicit override every build diverges and the byte-equality test in ``tests/scripts/test_release_notebook_builders.py`` fails. + +Each builder exposes an ``--out PATH`` flag (see ``builder_arg_parser``) +so the byte-stability test can build into ``tmp_path`` rather than +mutating the committed ``release/notebooks/.ipynb`` in place. +That removes a pytest-xdist race and the worktree-dirtying failure +mode under interrupted runs. """ from __future__ import annotations +import argparse import json import subprocess from pathlib import Path @@ -74,3 +81,21 @@ def write_notebook(out_path: Path, nb: nbf.NotebookNode) -> None: out_path.write_text(text + "\n", encoding="utf-8") subprocess.run(["ruff", "format", str(out_path)], check=True) # noqa: S603, S607 print(f"wrote {out_path}") + + +def builder_arg_parser(*, default_out: Path, description: str) -> argparse.ArgumentParser: + """Return the argparse parser shared by every notebook builder. + + Exposes ``--out PATH`` so callers (notably the byte-stability test) + can redirect the build into a temp directory. Defaults to the + canonical committed ``release/notebooks/.ipynb`` path so the + no-arg invocation contributors run today keeps working. + """ + parser = argparse.ArgumentParser(description=description) + parser.add_argument( + "--out", + type=Path, + default=default_out, + help=f"Path to write the notebook to (default: {default_out})", + ) + return parser diff --git a/scripts/build_release_notebook_01.py b/scripts/build_release_notebook_01.py index 7388fe1..fc6b14f 100644 --- a/scripts/build_release_notebook_01.py +++ b/scripts/build_release_notebook_01.py @@ -22,12 +22,13 @@ import nbformat as nbf # noqa: E402 — must follow sys.path insert from _release_notebook_common import ( # noqa: E402 — must follow sys.path insert assemble_notebook, + builder_arg_parser, code, md, write_notebook, ) -OUT = ( +DEFAULT_OUT = ( Path(__file__).resolve().parents[1] / "release" / "notebooks" / "01_baseline_lead_scoring.ipynb" ) @@ -425,7 +426,11 @@ def _sanitize_categoricals(df: pd.DataFrame) -> pd.DataFrame: def main() -> None: - write_notebook(OUT, assemble_notebook(cells())) + args = builder_arg_parser( + default_out=DEFAULT_OUT, + description="Build release/notebooks/01_baseline_lead_scoring.ipynb", + ).parse_args() + write_notebook(args.out, assemble_notebook(cells())) if __name__ == "__main__": diff --git a/scripts/build_release_notebook_02.py b/scripts/build_release_notebook_02.py index 049797f..ea2cf67 100644 --- a/scripts/build_release_notebook_02.py +++ b/scripts/build_release_notebook_02.py @@ -19,12 +19,13 @@ import nbformat as nbf # noqa: E402 — must follow sys.path insert from _release_notebook_common import ( # noqa: E402 — must follow sys.path insert assemble_notebook, + builder_arg_parser, code, md, write_notebook, ) -OUT = ( +DEFAULT_OUT = ( Path(__file__).resolve().parents[1] / "release" / "notebooks" @@ -651,7 +652,11 @@ def delta(eng: np.ndarray, base: np.ndarray, name: str) -> dict[str, float]: def main() -> None: - write_notebook(OUT, assemble_notebook(cells())) + args = builder_arg_parser( + default_out=DEFAULT_OUT, + description="Build release/notebooks/02_relational_feature_engineering.ipynb", + ).parse_args() + write_notebook(args.out, assemble_notebook(cells())) if __name__ == "__main__": diff --git a/tests/release/notebooks/test_execute_notebooks.py b/tests/release/notebooks/test_execute_notebooks.py index 7b0fcbc..24beec3 100644 --- a/tests/release/notebooks/test_execute_notebooks.py +++ b/tests/release/notebooks/test_execute_notebooks.py @@ -4,13 +4,18 @@ bottom with ``nbclient`` against the committed public release bundles. A clean run is the contract: -* G13.1 — every cell executes from a clean kernel without raising +* G13.1 — every cell executes from a clean kernel without raising. * G13.2 — notebook 01's ``assert_within_tolerance`` cell pins notebook - metrics to ``release/validation/validation_report.md`` within ±0.05 + metrics to the cross-seed-median targets in + ``release/notebooks/_release_targets.json`` (per-metric tolerances; + AUC/Brier ±0.02, AP / top-decile ±0.05). The targets file is itself + audit-synced against ``release/validation/validation_report.json`` + by ``test_release_targets_match_report.py``, so the gate + transitively pins to the validation report. * G13.3 — neither notebook touches ``release/intermediate_instructor`` or any other instructor artefact (enforced by the notebooks' own ``BUNDLE = Path("../intermediate")`` path discipline; an instructor - load would fail the public-mode manifest assertion in cell 1) + load would fail the public-mode manifest assertion in cell 1). The test is gated on the public release bundles being on disk (matches the HF/Kaggle smoke-test pattern in ``tests/scripts/test_package_*``). diff --git a/tests/release/notebooks/test_notebook_utils.py b/tests/release/notebooks/test_notebook_utils.py index 208545f..80e2f59 100644 --- a/tests/release/notebooks/test_notebook_utils.py +++ b/tests/release/notebooks/test_notebook_utils.py @@ -142,6 +142,57 @@ def test_assert_within_tolerance_ignores_extra_observed_keys() -> None: ) +# --------------------------------------------------------------------------- +# Silent-pass paths: NaN / inf observed and target, missing tolerance keys +# --------------------------------------------------------------------------- + + +def test_assert_within_tolerance_fails_on_nan_observed() -> None: + """``NaN > tol`` is ``False`` in IEEE 754; without an explicit + finiteness check a NaN-valued observed metric would slip through + the gate. + """ + with pytest.raises(AssertionError, match="non-finite"): + nbu.assert_within_tolerance( + observed={"auc": float("nan")}, + target={"auc": 0.886}, + tolerances=0.05, + ) + + +def test_assert_within_tolerance_fails_on_inf_observed() -> None: + with pytest.raises(AssertionError, match="non-finite"): + nbu.assert_within_tolerance( + observed={"auc": float("inf")}, + target={"auc": 0.886}, + tolerances=0.05, + ) + + +def test_assert_within_tolerance_fails_on_nan_target() -> None: + """A NaN target would also produce a NaN diff and bypass the gate.""" + with pytest.raises(AssertionError, match="non-finite"): + nbu.assert_within_tolerance( + observed={"auc": 0.85}, + target={"auc": float("nan")}, + tolerances=0.05, + ) + + +def test_assert_within_tolerance_rejects_incomplete_tolerance_mapping() -> None: + """A per-metric tolerances dict missing keys present in ``target`` + used to default each missing key to ``+inf``, silently disabling the + gate for those metrics. The fix is to fail up front with the list + of missing keys, treating it as a configuration error. + """ + with pytest.raises(AssertionError, match="missing entries for target metrics"): + nbu.assert_within_tolerance( + observed={"auc": 0.88, "brier": 0.10}, + target={"auc": 0.886, "brier": 0.110}, + tolerances={"auc": 0.05}, # ``brier`` deliberately missing + ) + + # --------------------------------------------------------------------------- # precision_at_k must mirror release_quality._precision_at_k # --------------------------------------------------------------------------- @@ -174,3 +225,23 @@ def test_precision_at_k_mirrors_release_quality() -> None: assert nbu.precision_at_k(tied_scores, tied_y, k) == pytest.approx( release_quality._precision_at_k(tied_scores, tied_y, k) ) + + +def test_top_decile_rate_mirrors_release_quality() -> None: + """``release_quality._top_decile_rate`` and the notebook helper share + the same ``max(1, int(round(n * 0.1)))`` k-selection rule today. + Lock that in: if either side ever changes (e.g. switches to + ``ceil`` or ``floor`` on edge cases), the gate would silently + diverge from the validation report. Includes the exact `n` the + intermediate tier ships (``n_test = 750``) and a few small `n` + where banker's rounding bites. + """ + from leadforge.validation import release_quality + + rng = np.random.default_rng(0) + for n in (5, 10, 25, 99, 100, 750, 1234): + scores = rng.random(n) + y = (rng.random(n) > 0.7).astype(int) + assert nbu.top_decile_rate(scores, y) == pytest.approx( + release_quality._top_decile_rate(scores, y) + ), f"divergence at n={n}" diff --git a/tests/scripts/test_release_notebook_builders.py b/tests/scripts/test_release_notebook_builders.py index 8898084..e2f4ece 100644 --- a/tests/scripts/test_release_notebook_builders.py +++ b/tests/scripts/test_release_notebook_builders.py @@ -7,15 +7,16 @@ thinking — ``nbformat.v4.new_*_cell`` randomises cell IDs by default, so an unguarded builder silently diverges on every run. -We don't run the builders by importing them — they shell out to -``ruff format`` and that's deterministic only via subprocess. Instead, -we invoke each builder twice into temp paths and diff the outputs, then -diff a third invocation against the committed file. +The builders accept ``--out PATH`` so this test can build into +``tmp_path`` instead of mutating the committed ``release/notebooks/`` +file. That keeps the test parallel-safe (pytest-xdist running both +parametrised cases at once won't race), interrupt-safe (an interrupted +run can't leave the working tree dirty), and side-effect-free (the +committed notebook is never touched). """ from __future__ import annotations -import shutil import subprocess import sys from pathlib import Path @@ -38,44 +39,36 @@ def test_builder_is_byte_stable_and_matches_committed( builder_name: str, notebook_name: str, ) -> None: - """Run the builder twice into separate temp homes and assert the - emitted ``.ipynb`` files are byte-identical, and that the committed - notebook matches one of them. - - The builder writes to ``release/notebooks/`` by default, - so we redirect by snapshotting + restoring the committed file - around the build calls. + """Build twice into ``tmp_path`` via ``--out``; assert the two runs + produce byte-identical output and that the committed notebook + matches them. """ builder_path = _SCRIPTS_DIR / builder_name committed_path = _NOTEBOOKS_DIR / notebook_name assert builder_path.exists(), f"missing builder: {builder_path}" assert committed_path.exists(), f"missing committed notebook: {committed_path}" - backup = tmp_path / "committed_backup.ipynb" - shutil.copy2(committed_path, backup) - run_a = tmp_path / "run_a.ipynb" run_b = tmp_path / "run_b.ipynb" - try: - subprocess.run( # noqa: S603 — sys.executable + repo-internal builder path - [sys.executable, str(builder_path)], check=True, cwd=_REPO_ROOT - ) - shutil.copy2(committed_path, run_a) - subprocess.run( # noqa: S603 — sys.executable + repo-internal builder path - [sys.executable, str(builder_path)], check=True, cwd=_REPO_ROOT - ) - shutil.copy2(committed_path, run_b) + subprocess.run( # noqa: S603 — sys.executable + repo-internal builder path + [sys.executable, str(builder_path), "--out", str(run_a)], + check=True, + cwd=_REPO_ROOT, + ) + subprocess.run( # noqa: S603 — sys.executable + repo-internal builder path + [sys.executable, str(builder_path), "--out", str(run_b)], + check=True, + cwd=_REPO_ROOT, + ) - assert run_a.read_bytes() == run_b.read_bytes(), ( - f"{builder_name}: two runs produced different bytes — cell IDs are " - "non-deterministic; pass an explicit ``id=`` to nbformat cell " - "constructors (see scripts/_release_notebook_common.py)" - ) - assert backup.read_bytes() == run_a.read_bytes(), ( - f"{notebook_name}: committed file does not match a fresh build of " - f"{builder_name} — re-run the builder and commit the result " - "(audit-artifact-sync, same pattern as PR 4.1 / 5.1 / 5.2)" - ) - finally: - shutil.copy2(backup, committed_path) + assert run_a.read_bytes() == run_b.read_bytes(), ( + f"{builder_name}: two runs produced different bytes — cell IDs are " + "non-deterministic; pass an explicit ``id=`` to nbformat cell " + "constructors (see scripts/_release_notebook_common.py)" + ) + assert committed_path.read_bytes() == run_a.read_bytes(), ( + f"{notebook_name}: committed file does not match a fresh build of " + f"{builder_name} — re-run the builder and commit the result " + "(audit-artifact-sync, same pattern as PR 4.1 / 5.1 / 5.2)" + ) From 2987ca239ff11487565420c5033507cfbb098062 Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Thu, 7 May 2026 08:55:16 +0300 Subject: [PATCH 5/6] PR 6.1 CI fixups: pin ruff, register python3 kernelspec for nbclient Two real CI failures on the rev-3 commit, both fixed here: 1. **Lint job** failed with ``Would reformat: release/notebooks/02_relational_feature_engineering.ipynb``. CI's ``pip install ruff`` resolved to 0.15.12; my local was 0.11.13. Newer ruff has slightly different ``.ipynb`` cell formatting rules, so the committed notebook (formatted by 0.11.13) didn't pass 0.15.12's check. Fix: pin ``ruff==0.15.12`` in the CI lint job and rebuild ``02_relational_feature_engineering.ipynb`` against 0.15.12 so committed bytes match. Pin lives only in CI (not in ``pyproject.toml`` ``[dev]``) to keep contributor laptops flexible; when CI bumps, re-run the builder locally with the matching ruff and commit the updated ``.ipynb``. 2. **Notebooks job** failed with ``jupyter_client.kernelspec.NoSuchKernel: No such kernel named python3``. GitHub-hosted runners don't ship with any Jupyter kernelspecs; ``[notebooks]`` only pulled ``nbclient`` / ``nbformat``, neither of which provides a kernelspec. The execution test passes ``kernel_name="python3"`` so it needs an actual ``python3`` spec on disk. Fix: add ``ipykernel>=6.0`` to the ``[notebooks]`` extra, plus a CI step ``python -m ipykernel install --user --name python3`` that registers the spec before nbclient boots a kernel. Co-Authored-By: Claude Opus 4.7 --- .github/workflows/ci.yml | 8 +++++++- pyproject.toml | 6 ++++++ release/notebooks/02_relational_feature_engineering.ipynb | 4 +--- 3 files changed, 14 insertions(+), 4 deletions(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index ba794cc..5f9601b 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -18,7 +18,11 @@ jobs: - uses: actions/setup-python@v5 with: python-version: "3.12" - - run: pip install ruff + # Pin ruff so notebook .ipynb formatting (which the lint job is + # strict about) doesn't drift between contributor laptops and CI + # the moment a new ruff release ships. Bump in lock-step with + # any local re-runs of ``scripts/build_release_notebook_*.py``. + - run: pip install 'ruff==0.15.12' - run: ruff check . - run: ruff format --check . @@ -148,6 +152,8 @@ jobs: with: python-version: "3.12" - run: pip install -e ".[dev,scripts,notebooks]" + - name: Register python3 kernelspec for nbclient + run: python -m ipykernel install --user --name python3 - name: Build the intermediate public bundle (only tier the notebooks need) run: python scripts/build_public_release.py release --tier intermediate - name: Execute release notebooks end-to-end diff --git a/pyproject.toml b/pyproject.toml index 689426d..33d2b67 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -64,6 +64,12 @@ publish = [ notebooks = [ "nbclient>=0.10", "nbformat>=5.10", + # ``ipykernel`` provides the ``python3`` kernelspec that + # ``nbclient.NotebookClient(..., kernel_name="python3")`` looks up. + # Without it CI fails with ``NoSuchKernel: No such kernel named + # python3`` because the GitHub-hosted runner has no kernelspecs + # registered out of the box (local dev environments usually do). + "ipykernel>=6.0", "scikit-learn>=1.3", "matplotlib>=3.7", ] diff --git a/release/notebooks/02_relational_feature_engineering.ipynb b/release/notebooks/02_relational_feature_engineering.ipynb index 5d8cf46..84a3b02 100644 --- a/release/notebooks/02_relational_feature_engineering.ipynb +++ b/release/notebooks/02_relational_feature_engineering.ipynb @@ -176,9 +176,7 @@ " account_lead_count=(\"lead_id\", \"nunique\"),\n", " )\n", " .assign(\n", - " account_avg_touches_per_lead=lambda d: (\n", - " d[\"account_total_touches\"] / d[\"account_lead_count\"]\n", - " )\n", + " account_avg_touches_per_lead=lambda d: d[\"account_total_touches\"] / d[\"account_lead_count\"]\n", " )\n", " .reset_index()\n", ")\n", From 10e791f42088d4eb053150907de5847af9c1e565 Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Thu, 7 May 2026 09:00:34 +0300 Subject: [PATCH 6/6] PR 6.1 CI fixup: gate builder byte-stability test on [notebooks] extra MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Tests (Python 3.11/3.12) failed on the previous push with ``ModuleNotFoundError: No module named 'nbformat'``. The byte- stability test in ``tests/scripts/test_release_notebook_builders.py`` shells out to the builders, which import ``nbformat`` — but the main ``test`` CI job installs ``[dev]`` only, not ``[notebooks]``. Fix: - ``pytest.importorskip("nbformat")`` at the top of the byte-stability test, so the main ``test`` job skips it cleanly. - Have the dedicated ``notebooks`` CI job (which already installs ``[dev,scripts,notebooks]``) run both ``test_execute_notebooks.py`` and ``test_release_notebook_builders.py``. That keeps the byte- stability gate in CI without bloating the default dev install. Co-Authored-By: Claude Opus 4.7 --- .github/workflows/ci.yml | 7 +++++-- tests/scripts/test_release_notebook_builders.py | 8 ++++++++ 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 5f9601b..59f11f1 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -156,5 +156,8 @@ jobs: run: python -m ipykernel install --user --name python3 - name: Build the intermediate public bundle (only tier the notebooks need) run: python scripts/build_public_release.py release --tier intermediate - - name: Execute release notebooks end-to-end - run: pytest tests/release/notebooks/test_execute_notebooks.py -v + - name: Execute release notebooks end-to-end + builder byte-stability + run: | + pytest tests/release/notebooks/test_execute_notebooks.py \ + tests/scripts/test_release_notebook_builders.py \ + -v diff --git a/tests/scripts/test_release_notebook_builders.py b/tests/scripts/test_release_notebook_builders.py index e2f4ece..c539a77 100644 --- a/tests/scripts/test_release_notebook_builders.py +++ b/tests/scripts/test_release_notebook_builders.py @@ -43,6 +43,14 @@ def test_builder_is_byte_stable_and_matches_committed( produce byte-identical output and that the committed notebook matches them. """ + # ``nbformat`` lives in the optional ``[notebooks]`` extra; the + # main ``test`` CI job installs only ``[dev]`` and would otherwise + # see the subprocess-invoked builders crash with + # ``ModuleNotFoundError: nbformat``. The dedicated ``notebooks`` + # CI job installs ``[dev,scripts,notebooks]`` and runs this test + # alongside ``test_execute_notebooks.py``. + pytest.importorskip("nbformat", reason="nbformat not installed (use [notebooks] extra)") + builder_path = _SCRIPTS_DIR / builder_name committed_path = _NOTEBOOKS_DIR / notebook_name assert builder_path.exists(), f"missing builder: {builder_path}"