diff --git a/.agent-plan.md b/.agent-plan.md index 6ee4d2d..f715f35 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -58,8 +58,8 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family - [x] PR 5.2 (self-review pass): brutal review of the first revision caught real bugs the reviewer would otherwise have to call out. Fixes folded into the PR before review: (#1) `run_packager` validate→write order — both packagers were writing the README/metadata even when validation failed; the validation gate now early-returns with `errors` populated and zero artifacts on disk; new tests exercise the no-write path on both sides. (#2) Instructor README was inlining the public 3-tier README for a 1-tier dataset; replaced with a dedicated `INSTRUCTOR_BODY` constant that opens by linking to the public dataset, describes only the instructor-specific additions (full-horizon tables, hidden DAG, latent registry, mechanism summary), and uses the single-tier tree block. (#3) `validate_upload_dir_safe` now also rejects strict descendants of `release_dir` (e.g. `--huggingface-dir release/intro` would otherwise rmtree the intro bundle); allow-list keeps the canonical `release/{kaggle,huggingface,huggingface-instructor}` direct-children. (#4) `[publish]` extra in `pyproject.toml` (`datasets>=2.14`, `kaggle>=1.6`) makes the gated `load_dataset()` / Kaggle-CLI tests installable in a single command — closes the "G12.3/G12.4 untested in CI" gap to a one-line install. (#5) Shared-primitives extraction finished: `SOURCE_TREE_BLOCK`, `validate_readme_substitution`, `replace_file`, `replace_dir`, `load_manifest` all moved to `scripts/_release_common.py`; both packagers reduced to imports. (#6) Hand-rolled YAML renderer (60 lines + brittle quoting heuristic) replaced with `yaml.safe_dump` + a 4-line `_IndentedDumper` subclass that forces indent-2 on top-level sequences. (#7) Dead `--owner` / `--dataset-slug` CLI flags removed (PR 7.2 will add them when actually needed). 
(#8) `assemble_upload_dir` now takes `rendered_readme` as a parameter and writes it itself; the public name no longer lies about producing a complete tree. (#9) `build_config_for_tier` made pure (no I/O); `_assert_tier_dir_exists` does the cheap manifest-stat preflight. (#10) `--default-config` with `--variant=instructor` now errors instead of silently ignoring. (#11) Instructor tree-diagram drops the hardcoded "9 tables" claim. (#13–#16) Visual cleanups (duplicate divider, ruff-split imports, `COVER_IMAGE_FILENAME`-vs-`Path.name` redundancy, speculative comment about HF split rename). (#17) Test cruft removed (unused `tmp_path`, dead `tag_lines`); em-dash YAML round-trip parametrised for the instructor `pretty_name`. Net: 1223/1223 tests pass + 5 gated skips (4 `datasets`-SDK round-trip + 1 Kaggle `kaggle`-SDK from PR 5.1); ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR doesn't touch the bundle shape). ### Phase 6 — Notebook sequence + adversarial framing -- [ ] `release/notebooks/{02_relational_feature_engineering,03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb` -- [ ] Update `01_baseline_lead_scoring.ipynb` to reproduce validation report metrics +- [x] PR 6.1: `release/notebooks/01_baseline_lead_scoring.ipynb` refreshed and `release/notebooks/02_relational_feature_engineering.ipynb` added. Notebook 01 trains LR + HistGBM on the public `intermediate` bundle using the **same feature set as the validation report** (drops only IDs and the label, mirrors `release_quality._partition_columns`), so the G13.2 reproduction gate compares apples to apples. 
This means notebook 01 **keeps** `total_touches_all` (the documented leakage trap) — narrative cell calls it out explicitly and forward-points to notebook 03 (PR 6.2) which dissects what dropping the trap does to performance. Notebook 02 by contrast **drops** the trap from the flat baseline so the relational lift attribution stays clean (its goal is teaching feature engineering, not reproducing the report). Targets are loaded at runtime from `release/notebooks/_release_targets.json` (audit-synced against `release/validation/validation_report.json` by `tests/release/notebooks/test_release_targets_match_report.py`); per-metric tolerances replace the original flat ±0.05 (AUC/Brier ±0.02, AP / top-decile ±0.05). Notebook 02 loads the seven snapshot-safe public tables, asserts every event-table `timestamp <= lead_created_at + snapshot_day` inline (with real min-headroom-under-cutoff readings, not a hardcoded literal), demonstrates four legal joins (touch-channel breakdown, account-level density fit on **train leads only**, sales-activity recency, train-only industry target encoding), trains LR + GBM on flat-baseline-only and flat+relational features, prints a 4-row metric panel + delta panel, and pins the four model AUCs and the headline `GBM(eng) − GBM(flat)` lift via `assert_within_tolerance` (sign-aware `assert lift > 0` on top of the absolute tolerance). Honest takeaway cell frames the +0.0147 AUC lift as suggestive, not conclusive (the cross-seed `gbm_auc` spread on this bundle is ~0.027); seed-sweep harness lands in PR 6.2's notebook 04. 
Both notebooks ship inside the public release bundle alongside the parquet tables (Kaggle/HF consumers download them together) so they import a sibling `release/notebooks/_notebook_utils.py` rather than rely on the `leadforge` package — `precision_at_k` and `top_decile_rate` mirror `release_quality._precision_at_k` / `_top_decile_rate` (locked in by mirror tests), and `assert_within_tolerance` is hardened against silent passes on non-finite metrics or incomplete per-metric tolerance maps. G13.1 acceptance gate wired: new `[notebooks]` extra (`nbclient`, `nbformat`, `scikit-learn`, `matplotlib`) and a dedicated `notebooks` CI job that regenerates the intermediate bundle via `python scripts/build_public_release.py release --tier intermediate` (only tier the notebooks need) then nbclient-executes both notebooks end-to-end (`tests/release/notebooks/test_execute_notebooks.py`, parametrised, gated on bundles-present). G13.3 path discipline enforced inline: notebook 01 hard-codes `BUNDLE = Path("../intermediate")` and asserts `manifest.exposure_mode == "student_public"`; notebook 02 explicitly excludes `customers`/`subscriptions` per `BANNED_TABLES`. Builders (`scripts/build_release_notebook_{01,02}.py`, sharing `scripts/_release_notebook_common.py`) emit deterministic byte-for-byte notebook JSON via explicit `cell_NNN` IDs (audit-artifact-sync pattern from PR 4.1 / 5.1 / 5.2, locked in by `tests/scripts/test_release_notebook_builders.py` which builds twice into `tmp_path` via the new `--out PATH` flag and diffs against the committed file without ever touching the working tree) and shell out to `ruff format` on the emitted file so builder output and pre-commit hook agree. Net: 1250/1250 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5. 
+- [ ] `release/notebooks/{03_leakage_and_time_windows,04_lift_calibration_value_ranking}.ipynb` - [ ] `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml` - [ ] `docs/release/{break_me_guide,v2_decision_log}.md` diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 323f858..59f11f1 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -18,7 +18,11 @@ jobs: - uses: actions/setup-python@v5 with: python-version: "3.12" - - run: pip install ruff + # Pin ruff so notebook .ipynb formatting (which the lint job is + # strict about) doesn't drift between contributor laptops and CI + # the moment a new ruff release ships. Bump in lock-step with + # any local re-runs of ``scripts/build_release_notebook_*.py``. + - run: pip install 'ruff==0.15.12' - run: ruff check . - run: ruff format --check . @@ -138,3 +142,22 @@ jobs: - name: Skip v7 (no dataset) if: steps.check-v7.outputs.found != 'true' run: echo "No v7 datasets found — skipping v7 validation" + + notebooks: + name: Execute release notebooks (G13.1) + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-python@v5 + with: + python-version: "3.12" + - run: pip install -e ".[dev,scripts,notebooks]" + - name: Register python3 kernelspec for nbclient + run: python -m ipykernel install --user --name python3 + - name: Build the intermediate public bundle (only tier the notebooks need) + run: python scripts/build_public_release.py release --tier intermediate + - name: Execute release notebooks end-to-end + builder byte-stability + run: | + pytest tests/release/notebooks/test_execute_notebooks.py \ + tests/scripts/test_release_notebook_builders.py \ + -v diff --git a/pyproject.toml b/pyproject.toml index b7f1de8..33d2b67 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -54,6 +54,25 @@ publish = [ "datasets>=2.14", "kaggle>=1.6", ] +# Optional dependencies for executing the public release notebooks. 
+# Installing this extra (``pip install -e ".[notebooks]"``) enables the +# G13.1 acceptance gate: the CI ``notebooks`` job nbclient-executes +# ``release/notebooks/*.ipynb`` end-to-end against a freshly built +# public bundle, asserting the notebooks reproduce validation_report.md +# metrics within per-metric tolerances (G13.2) and never load instructor artefacts +# (G13.3). +notebooks = [ + "nbclient>=0.10", + "nbformat>=5.10", + # ``ipykernel`` provides the ``python3`` kernelspec that + # ``nbclient.NotebookClient(..., kernel_name="python3")`` looks up. + # Without it CI fails with ``NoSuchKernel: No such kernel named + # python3`` because the GitHub-hosted runner has no kernelspecs + # registered out of the box (local dev environments usually do). + "ipykernel>=6.0", + "scikit-learn>=1.3", + "matplotlib>=3.7", +] [project.scripts] leadforge = "leadforge.cli.main:app" @@ -74,6 +93,16 @@ select = ["E", "F", "I", "N", "W", "UP", "B", "C4", "PT", "S"] [tool.ruff.lint.per-file-ignores] "tests/**/*" = ["S101", "S108"] +# Release notebooks deliberately use ``assert`` for contract checks +# (path discipline, snapshot-safe joins, G13.2 tolerance gate). ``assert`` +# is the conventional notebook idiom and these are exactly the cells we +# want to fail loud on regression. +"release/notebooks/**/*.ipynb" = ["S101"] +# Notebook builders emit dedented heredocs whose lines render as +# markdown tables and print-statement output inside the notebook. +# Line length is a property of the rendered cell, not the .py source, +# so 100c is the wrong yardstick here. 
+"scripts/build_release_notebook_*.py" = ["E501"] [tool.mypy] python_version = "3.11" diff --git a/release/notebooks/01_baseline_lead_scoring.ipynb b/release/notebooks/01_baseline_lead_scoring.ipynb index 8861584..e6d6b34 100644 --- a/release/notebooks/01_baseline_lead_scoring.ipynb +++ b/release/notebooks/01_baseline_lead_scoring.ipynb @@ -2,354 +2,355 @@ "cells": [ { "cell_type": "markdown", + "id": "cell_000", "metadata": {}, - "source": [ - "# Baseline Lead Scoring Models\n", - "\n", - "This notebook trains baseline models on the **LeadForge B2B Lead Scoring** dataset.\n", - "It works directly from the pre-generated Parquet files -- no `leadforge` installation required.\n", - "\n", - "We'll cover:\n", - "1. Loading the task splits\n", - "2. Exploring the features\n", - "3. Training Logistic Regression and Gradient Boosting baselines\n", - "4. Evaluating with AUC, PR-AUC, and Precision@K\n", - "5. Value-aware ranking (probability vs. expected value)\n", - "6. Feature importance\n", - "\n", - "**Requirements:** `pandas`, `scikit-learn`, `matplotlib` (all available in Kaggle notebooks by default)." - ] + "source": "# Notebook 01 — Baseline Lead Scoring\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier (the\nrelease default).\n\n**Goal:** train Logistic Regression and Histogram Gradient Boosting\nbaselines on the snapshot-safe public bundle, and verify they\nreproduce the cross-seed-median metrics in\n[`release/validation/validation_report.md`](../validation/validation_report.md)\nwithin the per-metric tolerances fixed by acceptance gate **G13.2**.\n\n**Public path discipline (G13.3).** This notebook reads only from\nthe public bundle at `release/intermediate/`. The instructor\ncompanion (`release/intermediate_instructor/`, with full-horizon\nevent tables, the latent registry, the hidden DAG, and the\nmechanism summary) is **not** loaded — public modelling work must\nnever depend on instructor-only artefacts." 
}, { "cell_type": "markdown", + "id": "cell_001", "metadata": {}, - "source": [ - "## 1. Load the data\n", - "\n", - "We use the `intermediate` difficulty tier. Change the path to `intro/` or `advanced/` to try other tiers." - ] + "source": "## 1. Setup" }, { "cell_type": "code", "execution_count": null, + "id": "cell_002", "metadata": {}, "outputs": [], "source": [ + "from __future__ import annotations\n", + "\n", + "import json\n", + "import sys\n", + "from pathlib import Path\n", + "\n", "import numpy as np\n", "import pandas as pd\n", + "from sklearn.compose import ColumnTransformer\n", + "from sklearn.ensemble import HistGradientBoostingClassifier\n", + "from sklearn.impute import SimpleImputer\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.metrics import (\n", + " average_precision_score,\n", + " brier_score_loss,\n", + " roc_auc_score,\n", + ")\n", + "from sklearn.pipeline import Pipeline\n", + "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", "\n", - "# Adjust this path to your dataset location\n", - "BUNDLE = \"../intermediate\"\n", - "TASK = \"converted_within_90_days\"\n", - "\n", - "train = pd.read_parquet(f\"{BUNDLE}/tasks/{TASK}/train.parquet\")\n", - "valid = pd.read_parquet(f\"{BUNDLE}/tasks/{TASK}/valid.parquet\")\n", - "test = pd.read_parquet(f\"{BUNDLE}/tasks/{TASK}/test.parquet\")\n", + "sys.path.insert(0, str(Path.cwd()))\n", + "from _notebook_utils import (\n", + " assert_within_tolerance,\n", + " precision_at_k,\n", + " top_decile_rate,\n", + ")\n", "\n", - "print(f\"Train: {len(train):,} rows\")\n", - "print(f\"Valid: {len(valid):,} rows\")\n", - "print(f\"Test: {len(test):,} rows\")\n", - "print(\"\\nConversion rates:\")\n", - "for name, df in [(\"train\", train), (\"valid\", valid), (\"test\", test)]:\n", - " print(f\" {name}: {df[TASK].mean():.1%}\")" + "SEED = 42\n", + "BUNDLE = Path(\"../intermediate\") # public student bundle\n", + "TASK = \"converted_within_90_days\"" ] }, + { + 
"cell_type": "markdown", + "id": "cell_003", + "metadata": {}, + "source": "## 2. Reproduction targets\n\nWe pin the cross-seed-median metrics for the *intermediate* tier\n(seeds 42–46) from `release/validation/validation_report.json`.\nThe targets live in a sibling file\n(`release/notebooks/_release_targets.json`) so they can't drift\nfrom the validation report without an audit-sync test failure\nin CI.\n\n**Per-metric tolerances** are tighter than a flat 5 % band: the\ncross-seed standard deviation in the report is well under 0.02\non AUC and Brier, and a flat ±0.05 would let a regression slip\nthrough. Average-precision and the small-`k` `top_decile_rate`\nstay at ±0.05 because their seed-to-seed variance is larger." + }, { "cell_type": "code", "execution_count": null, + "id": "cell_004", "metadata": {}, "outputs": [], "source": [ - "# Feature dictionary\n", - "feat_dict = pd.read_csv(f\"{BUNDLE}/feature_dictionary.csv\")\n", - "feat_dict" + "with (Path.cwd() / \"_release_targets.json\").open() as fh:\n", + " targets = json.load(fh)[\"intermediate\"]\n", + "\n", + "# Re-key the validation report's metric names into the metric\n", + "# names this notebook prints below, so the gate compares apples\n", + "# to apples.\n", + "VALIDATION_REPORT_TARGETS = {\n", + " \"lr_auc\": targets[\"lr_auc\"],\n", + " \"gbm_auc\": targets[\"gbm_auc\"],\n", + " \"lr_average_precision\": targets[\"lr_average_precision\"],\n", + " \"lr_brier\": targets[\"brier_score\"],\n", + " \"lr_top_decile_rate\": targets[\"top_decile_rate\"],\n", + "}\n", + "TOLERANCES = {\n", + " \"lr_auc\": 0.02, # G13.2 — tighter than a flat 5%\n", + " \"gbm_auc\": 0.02,\n", + " \"lr_average_precision\": 0.05, # higher seed variance\n", + " \"lr_brier\": 0.02,\n", + " \"lr_top_decile_rate\": 0.05, # small-k variance\n", + "}\n", + "for k, v in VALIDATION_REPORT_TARGETS.items():\n", + " print(f\" target {k:<24s} {v:.4f} (tol ±{TOLERANCES[k]:.2f})\")" ] }, { "cell_type": "markdown", + "id": "cell_005", 
"metadata": {}, - "source": [ - "## 2. Explore the features" - ] + "source": "## 3. Load the bundle\n\nWe load the parquet task splits — the canonical format the\nrelease ships in. The accompanying `lead_scoring.csv` is a\nconvenience export with the same rows but coerced dtypes;\nsticking with parquet preserves nullable `Int64` / `Float64` /\n`boolean` columns the way the validator sees them." }, { "cell_type": "code", "execution_count": null, + "id": "cell_006", "metadata": {}, "outputs": [], "source": [ - "# Identify feature types\n", - "TARGET = TASK\n", - "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n", - "LEAKAGE_COLS = [c for c in train.columns if feat_dict[feat_dict[\"name\"] == c][\"leakage_risk\"].any()]\n", + "train = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"train.parquet\")\n", + "valid = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"valid.parquet\")\n", + "test = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"test.parquet\")\n", "\n", - "print(f\"Leakage-flagged columns (excluded): {LEAKAGE_COLS}\")\n", + "with (BUNDLE / \"manifest.json\").open() as fh:\n", + " manifest = json.load(fh)\n", "\n", - "feature_cols = [c for c in train.columns if c not in ID_COLS + [TARGET] + LEAKAGE_COLS]\n", - "cat_cols = [c for c in feature_cols if train[c].dtype == \"string\" or train[c].dtype == \"object\"]\n", - "bool_cols = [c for c in feature_cols if train[c].dtype == \"boolean\"]\n", - "num_cols = [c for c in feature_cols if c not in cat_cols + bool_cols]\n", + "assert manifest[\"exposure_mode\"] == \"student_public\", \"this notebook expects the public bundle\"\n", + "assert manifest[\"relational_snapshot_safe\"] is True\n", "\n", - "print(f\"\\nCategorical: {len(cat_cols)} -- {cat_cols}\")\n", - "print(f\"Boolean: {len(bool_cols)} -- {bool_cols}\")\n", - "print(f\"Numeric: {len(num_cols)} -- {num_cols}\")" + "print(f\"Train: {len(train):,} rows\")\n", + "print(f\"Valid: {len(valid):,} rows (held out — not used here)\")\n", + 
"print(f\"Test: {len(test):,} rows\")\n", + "print()\n", + "print(f\"Bundle exposure_mode: {manifest['exposure_mode']}\")\n", + "print(f\"Bundle snapshot_day: {manifest['snapshot_day']}\")\n", + "print(f\"Bundle horizon_days: {manifest['horizon_days']}\")\n", + "print()\n", + "print(\"Conversion rates:\")\n", + "for name, df in [(\"train\", train), (\"valid\", valid), (\"test\", test)]:\n", + " print(f\" {name}: {df[TASK].mean():.1%}\")" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", + "id": "cell_007", "metadata": {}, - "outputs": [], - "source": [ - "# Missing values\n", - "missing = train[feature_cols].isnull().sum()\n", - "missing = missing[missing > 0].sort_values(ascending=False)\n", - "if len(missing) > 0:\n", - " print(\"Missing values in training set:\")\n", - " for col, count in missing.items():\n", - " print(f\" {col}: {count} ({count / len(train):.1%})\")\n", - "else:\n", - " print(\"No missing values in training set.\")" - ] + "source": "## 4. Feature selection\n\nWe use the **same feature set as `release/validation/validation_report.json`**\nso the gate in section 7 is a real reproduction check rather\nthan a related-but-different number. That means we drop only\nthe IDs and the label — every other column in `train` (including\n`total_touches_all`, the documented leakage trap) goes into the\npipeline.\n\n**About `total_touches_all`.** The feature dictionary flags it\nwith `leakage_risk = True`: it counts touches over the full\n90-day horizon, which is post-snapshot data. The validation\nreport keeps it in the panel anyway because (a) its standalone\nAUC is barely above 0.55 (see the *post_snapshot_aggregates*\nbaseline column in the report) and (b) the report exists to\nmeasure the v1 dataset's *as-shipped* difficulty, leakage trap\nincluded. **Notebook 03** *(coming in PR 6.2)* walks through\nwhat dropping the trap does to performance and how to detect\nsimilar traps from feature audits alone." 
}, { "cell_type": "code", "execution_count": null, + "id": "cell_008", "metadata": {}, "outputs": [], "source": [ - "# Summary statistics for numeric features\n", - "train[num_cols].describe().T" + "feat_dict = pd.read_csv(BUNDLE / \"feature_dictionary.csv\")\n", + "trap_cols = feat_dict.loc[feat_dict[\"leakage_risk\"].astype(bool), \"name\"].tolist()\n", + "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n", + "# Mirrors ``release_quality._partition_columns`` — IDs + label only.\n", + "EXCLUDE = set(ID_COLS + [TASK])\n", + "\n", + "feature_cols = [c for c in train.columns if c not in EXCLUDE]\n", + "cat_cols = [\n", + " c\n", + " for c in feature_cols\n", + " if not (pd.api.types.is_bool_dtype(train[c]) or pd.api.types.is_numeric_dtype(train[c]))\n", + "]\n", + "num_cols = [c for c in feature_cols if c not in cat_cols]\n", + "\n", + "print(f\"Leakage-trap columns kept (see narrative above): {trap_cols}\")\n", + "print(f\"Categorical features ({len(cat_cols)}): {cat_cols}\")\n", + "print(f\"Numeric features ({len(num_cols)}): {num_cols}\")" ] }, { "cell_type": "markdown", + "id": "cell_009", "metadata": {}, - "source": [ - "## 3. Build preprocessing pipeline" - ] + "source": "## 5. 
Preprocessing pipeline\n\nMirrors `leadforge.validation.release_quality._build_pipeline`\nso the notebook's metric panel and the validation report's\nmetric panel agree by construction:\n\n- numeric: median-impute, then `StandardScaler`\n- categorical: most-frequent-impute, then dense `OneHotEncoder`\n with `handle_unknown=\"ignore\"`" }, { "cell_type": "code", "execution_count": null, + "id": "cell_010", "metadata": {}, "outputs": [], "source": [ - "from sklearn.compose import ColumnTransformer\n", - "from sklearn.impute import SimpleImputer\n", - "from sklearn.pipeline import Pipeline\n", - "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", - "\n", - "\n", - "# Convert boolean columns to int for sklearn\n", - "def prep_df(df):\n", - " out = df[feature_cols].copy()\n", - " for c in bool_cols:\n", - " out[c] = out[c].astype(\"Int64\")\n", + "def _sanitize_categoricals(df: pd.DataFrame) -> pd.DataFrame:\n", + " out = df.copy()\n", + " for c in cat_cols:\n", + " out[c] = out[c].astype(object).where(out[c].notna(), None)\n", " return out\n", "\n", "\n", - "numeric_features = num_cols + bool_cols\n", - "categorical_features = cat_cols\n", + "x_train = _sanitize_categoricals(train[feature_cols])\n", + "x_test = _sanitize_categoricals(test[feature_cols])\n", + "y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", + "y_test = test[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", "\n", - "preprocessor = ColumnTransformer(\n", - " transformers=[\n", - " (\n", - " \"num\",\n", - " Pipeline(\n", - " [\n", - " (\"imputer\", SimpleImputer(strategy=\"median\")),\n", - " (\"scaler\", StandardScaler()),\n", - " ]\n", - " ),\n", - " numeric_features,\n", - " ),\n", - " (\n", - " \"cat\",\n", - " Pipeline(\n", - " [\n", - " (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n", - " (\"encoder\", OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)),\n", - " ]\n", - " ),\n", - " 
categorical_features,\n", - " ),\n", + "numeric_t = Pipeline([(\"imputer\", SimpleImputer(strategy=\"median\")), (\"scaler\", StandardScaler())])\n", + "categorical_t = Pipeline(\n", + " [\n", + " (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n", + " (\"encoder\", OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)),\n", " ]\n", ")\n", - "\n", - "X_train = prep_df(train)\n", - "y_train = train[TARGET].astype(int)\n", - "X_test = prep_df(test)\n", - "y_test = test[TARGET].astype(int)\n", - "\n", - "print(f\"X_train: {X_train.shape}, X_test: {X_test.shape}\")" + "preprocessor = ColumnTransformer(\n", + " transformers=[\n", + " (\"num\", numeric_t, num_cols),\n", + " (\"cat\", categorical_t, cat_cols),\n", + " ],\n", + " remainder=\"drop\",\n", + ")" ] }, { "cell_type": "markdown", + "id": "cell_011", "metadata": {}, - "source": [ - "## 4. Train baselines and evaluate" - ] + "source": "## 6. Train baselines and score the test split" }, { "cell_type": "code", "execution_count": null, + "id": "cell_012", "metadata": {}, "outputs": [], "source": [ - "from sklearn.ensemble import GradientBoostingClassifier\n", - "from sklearn.linear_model import LogisticRegression\n", - "from sklearn.metrics import average_precision_score, roc_auc_score\n", + "lr_pipe = Pipeline(\n", + " [\n", + " (\"preprocessor\", preprocessor),\n", + " (\n", + " \"classifier\",\n", + " LogisticRegression(max_iter=1000, solver=\"lbfgs\", random_state=SEED),\n", + " ),\n", + " ]\n", + ")\n", + "gbm_pipe = Pipeline(\n", + " [\n", + " (\"preprocessor\", preprocessor),\n", + " (\"classifier\", HistGradientBoostingClassifier(random_state=SEED)),\n", + " ]\n", + ")\n", "\n", - "models = {\n", - " \"Logistic Regression\": LogisticRegression(max_iter=1000, solver=\"lbfgs\", random_state=42),\n", - " \"Gradient Boosting\": GradientBoostingClassifier(n_estimators=200, random_state=42),\n", + "lr_pipe.fit(x_train, y_train)\n", + "gbm_pipe.fit(x_train, y_train)\n", + "\n", + "lr_probs = 
lr_pipe.predict_proba(x_test)[:, 1]\n", + "gbm_probs = gbm_pipe.predict_proba(x_test)[:, 1]\n", + "\n", + "metrics = {\n", + " \"lr_auc\": float(roc_auc_score(y_test, lr_probs)),\n", + " \"gbm_auc\": float(roc_auc_score(y_test, gbm_probs)),\n", + " \"lr_average_precision\": float(average_precision_score(y_test, lr_probs)),\n", + " \"lr_brier\": float(brier_score_loss(y_test, lr_probs)),\n", + " \"lr_top_decile_rate\": top_decile_rate(lr_probs, y_test),\n", + " # Print-only; not pinned (the validation report tracks\n", + " # ``top_decile_rate`` instead, which we gate above).\n", + " \"lr_precision_at_50\": precision_at_k(lr_probs, y_test, 50),\n", + " \"lr_precision_at_100\": precision_at_k(lr_probs, y_test, 100),\n", + " \"lr_precision_at_200\": precision_at_k(lr_probs, y_test, 200),\n", "}\n", - "\n", - "results = []\n", - "fitted_models = {}\n", - "\n", - "for name, model in models.items():\n", - " pipe = Pipeline([(\"preprocess\", preprocessor), (\"model\", model)])\n", - " pipe.fit(X_train, y_train)\n", - " y_prob = pipe.predict_proba(X_test)[:, 1]\n", - "\n", - " auc = roc_auc_score(y_test, y_prob)\n", - " pr_auc = average_precision_score(y_test, y_prob)\n", - "\n", - " # Precision@K\n", - " for k in [25, 50, 100]:\n", - " top_k_idx = np.argsort(-y_prob)[:k]\n", - " p_at_k = y_test.iloc[top_k_idx].mean()\n", - " base_rate = y_test.mean()\n", - " lift = p_at_k / base_rate\n", - " results.append(\n", - " {\n", - " \"Model\": name,\n", - " \"Metric\": f\"P@{k}\",\n", - " \"Value\": f\"{p_at_k:.3f}\",\n", - " \"Lift\": f\"{lift:.2f}x\",\n", - " }\n", - " )\n", - "\n", - " results.append({\"Model\": name, \"Metric\": \"ROC-AUC\", \"Value\": f\"{auc:.3f}\", \"Lift\": \"\"})\n", - " results.append({\"Model\": name, \"Metric\": \"PR-AUC\", \"Value\": f\"{pr_auc:.3f}\", \"Lift\": \"\"})\n", - " fitted_models[name] = pipe\n", - " print(f\"{name}: AUC={auc:.3f}, PR-AUC={pr_auc:.3f}\")\n", - "\n", - "pd.DataFrame(results)" + "for k, v in metrics.items():\n", + " print(f\" 
{k:<24s} {v:.4f}\")" ] }, { "cell_type": "markdown", + "id": "cell_013", "metadata": {}, - "source": [ - "## 5. Value-aware ranking\n", - "\n", - "When deals have different sizes (`expected_acv`), ranking by probability alone leaves money on the table.\n", - "Ranking by expected value (P(convert) x ACV) captures more revenue in the top-K." - ] + "source": "## 7. Tolerance check (G13.2)\n\nThe notebook's printed metrics must match the cross-seed medians\nin `validation_report.json` to within the per-metric tolerances\ndeclared in section 2. If a future change breaks this, the\nassertion below fails — and CI catches it, because the same\ncell runs under `nbclient` in the `notebooks` job." }, { "cell_type": "code", "execution_count": null, + "id": "cell_014", "metadata": {}, "outputs": [], "source": [ - "if \"expected_acv\" in test.columns:\n", - " best_pipe = fitted_models[\"Gradient Boosting\"]\n", - " y_prob = best_pipe.predict_proba(X_test)[:, 1]\n", - " acv = test[\"expected_acv\"].fillna(test[\"expected_acv\"].median()).values\n", - " ev = y_prob * acv\n", - "\n", - " true_acv = test[\"expected_acv\"].fillna(0).values\n", - " converted = y_test.values.astype(bool)\n", - "\n", - " for k in [25, 50, 100]:\n", - " # Probability ranking\n", - " prob_top_k = np.argsort(-y_prob)[:k]\n", - " prob_acv = true_acv[prob_top_k][converted[prob_top_k]].sum()\n", - "\n", - " # EV ranking\n", - " ev_top_k = np.argsort(-ev)[:k]\n", - " ev_acv = true_acv[ev_top_k][converted[ev_top_k]].sum()\n", - "\n", - " uplift = (ev_acv - prob_acv) / prob_acv * 100 if prob_acv > 0 else 0\n", - " print(\n", - " f\"K={k}: Prob ranking ${prob_acv:,.0f} | \"\n", - " f\"EV ranking ${ev_acv:,.0f} | Uplift: {uplift:+.1f}%\"\n", - " )\n", - "else:\n", - " print(\"No expected_acv column found.\")" + "assert_within_tolerance(\n", + " observed=metrics,\n", + " target=VALIDATION_REPORT_TARGETS,\n", + " tolerances=TOLERANCES,\n", + " label=\"notebook 01 vs validation_report.json (intermediate tier)\",\n", 
+ ")\n", + "print(\"OK — all gated metrics are within their per-metric tolerance.\")" ] }, { "cell_type": "markdown", + "id": "cell_015", "metadata": {}, - "source": [ - "## 6. Feature importance (Gradient Boosting)" - ] + "source": "## 8. Decile lift chart\n\nStandard sanity-check for ranking quality: sort the test set by\nscore, bucket into deciles, plot the per-decile conversion rate\nvs the base rate." }, { "cell_type": "code", "execution_count": null, + "id": "cell_016", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", - "gbm_pipe = fitted_models[\"Gradient Boosting\"]\n", - "gbm_model = gbm_pipe.named_steps[\"model\"]\n", - "preproc = gbm_pipe.named_steps[\"preprocess\"]\n", - "\n", - "# Get feature names after encoding\n", - "num_names = numeric_features\n", - "cat_encoder = preproc.named_transformers_[\"cat\"].named_steps[\"encoder\"]\n", - "cat_names = list(cat_encoder.get_feature_names_out(categorical_features))\n", - "all_names = num_names + cat_names\n", - "\n", - "importances = gbm_model.feature_importances_\n", - "feat_imp = pd.Series(importances, index=all_names).sort_values(ascending=False)\n", - "\n", - "top_n = 15\n", - "fig, ax = plt.subplots(figsize=(8, 5))\n", - "feat_imp.head(top_n).plot.barh(ax=ax)\n", - "ax.set_xlabel(\"Importance\")\n", - "ax.set_title(f\"Top {top_n} Features (Gradient Boosting)\")\n", - "ax.invert_yaxis()\n", + "order = np.argsort(-lr_probs, kind=\"stable\")\n", + "y_sorted = y_test[order]\n", + "n = len(y_test)\n", + "edges = np.linspace(0, n, 11, dtype=int)\n", + "decile_rate = np.array([y_sorted[edges[i] : edges[i + 1]].mean() for i in range(10)])\n", + "base_rate = y_test.mean()\n", + "\n", + "fig, ax = plt.subplots(figsize=(7, 4))\n", + "ax.bar(range(1, 11), decile_rate, color=\"#3b82f6\")\n", + "ax.axhline(\n", + " base_rate,\n", + " color=\"#ef4444\",\n", + " linestyle=\"--\",\n", + " label=f\"base rate ({base_rate:.1%})\",\n", + ")\n", + "ax.set_xticks(range(1, 11))\n", + 
"ax.set_xlabel(\"Score decile (1 = highest)\")\n", + "ax.set_ylabel(\"Conversion rate\")\n", + "ax.set_title(\"LR decile lift — intermediate tier (seed 42)\")\n", + "ax.legend(loc=\"upper right\")\n", "plt.tight_layout()\n", - "plt.show()\n", - "\n", - "print(f\"\\nTop {top_n} features:\")\n", - "for name, imp in feat_imp.head(top_n).items():\n", - " print(f\" {name}: {imp:.4f}\")" + "plt.show()" ] }, { "cell_type": "markdown", + "id": "cell_017", + "metadata": {}, + "source": "## 9. Calibration plot\n\nReliability diagram: bin predicted probabilities into 10 equal-\nwidth buckets, plot mean predicted vs mean observed. The\nvalidation report's reference reliability plot for the\nintermediate tier lives at\n`release/validation/figures/calibration_intermediate.png`." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_018", "metadata": {}, + "outputs": [], "source": [ - "## 7. Try other difficulty tiers\n", - "\n", - "Change `BUNDLE` at the top of this notebook to point to `intro/` or `advanced/` and re-run all cells.\n", - "You should see:\n", - "- **Intro:** Higher AUC (cleaner signal, ~2% missing values)\n", - "- **Intermediate:** Moderate AUC (~8% missing values, more noise)\n", - "- **Advanced:** Lower AUC (~18% missing values, much noisier)\n", - "\n", - "## Explore the relational tables\n", - "\n", - "The flat task splits are derived from 9 relational tables under `tables/`. You can engineer your own features:\n", - "\n", - "```python\n", - "touches = pd.read_parquet(f\"{BUNDLE}/tables/touches.parquet\")\n", - "sessions = pd.read_parquet(f\"{BUNDLE}/tables/sessions.parquet\")\n", - "# ... join, aggregate, and build features from raw events\n", - "```\n", - "\n", - "See the [leadforge README](https://github.com/leadforge-dev/leadforge) for more details." 
+ "edges = np.linspace(0.0, 1.0, 11)\n", + "mean_pred = []\n", + "mean_actual = []\n", + "for i in range(10):\n", + " lo, hi = edges[i], edges[i + 1]\n", + " mask = (lr_probs >= lo) & ((lr_probs <= hi) if i == 9 else (lr_probs < hi))\n", + " if mask.sum() == 0:\n", + " continue\n", + " mean_pred.append(lr_probs[mask].mean())\n", + " mean_actual.append(y_test[mask].mean())\n", + "\n", + "fig, ax = plt.subplots(figsize=(5, 5))\n", + "ax.plot([0, 1], [0, 1], color=\"#9ca3af\", linestyle=\"--\", label=\"perfect calibration\")\n", + "ax.plot(mean_pred, mean_actual, marker=\"o\", color=\"#3b82f6\", label=\"LR\")\n", + "ax.set_xlim(0, 1)\n", + "ax.set_ylim(0, 1)\n", + "ax.set_xlabel(\"Mean predicted probability\")\n", + "ax.set_ylabel(\"Observed conversion rate\")\n", + "ax.set_title(\"Calibration — LR, intermediate tier (seed 42)\")\n", + "ax.legend(loc=\"upper left\")\n", + "plt.tight_layout()\n", + "plt.show()" ] + }, + { + "cell_type": "markdown", + "id": "cell_019", + "metadata": {}, + "source": "## 10. Next\n\n- **Notebook 02** — engineer features by joining the snapshot-\n safe relational tables under `release/intermediate/tables/`,\n then measure the lift over the flat-CSV LR baseline above.\n- **Notebook 03** *(coming in PR 6.2)* — leakage and time-window\n walkthrough; works through what `total_touches_all` does to\n your AUC if you forget to drop it.\n- **Notebook 04** *(coming in PR 6.2)* — value-aware ranking\n (`expected_acv` × P(convert)), threshold selection, and the\n cohort-shift stress test." 
} ], "metadata": { @@ -360,9 +361,9 @@ }, "language_info": { "name": "python", - "version": "3.11.0" + "version": "3.11" } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 } diff --git a/release/notebooks/02_relational_feature_engineering.ipynb b/release/notebooks/02_relational_feature_engineering.ipynb new file mode 100644 index 0000000..84a3b02 --- /dev/null +++ b/release/notebooks/02_relational_feature_engineering.ipynb @@ -0,0 +1,541 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cell_000", + "metadata": {}, + "source": "# Notebook 02 — Relational Feature Engineering\n\n**Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier.\n\nThe flat task split in notebook 01 is one snapshot view of a\nricher relational world. The public bundle ships seven\n**snapshot-safe** tables under `release/intermediate/tables/`:\nevery event row is filtered to `timestamp <= lead_created_at +\nsnapshot_day`, so any join you can write is **leakage-safe by\nconstruction**.\n\nThis notebook walks through:\n\n1. Loading the seven public tables.\n2. Verifying the snapshot-safe contract inline (the teachable\n moment — see the contract pass, don't just read about it).\n3. Engineering four relational features, with train-only\n discipline for any aggregation that crosses splits.\n4. Training a HistGBM on `flat ∪ engineered` columns.\n5. Reporting the AUC / AP / Brier / P@K delta vs the flat-CSV\n baseline, with a tolerance gate that fails CI if the\n headline lift regresses.\n\n**Public path discipline (G13.3).** This notebook reads only\nfrom `release/intermediate/`. The instructor companion\n(`release/intermediate_instructor/`) is **not** loaded —\nrelational feature engineering must work from the public\nartefact alone. 
Tables omitted from the public bundle on\npurpose (`customers`, `subscriptions`) live only in the\ninstructor companion because their mere existence reconstructs\nthe label.\n\n**Leakage-trap discipline.** Unlike notebook 01 (which\nreproduces the validation report's panel verbatim and\ntherefore keeps `total_touches_all`), notebook 02 **drops**\nthe trap from the flat baseline. We're teaching feature\nengineering here; mixing a known-leaky column into the\n\"before\" panel would muddy the relational lift attribution." + }, + { + "cell_type": "markdown", + "id": "cell_001", + "metadata": {}, + "source": "## 1. Setup" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_002", + "metadata": {}, + "outputs": [], + "source": [ + "from __future__ import annotations\n", + "\n", + "import json\n", + "import sys\n", + "from pathlib import Path\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn.compose import ColumnTransformer\n", + "from sklearn.ensemble import HistGradientBoostingClassifier\n", + "from sklearn.impute import SimpleImputer\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.metrics import (\n", + " average_precision_score,\n", + " brier_score_loss,\n", + " roc_auc_score,\n", + ")\n", + "from sklearn.pipeline import Pipeline\n", + "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", + "\n", + "sys.path.insert(0, str(Path.cwd()))\n", + "from _notebook_utils import assert_within_tolerance, precision_at_k\n", + "\n", + "SEED = 42\n", + "BUNDLE = Path(\"../intermediate\") # public student bundle\n", + "TASK = \"converted_within_90_days\"\n", + "\n", + "with (BUNDLE / \"manifest.json\").open() as fh:\n", + " manifest = json.load(fh)\n", + "assert manifest[\"exposure_mode\"] == \"student_public\"\n", + "assert manifest[\"relational_snapshot_safe\"] is True\n", + "SNAPSHOT_DAY = int(manifest[\"snapshot_day\"])\n", + "print(f\"snapshot_day = {SNAPSHOT_DAY}\")" + ] + }, + { + 
"cell_type": "markdown", + "id": "cell_003", + "metadata": {}, + "source": "## 2. Load the seven public tables\n\nThese are the only tables present under `release/intermediate/\ntables/`. `customers` and `subscriptions` are deliberately\nabsent — they only exist for *converted* leads, so their\npresence in a public bundle would reconstruct the label." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_004", + "metadata": {}, + "outputs": [], + "source": [ + "PUBLIC_TABLES = (\n", + " \"accounts\",\n", + " \"contacts\",\n", + " \"leads\",\n", + " \"touches\",\n", + " \"sessions\",\n", + " \"sales_activities\",\n", + " \"opportunities\",\n", + ")\n", + "tables = {name: pd.read_parquet(BUNDLE / \"tables\" / f\"{name}.parquet\") for name in PUBLIC_TABLES}\n", + "for name in PUBLIC_TABLES:\n", + " print(f\" {name:<20s} {tables[name].shape}\")\n", + "\n", + "train = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"train.parquet\")\n", + "test = pd.read_parquet(BUNDLE / \"tasks\" / TASK / \"test.parquet\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_005", + "metadata": {}, + "source": "## 3. Verify the snapshot-safe contract\n\nFor every event-table row joined back to its lead, assert\n`timestamp <= lead_created_at + snapshot_day`. The reported\n**headroom** is the *minimum* gap any event row leaves between\nits timestamp and the snapshot cutoff — a non-negative number\nwhen the contract holds. Showing the actual minimum (rather\nthan just \"all <= cutoff\") makes the receipt honest: if a\nfuture regeneration ever shaves the contract close, you'll see\nthe headroom shrink before the assertion fires." 
+ }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_006", + "metadata": {}, + "outputs": [], + "source": [ + "leads_index = tables[\"leads\"][[\"lead_id\", \"lead_created_at\"]].copy()\n", + "leads_index[\"lead_created_at\"] = pd.to_datetime(leads_index[\"lead_created_at\"])\n", + "cutoff = leads_index.assign(cutoff=leads_index[\"lead_created_at\"] + pd.Timedelta(days=SNAPSHOT_DAY))\n", + "\n", + "EVENT_TABLES = [\n", + " (\"touches\", \"touch_timestamp\"),\n", + " (\"sessions\", \"session_timestamp\"),\n", + " (\"sales_activities\", \"activity_timestamp\"),\n", + " (\"opportunities\", \"created_at\"),\n", + "]\n", + "\n", + "for tbl, ts_col in EVENT_TABLES:\n", + " df = tables[tbl][[\"lead_id\", ts_col]].merge(\n", + " cutoff[[\"lead_id\", \"cutoff\"]], on=\"lead_id\", how=\"left\"\n", + " )\n", + " df[ts_col] = pd.to_datetime(df[ts_col])\n", + " violations = df[df[ts_col] > df[\"cutoff\"]]\n", + " assert len(violations) == 0, f\"{tbl}.{ts_col}: {len(violations)} rows past snapshot cutoff\"\n", + " min_headroom = (df[\"cutoff\"] - df[ts_col]).min()\n", + " min_headroom_days = float(min_headroom.total_seconds()) / 86400.0\n", + " print(\n", + " f\" {tbl}.{ts_col}: {len(df):>6,} rows; \"\n", + " f\"min headroom under cutoff: {min_headroom_days:6.2f} days\"\n", + " )\n", + "\n", + "print()\n", + "print(\"OK — every event row in every public event table satisfies\")\n", + "print(f\" timestamp <= lead_created_at + {SNAPSHOT_DAY} days.\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_007", + "metadata": {}, + "source": "## 4. Engineered features\n\nWe build four relational features. Each one starts as a\nper-lead aggregate over a single snapshot-safe table, then\njoins back into the per-lead snapshot. 
Aggregations that pool\nacross leads (account-level density, target encoding) are\n**fit on the train split only** and applied to test via\njoin — same train-only discipline that prevents target leakage\nin mean encoding.\n\n| # | Feature | Source table(s) | Aggregation |\n|---|---|---|---|\n| 1 | `touches_ch_*` (3 cols) | `touches` | per-lead × `touch_channel` count |\n| 2 | `account_avg_touches_per_lead` | `touches`, `leads`, train lead set | account-level rollup over **train leads only**, then merge back |\n| 3 | `days_since_last_activity` | `sales_activities`, `leads` | per-lead recency at snapshot cutoff |\n| 4 | `industry_target_encoding_train` | `accounts`, train | mean-target encoding **fit on train only** |" + }, + { + "cell_type": "markdown", + "id": "cell_008", + "metadata": {}, + "source": "### 4.1 Touch-channel breakdown" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_009", + "metadata": {}, + "outputs": [], + "source": [ + "touches = tables[\"touches\"]\n", + "channel_counts = touches.groupby([\"lead_id\", \"touch_channel\"]).size().unstack(fill_value=0)\n", + "channel_counts.columns = [f\"touches_ch_{c}\" for c in channel_counts.columns]\n", + "print(f\"channel feature columns: {list(channel_counts.columns)}\")\n", + "channel_counts.head()" + ] + }, + { + "cell_type": "markdown", + "id": "cell_010", + "metadata": {}, + "source": "### 4.2 Account-level touch density\n\nHow active is the account this lead belongs to, on average? An\naccount with many leads and many touches per lead is a\nstructurally different prospect than a one-touch account.\n\n**Train-only.** We compute the account-level rollup using only\ntrain leads' touches. Test leads whose account has no touches\nfrom train leads fall back to 0 via `fillna`. This avoids\nletting test rows influence the train feature distribution —\nsame discipline applied to mean-target encoding in 4.4." 
+ }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_011", + "metadata": {}, + "outputs": [], + "source": [ + "train_lead_ids = set(train[\"lead_id\"].tolist())\n", + "tch_with_acct = touches.merge(tables[\"leads\"][[\"lead_id\", \"account_id\"]], on=\"lead_id\", how=\"left\")\n", + "tch_train = tch_with_acct[tch_with_acct[\"lead_id\"].isin(train_lead_ids)]\n", + "account_density = (\n", + " tch_train.groupby(\"account_id\")\n", + " .agg(\n", + " account_total_touches=(\"touch_id\", \"count\"),\n", + " account_lead_count=(\"lead_id\", \"nunique\"),\n", + " )\n", + " .assign(\n", + " account_avg_touches_per_lead=lambda d: d[\"account_total_touches\"] / d[\"account_lead_count\"]\n", + " )\n", + " .reset_index()\n", + ")\n", + "print(f\"account_density rows: {len(account_density):,} (accounts represented in train touches)\")\n", + "account_density.head()" + ] + }, + { + "cell_type": "markdown", + "id": "cell_012", + "metadata": {}, + "source": "### 4.3 Sales-activity recency at snapshot\n\nDays between the lead's most recent sales activity and the\nsnapshot cutoff (`lead_created_at + snapshot_day`). Recency\nis a classic engagement signal that's surprisingly hard to\nrecover from the flat snapshot directly." 
+ }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_013", + "metadata": {}, + "outputs": [], + "source": [ + "sa = tables[\"sales_activities\"][[\"lead_id\", \"activity_timestamp\"]].copy()\n", + "sa[\"activity_timestamp\"] = pd.to_datetime(sa[\"activity_timestamp\"])\n", + "last_activity = (\n", + " sa.groupby(\"lead_id\")[\"activity_timestamp\"]\n", + " .max()\n", + " .reset_index()\n", + " .rename(columns={\"activity_timestamp\": \"last_activity_at\"})\n", + ")\n", + "last_activity = last_activity.merge(cutoff[[\"lead_id\", \"cutoff\"]], on=\"lead_id\")\n", + "last_activity[\"days_since_last_activity\"] = (\n", + " last_activity[\"cutoff\"] - last_activity[\"last_activity_at\"]\n", + ").dt.total_seconds() / 86400\n", + "last_activity[[\"lead_id\", \"days_since_last_activity\"]].head()" + ] + }, + { + "cell_type": "markdown", + "id": "cell_014", + "metadata": {}, + "source": "### 4.4 Industry target encoding (train-only, leakage-safe)\n\nReplace the `industry` string with the conversion rate observed\nfor that industry **on the training split only**. Computing\nthe encoding on test leaks the test labels into the features —\na textbook mistake; we avoid it explicitly." 
+ }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_015", + "metadata": {}, + "outputs": [], + "source": [ + "tgt_enc = train.groupby(\"industry\")[TASK].mean().to_dict()\n", + "tgt_enc_global_mean = float(train[TASK].mean())\n", + "print(f\"per-industry train conversion rates ({len(tgt_enc)} industries):\")\n", + "for k, v in sorted(tgt_enc.items()):\n", + " print(f\" {k:<20s} {v:.3f}\")\n", + "print(f\" fallback global mean: {tgt_enc_global_mean:.3f}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_016", + "metadata": {}, + "source": "### 4.5 Stitch features onto train and test" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_017", + "metadata": {}, + "outputs": [], + "source": [ + "ENGINEERED_NUMERIC = list(channel_counts.columns) + [\n", + " \"account_avg_touches_per_lead\",\n", + " \"days_since_last_activity\",\n", + " \"industry_target_encoding_train\",\n", + "]\n", + "\n", + "\n", + "def attach_engineered(df: pd.DataFrame) -> pd.DataFrame:\n", + " out = df.copy()\n", + " out = out.merge(channel_counts, on=\"lead_id\", how=\"left\")\n", + " for col in channel_counts.columns:\n", + " out[col] = out[col].fillna(0).astype(int)\n", + " out = out.merge(\n", + " account_density[[\"account_id\", \"account_avg_touches_per_lead\"]],\n", + " on=\"account_id\",\n", + " how=\"left\",\n", + " )\n", + " out[\"account_avg_touches_per_lead\"] = (\n", + " out[\"account_avg_touches_per_lead\"].fillna(0).astype(float)\n", + " )\n", + " out = out.merge(\n", + " last_activity[[\"lead_id\", \"days_since_last_activity\"]],\n", + " on=\"lead_id\",\n", + " how=\"left\",\n", + " )\n", + " out[\"industry_target_encoding_train\"] = out[\"industry\"].map(tgt_enc).fillna(tgt_enc_global_mean)\n", + " return out\n", + "\n", + "\n", + "train_eng = attach_engineered(train)\n", + "test_eng = attach_engineered(test)\n", + "print(f\"train_eng shape: {train_eng.shape} ({train.shape[1]} -> {train_eng.shape[1]} cols)\")\n", + "print(f\"new 
columns: {ENGINEERED_NUMERIC}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_018", + "metadata": {}, + "source": "## 5. Baseline + engineered models\n\nSame pipeline as notebook 01 (mirrors\n`leadforge.validation.release_quality._build_pipeline`). We\ntrain four models so the comparison is fair:\n\n| Model | Features | Compares against |\n|---|---|---|\n| LR-flat | flat snapshot, no trap | (validation report baseline) |\n| GBM-flat | flat snapshot, no trap | LR-flat |\n| LR-eng | flat ∪ engineered, no trap, no raw `industry` | LR-flat |\n| GBM-eng | flat ∪ engineered, no trap, no raw `industry` | GBM-flat — the headline lift |\n\nThe `+rel` pipelines (LR-eng and GBM-eng above) drop the raw\n`industry` categorical because the train-only target encoding\nalready represents it as a numeric column; feeding both would\nhand the model the same information twice." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_019", + "metadata": {}, + "outputs": [], + "source": [ + "feat_dict = pd.read_csv(BUNDLE / \"feature_dictionary.csv\")\n", + "trap_cols = feat_dict.loc[feat_dict[\"leakage_risk\"].astype(bool), \"name\"].tolist()\n", + "ID_COLS = [\"account_id\", \"contact_id\", \"lead_id\", \"lead_created_at\"]\n", + "EXCLUDE = set(ID_COLS + trap_cols + [TASK])\n", + "\n", + "base_cols = [c for c in train.columns if c not in EXCLUDE]\n", + "cat_base = [\n", + " c\n", + " for c in base_cols\n", + " if not (pd.api.types.is_bool_dtype(train[c]) or pd.api.types.is_numeric_dtype(train[c]))\n", + "]\n", + "num_base = [c for c in base_cols if c not in cat_base]\n", + "# Drop raw `industry` from the +rel categorical list so the LR\n", + "# pipeline doesn't see it twice (one-hot + target-encoded).\n", + "cat_eng = [c for c in cat_base if c != \"industry\"]\n", + "print(f\"flat features: {len(base_cols)} (numeric={len(num_base)}, categorical={len(cat_base)})\")\n", + "print(f\"engineered (numeric only): {len(ENGINEERED_NUMERIC)}\")\n", + "print(f\"+rel categorical list drops: {set(cat_base) - 
set(cat_eng)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_020", + "metadata": {}, + "outputs": [], + "source": [ + "def _sanitize(df: pd.DataFrame, cat_cols: list[str]) -> pd.DataFrame:\n", + " out = df.copy()\n", + " for c in cat_cols:\n", + " out[c] = out[c].astype(object).where(out[c].notna(), None)\n", + " return out\n", + "\n", + "\n", + "def build_pipeline(num_cols: list[str], cat_cols: list[str], *, model: str) -> Pipeline:\n", + " numeric_t = Pipeline(\n", + " [\n", + " (\"imputer\", SimpleImputer(strategy=\"median\")),\n", + " (\"scaler\", StandardScaler()),\n", + " ]\n", + " )\n", + " cat_t = Pipeline(\n", + " [\n", + " (\"imputer\", SimpleImputer(strategy=\"most_frequent\")),\n", + " (\n", + " \"encoder\",\n", + " OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False),\n", + " ),\n", + " ]\n", + " )\n", + " pre = ColumnTransformer(\n", + " [(\"num\", numeric_t, num_cols), (\"cat\", cat_t, cat_cols)],\n", + " remainder=\"drop\",\n", + " )\n", + " if model == \"lr\":\n", + " clf = LogisticRegression(max_iter=1000, solver=\"lbfgs\", random_state=SEED)\n", + " else:\n", + " clf = HistGradientBoostingClassifier(random_state=SEED)\n", + " return Pipeline([(\"preprocessor\", pre), (\"classifier\", clf)])\n", + "\n", + "\n", + "def fit_and_score(\n", + " x_train_df: pd.DataFrame,\n", + " x_test_df: pd.DataFrame,\n", + " num_cols: list[str],\n", + " cat_cols: list[str],\n", + " *,\n", + " model: str,\n", + ") -> np.ndarray:\n", + " pipe = build_pipeline(num_cols, cat_cols, model=model)\n", + " pipe.fit(_sanitize(x_train_df, cat_cols), y_train)\n", + " return pipe.predict_proba(_sanitize(x_test_df, cat_cols))[:, 1]\n", + "\n", + "\n", + "y_train = train[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", + "y_test = test[TASK].astype(\"boolean\").fillna(False).astype(int).to_numpy()\n", + "base_rate = float(y_test.mean())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_021", + 
"metadata": {}, + "outputs": [], + "source": [ + "num_eng = num_base + ENGINEERED_NUMERIC\n", + "\n", + "probs_lr_flat = fit_and_score(train[base_cols], test[base_cols], num_base, cat_base, model=\"lr\")\n", + "probs_gbm_flat = fit_and_score(train[base_cols], test[base_cols], num_base, cat_base, model=\"gbm\")\n", + "probs_lr_eng = fit_and_score(\n", + " train_eng[base_cols + ENGINEERED_NUMERIC],\n", + " test_eng[base_cols + ENGINEERED_NUMERIC],\n", + " num_eng,\n", + " cat_eng,\n", + " model=\"lr\",\n", + ")\n", + "probs_gbm_eng = fit_and_score(\n", + " train_eng[base_cols + ENGINEERED_NUMERIC],\n", + " test_eng[base_cols + ENGINEERED_NUMERIC],\n", + " num_eng,\n", + " cat_eng,\n", + " model=\"gbm\",\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_022", + "metadata": {}, + "outputs": [], + "source": [ + "def panel(probs: np.ndarray, label: str) -> dict[str, float]:\n", + " return {\n", + " \"label\": label,\n", + " \"auc\": float(roc_auc_score(y_test, probs)),\n", + " \"ap\": float(average_precision_score(y_test, probs)),\n", + " \"brier\": float(brier_score_loss(y_test, probs)),\n", + " \"p@100\": precision_at_k(probs, y_test, 100),\n", + " }\n", + "\n", + "\n", + "rows = [\n", + " panel(probs_lr_flat, \"LR flat \"),\n", + " panel(probs_gbm_flat, \"GBM flat \"),\n", + " panel(probs_lr_eng, \"LR flat+rel \"),\n", + " panel(probs_gbm_eng, \"GBM flat+rel \"),\n", + "]\n", + "print(f\"base rate: {base_rate:.3f}\")\n", + "print()\n", + "print(f\"{'model':<18s} {'AUC':>7s} {'AP':>7s} {'Brier':>7s} {'P@100':>7s}\")\n", + "for r in rows:\n", + " print(f\"{r['label']:<18s} {r['auc']:.4f} {r['ap']:.4f} {r['brier']:.4f} {r['p@100']:.4f}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_023", + "metadata": {}, + "source": "## 6. Lift over the flat baseline\n\n*Sign convention.* `ΔAUC`, `ΔAP`, `ΔP@100`: **higher is\nbetter** (positive = engineered features helped). 
`ΔBrier`:\n**lower is better** (Brier is a loss, so negative ΔBrier =\nimproved calibration)." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_024", + "metadata": {}, + "outputs": [], + "source": [ + "def delta(eng: np.ndarray, base: np.ndarray, name: str) -> dict[str, float]:\n", + " return {\n", + " \"label\": name,\n", + " \"auc\": float(roc_auc_score(y_test, eng) - roc_auc_score(y_test, base)),\n", + " \"ap\": float(average_precision_score(y_test, eng) - average_precision_score(y_test, base)),\n", + " \"brier\": float(brier_score_loss(y_test, eng) - brier_score_loss(y_test, base)),\n", + " \"p@100\": float(precision_at_k(eng, y_test, 100) - precision_at_k(base, y_test, 100)),\n", + " }\n", + "\n", + "\n", + "deltas = [\n", + " delta(probs_gbm_eng, probs_gbm_flat, \"GBM(eng) - GBM(flat)\"),\n", + " delta(probs_gbm_eng, probs_lr_flat, \"GBM(eng) - LR(flat) \"),\n", + " delta(probs_lr_eng, probs_lr_flat, \"LR(eng) - LR(flat) \"),\n", + "]\n", + "print(f\"{'comparison':<22s} {'ΔAUC':>8s} {'ΔAP':>8s} {'ΔBrier':>8s} {'ΔP@100':>8s}\")\n", + "for d in deltas:\n", + " print(\n", + " f\"{d['label']:<22s} {d['auc']:+8.4f} {d['ap']:+8.4f} \"\n", + " f\"{d['brier']:+8.4f} {d['p@100']:+8.4f}\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "cell_025", + "metadata": {}, + "source": "## 7. Tolerance gate\n\nPin the four model AUCs and the headline lift to per-metric\ntolerances so a regression in any pipeline component (feature\nengineering, leakage discipline, sklearn upgrade) breaks CI.\nTargets and tolerances are seed-42-specific by design — this\nnotebook is reproducible-by-seed, not a cross-seed median." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell_026", + "metadata": {}, + "outputs": [], + "source": [ + "# Single-seed (seed=42) AUCs observed on the as-shipped\n", + "# intermediate bundle. 
Tolerances allow ±0.02 around each\n", + "# baseline (well outside numerical jitter, well inside the\n", + "# band that would let GBM(eng) silently drop below GBM(flat)).\n", + "NB02_TARGETS = {\n", + " \"lr_flat_auc\": 0.8737,\n", + " \"gbm_flat_auc\": 0.8432,\n", + " \"lr_eng_auc\": 0.8763,\n", + " \"gbm_eng_auc\": 0.8579,\n", + " \"headline_lift_auc\": 0.0147, # GBM(eng) - GBM(flat)\n", + "}\n", + "NB02_TOLERANCES = {\n", + " \"lr_flat_auc\": 0.02,\n", + " \"gbm_flat_auc\": 0.02,\n", + " \"lr_eng_auc\": 0.02,\n", + " \"gbm_eng_auc\": 0.02,\n", + " \"headline_lift_auc\": 0.015, # tighter — sign-aware below\n", + "}\n", + "\n", + "observed = {\n", + " \"lr_flat_auc\": rows[0][\"auc\"],\n", + " \"gbm_flat_auc\": rows[1][\"auc\"],\n", + " \"lr_eng_auc\": rows[2][\"auc\"],\n", + " \"gbm_eng_auc\": rows[3][\"auc\"],\n", + " \"headline_lift_auc\": deltas[0][\"auc\"],\n", + "}\n", + "assert_within_tolerance(\n", + " observed=observed,\n", + " target=NB02_TARGETS,\n", + " tolerances=NB02_TOLERANCES,\n", + " label=\"notebook 02 metric panel (seed 42, intermediate)\",\n", + ")\n", + "assert observed[\"headline_lift_auc\"] > 0.0, (\n", + " \"GBM(eng) − GBM(flat) AUC went non-positive — relational \"\n", + " \"lift disappeared; investigate before merging.\"\n", + ")\n", + "print(\"OK — all panel metrics within tolerance and headline lift is positive.\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell_027", + "metadata": {}, + "source": "## 8. Honest takeaway\n\nOn seed 42 the GBM(eng) − GBM(flat) AUC lift is small\n(+0.0147). Cross-seed variance for `gbm_auc` on this bundle\nis ~0.027 (see `release/validation/validation_report.json`,\n`tiers.intermediate.spreads.gbm_auc`), so a single-seed lift\nof this size is **suggestive, not conclusive**. 
Confirming a\nreal signal needs a seed sweep — see the cohort-shift / seed\nharness coming in PR 6.2's notebook 04.\n\nThe lift also does **not** flip the sign of the GBM-vs-LR\ncomparison: GBM(eng) is still slightly below LR(flat). This\nis the same v1 finding documented in\n`release/validation/validation_report.md` (gate **G7.4.4**)\nand the dataset card: the v1 snapshot is dominated by\nroughly-linear signal, and HistGBM doesn't consistently beat\nLR on it. Engineered relational features narrow the gap; on\nthis seed they don't yet erase it.\n\nTwo takeaways for downstream users:\n\n1. **Joins on the public bundle are leakage-safe by\n construction.** Section 3 above is the full proof. You can\n aggregate any of the four event tables without policing the\n horizon yourself.\n2. **Bring your own non-linearities.** If a feature\n engineering choice (cross-table interactions, tree\n kernels, learned embeddings, bigger seed sweeps) flips the\n GBM-vs-LR sign reliably, that's a finding worth filing —\n the *break_me_guide* template lands in PR 6.3.\n\n## Next\n\n- **Notebook 03** *(coming in PR 6.2)* — leakage and\n time-window walkthrough, including the deliberate\n `total_touches_all` trap notebook 01 keeps and this notebook\n drops.\n- **Notebook 04** *(coming in PR 6.2)* — value-aware ranking,\n calibration, and cohort-shift evaluation with a seed sweep." + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/release/notebooks/_notebook_utils.py b/release/notebooks/_notebook_utils.py new file mode 100644 index 0000000..c2baf71 --- /dev/null +++ b/release/notebooks/_notebook_utils.py @@ -0,0 +1,116 @@ +"""Shared helpers for the public release notebooks. + +The release notebooks are downloaded by Kaggle / HF consumers alongside +the parquet bundle. 
They cannot rely on ``leadforge`` being installed +in the consumer's environment, so the helpers needed inside the +notebooks live here as a small sibling module with no project imports. + +The metric helpers mirror ``leadforge.validation.release_quality`` so +notebook-side numbers and validation-report numbers compare apples to +apples — same ranking convention, same tie-breaking, same calibration +binning — and the ``assert_within_tolerance`` gate (G13.2) is meaningful. +""" + +from __future__ import annotations + +import math +from collections.abc import Mapping + +import numpy as np + + +def precision_at_k(scores: np.ndarray, y: np.ndarray, k: int) -> float: + """Mean label of the top-``k`` rows by descending score. + + Mirrors ``leadforge.validation.release_quality._precision_at_k``: + stable argsort so ties resolve identically to the validation report. + """ + scores = np.asarray(scores) + y = np.asarray(y) + if k <= 0 or k > len(y): + return float("nan") + order = np.argsort(-scores, kind="stable") + return float(y[order[:k]].mean()) + + +def top_decile_rate(scores: np.ndarray, y: np.ndarray) -> float: + """Precision at the top 10% of ranked scores.""" + n = len(y) + if n == 0: + return float("nan") + return precision_at_k(scores, y, max(1, int(round(n * 0.1)))) + + +def assert_within_tolerance( + observed: Mapping[str, float], + target: Mapping[str, float], + tolerances: Mapping[str, float] | float, + *, + label: str = "metrics", +) -> None: + """Assert ``|observed[k] - target[k]| <= tol`` for every key in ``target``. + + Backs the G13.2 acceptance gate inside the notebooks: once the + notebook has computed its own metrics it pins them against the + cross-seed-median values from ``release/validation/validation_report.md`` + and fails loudly if the notebook drifts out of band. 
+ + The gate is intentionally strict about silent-pass paths: + + * Non-finite ``observed`` or ``target`` values fail (rather than + slipping through because ``NaN > tol`` evaluates ``False``). + * When ``tolerances`` is a mapping, every key in ``target`` must + have an explicit tolerance — a missing entry is treated as a + configuration error and aborts the gate up front, instead of + defaulting to ``+inf`` and silently disabling the check for + that metric. + + Args: + observed: Notebook-computed metrics, keyed by metric name. + target: Reference values from the validation report. + tolerances: Either a per-metric tolerance map (every key in + ``target`` must be present), or a single float applied to + every metric (G13.2's default is 0.05). + label: Human-readable name for the metric panel; appears in the + error message so the failing assertion identifies its source. + + Raises: + AssertionError: when any metric falls outside its tolerance, + ``observed`` is missing a key listed in ``target``, an + ``observed`` / ``target`` value is non-finite, or + ``tolerances`` is a mapping that omits a required key. + """ + if isinstance(tolerances, int | float): + per_key: Mapping[str, float] = {k: float(tolerances) for k in target} + else: + per_key = tolerances + missing_tolerances = [k for k in target if k not in per_key] + if missing_tolerances: + raise AssertionError( + f"{label}: tolerances mapping is missing entries for " + f"target metrics: {sorted(missing_tolerances)}. " + "Falling back to +inf would silently disable the gate " + "for these metrics; declare an explicit tolerance per key." 
+ ) + failures: list[str] = [] + for key, target_value in target.items(): + if key not in observed: + failures.append(f" {key}: missing from observed metrics") + continue + observed_f = float(observed[key]) + target_f = float(target_value) + if not (math.isfinite(observed_f) and math.isfinite(target_f)): + failures.append( + f" {key}: non-finite value (observed={observed_f}, " + f"target={target_f}) — gate refuses to silently pass NaN/inf" + ) + continue + tol = float(per_key[key]) + diff = abs(observed_f - target_f) + if diff > tol: + failures.append( + f" {key}: observed={observed_f:.4f} target={target_f:.4f} " + f"|diff|={diff:.4f} > tol={tol:.4f}" + ) + if failures: + raise AssertionError(f"{label} drifted outside tolerance:\n" + "\n".join(failures)) diff --git a/release/notebooks/_release_targets.json b/release/notebooks/_release_targets.json new file mode 100644 index 0000000..7ed4cdb --- /dev/null +++ b/release/notebooks/_release_targets.json @@ -0,0 +1,10 @@ +{ + "_doc": "Cross-seed-median metric values from release/validation/validation_report.json, sliced to the metrics the release notebooks pin via assert_within_tolerance. Audited against the report by tests/release/notebooks/test_release_targets_match_report.py — if you change a value here, the test will fail unless the corresponding median in the validation report changes to match.", + "intermediate": { + "brier_score": 0.10963449613199748, + "gbm_auc": 0.875461913160326, + "lr_auc": 0.8858759553203998, + "lr_average_precision": 0.5752148545119874, + "top_decile_rate": 0.5866666666666667 + } +} diff --git a/scripts/_release_notebook_common.py b/scripts/_release_notebook_common.py new file mode 100644 index 0000000..ef25263 --- /dev/null +++ b/scripts/_release_notebook_common.py @@ -0,0 +1,101 @@ +"""Shared scaffolding for the ``scripts/build_release_notebook_*.py`` builders. 
+ +The release-notebook builders share ~80% of their plumbing: imports, +``md`` / ``code`` cell wrappers, the metadata block, and the +``write_notebook`` step that serializes JSON and shells out to +``ruff format`` so builder output matches the project's pre-commit hook +byte-for-byte. Putting them here keeps each per-notebook builder +focused on its cell list. + +Cell IDs are assigned deterministically (``cell_000``, ``cell_001``, +...) so re-running a builder produces a byte-identical ``.ipynb`` — +audit-artifact-sync, the same invariant PR 4.1 / 5.1 / 5.2 use for +``release/`` artifacts. Nondeterministic IDs are the default in +``nbformat.v4.new_*_cell``; without an explicit override every build +diverges and the byte-equality test in +``tests/scripts/test_release_notebook_builders.py`` fails. + +Each builder exposes an ``--out PATH`` flag (see ``builder_arg_parser``) +so the byte-stability test can build into ``tmp_path`` rather than +mutating the committed ``release/notebooks/.ipynb`` in place. +That removes a pytest-xdist race and the worktree-dirtying failure +mode under interrupted runs. 
+""" + +from __future__ import annotations + +import argparse +import json +import subprocess +from pathlib import Path +from textwrap import dedent + +import nbformat as nbf + +_KERNEL_METADATA = { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3", + }, + "language_info": { + "name": "python", + "version": "3.11", + }, +} + + +def md(source: str) -> nbf.NotebookNode: + """Markdown cell with ``source`` dedented and surrounding blank lines stripped.""" + return nbf.v4.new_markdown_cell(dedent(source).strip("\n")) + + +def code(source: str) -> nbf.NotebookNode: + """Cleared code cell (no execution_count, no outputs).""" + cell = nbf.v4.new_code_cell(dedent(source).strip("\n")) + cell["execution_count"] = None + cell["outputs"] = [] + return cell + + +def assemble_notebook(cells: list[nbf.NotebookNode]) -> nbf.NotebookNode: + """Build a notebook from ``cells`` with deterministic cell IDs and stable metadata.""" + nb = nbf.v4.new_notebook() + for index, cell in enumerate(cells): + cell["id"] = f"cell_{index:03d}" + nb.cells = cells + nb.metadata = _KERNEL_METADATA + return nb + + +def write_notebook(out_path: Path, nb: nbf.NotebookNode) -> None: + """Write ``nb`` to ``out_path`` then run ``ruff format`` for hook-parity. + + Without the post-write format step a contributor running the builder + would see pre-commit reformat their notebook on commit, defeating the + audit-artifact-sync invariant. ``json.dumps`` parameters are pinned + so the pre-format bytes are deterministic. + """ + out_path.parent.mkdir(parents=True, exist_ok=True) + text = json.dumps(nb, indent=1, sort_keys=True, ensure_ascii=False) + out_path.write_text(text + "\n", encoding="utf-8") + subprocess.run(["ruff", "format", str(out_path)], check=True) # noqa: S603, S607 + print(f"wrote {out_path}") + + +def builder_arg_parser(*, default_out: Path, description: str) -> argparse.ArgumentParser: + """Return the argparse parser shared by every notebook builder. 
+ + Exposes ``--out PATH`` so callers (notably the byte-stability test) + can redirect the build into a temp directory. Defaults to the + canonical committed ``release/notebooks/.ipynb`` path so the + no-arg invocation contributors run today keeps working. + """ + parser = argparse.ArgumentParser(description=description) + parser.add_argument( + "--out", + type=Path, + default=default_out, + help=f"Path to write the notebook to (default: {default_out})", + ) + return parser diff --git a/scripts/build_public_release.py b/scripts/build_public_release.py index f597d7b..5ccb152 100644 --- a/scripts/build_public_release.py +++ b/scripts/build_public_release.py @@ -125,6 +125,16 @@ def main() -> None: "Default: wall-clock now. Use this for reproducible bundles." ), ) + parser.add_argument( + "--tier", + choices=[name for name, _, _ in BUNDLES] + ["all"], + default="all", + help=( + "Build only the named tier (default: all four). " + "Useful for CI jobs that need a single bundle — e.g. the " + "release-notebooks job only needs ``intermediate``." 
+ ), + ) args = parser.parse_args() output_root = Path(args.output_dir) @@ -135,7 +145,8 @@ def main() -> None: if license_src.exists(): shutil.copy2(license_src, output_root / "LICENSE") - for dir_name, exposure_mode, difficulty in BUNDLES: + selected = BUNDLES if args.tier == "all" else [b for b in BUNDLES if b[0] == args.tier] + for dir_name, exposure_mode, difficulty in selected: bundle_dir = output_root / dir_name print(f"Generating {dir_name} ({exposure_mode}, {difficulty})...", file=sys.stderr) generate_and_save( @@ -162,7 +173,7 @@ def main() -> None: # Summary print("\n=== Release Summary ===") - for dir_name, _, _ in BUNDLES: + for dir_name, _, _ in selected: bundle_dir = output_root / dir_name print_summary(bundle_dir, dir_name) diff --git a/scripts/build_release_notebook_01.py b/scripts/build_release_notebook_01.py new file mode 100644 index 0000000..fc6b14f --- /dev/null +++ b/scripts/build_release_notebook_01.py @@ -0,0 +1,437 @@ +"""One-shot builder for ``release/notebooks/01_baseline_lead_scoring.ipynb``. + +Run from the repository root:: + + python scripts/build_release_notebook_01.py + +Cells are assigned deterministic IDs by ``_release_notebook_common`` so +re-running yields a byte-identical file — same audit-artifact-sync +pattern PR 4.1 / 5.1 / 5.2 use for ``release/`` artifacts. The byte +equality is enforced in CI by ``tests/scripts/test_release_notebook_builders.py``. +""" + +from __future__ import annotations + +import sys +from pathlib import Path + +# Make ``scripts/`` importable regardless of how this file is loaded +# (mirrors ``scripts/package_hf_release.py``'s sys.path dance). 
+sys.path.insert(0, str(Path(__file__).resolve().parent)) + +import nbformat as nbf # noqa: E402 — must follow sys.path insert +from _release_notebook_common import ( # noqa: E402 — must follow sys.path insert + assemble_notebook, + builder_arg_parser, + code, + md, + write_notebook, +) + +DEFAULT_OUT = ( + Path(__file__).resolve().parents[1] / "release" / "notebooks" / "01_baseline_lead_scoring.ipynb" +) + + +def cells() -> list[nbf.NotebookNode]: + return [ + md( + """ + # Notebook 01 — Baseline Lead Scoring + + **Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier (the + release default). + + **Goal:** train Logistic Regression and Histogram Gradient Boosting + baselines on the snapshot-safe public bundle, and verify they + reproduce the cross-seed-median metrics in + [`release/validation/validation_report.md`](../validation/validation_report.md) + within the per-metric tolerances fixed by acceptance gate **G13.2**. + + **Public path discipline (G13.3).** This notebook reads only from + the public bundle at `release/intermediate/`. The instructor + companion (`release/intermediate_instructor/`, with full-horizon + event tables, the latent registry, the hidden DAG, and the + mechanism summary) is **not** loaded — public modelling work must + never depend on instructor-only artefacts. + """ + ), + md("## 1. 
Setup"), + code( + """ + from __future__ import annotations + + import json + import sys + from pathlib import Path + + import numpy as np + import pandas as pd + from sklearn.compose import ColumnTransformer + from sklearn.ensemble import HistGradientBoostingClassifier + from sklearn.impute import SimpleImputer + from sklearn.linear_model import LogisticRegression + from sklearn.metrics import ( + average_precision_score, + brier_score_loss, + roc_auc_score, + ) + from sklearn.pipeline import Pipeline + from sklearn.preprocessing import OneHotEncoder, StandardScaler + + sys.path.insert(0, str(Path.cwd())) + from _notebook_utils import ( + assert_within_tolerance, + precision_at_k, + top_decile_rate, + ) + + SEED = 42 + BUNDLE = Path("../intermediate") # public student bundle + TASK = "converted_within_90_days" + """ + ), + md( + """ + ## 2. Reproduction targets + + We pin the cross-seed-median metrics for the *intermediate* tier + (seeds 42–46) from `release/validation/validation_report.json`. + The targets live in a sibling file + (`release/notebooks/_release_targets.json`) so they can't drift + from the validation report without an audit-sync test failure + in CI. + + **Per-metric tolerances** are tighter than a flat 5 % band: the + cross-seed standard deviation in the report is well under 0.02 + on AUC and Brier, and a flat ±0.05 would let a regression slip + through. Average-precision and the small-`k` `top_decile_rate` + stay at ±0.05 because their seed-to-seed variance is larger. + """ + ), + code( + """ + with (Path.cwd() / "_release_targets.json").open() as fh: + targets = json.load(fh)["intermediate"] + + # Re-key the validation report's metric names into the metric + # names this notebook prints below, so the gate compares apples + # to apples. 
+ VALIDATION_REPORT_TARGETS = { + "lr_auc": targets["lr_auc"], + "gbm_auc": targets["gbm_auc"], + "lr_average_precision": targets["lr_average_precision"], + "lr_brier": targets["brier_score"], + "lr_top_decile_rate": targets["top_decile_rate"], + } + TOLERANCES = { + "lr_auc": 0.02, # G13.2 — tighter than a flat 5% + "gbm_auc": 0.02, + "lr_average_precision": 0.05, # higher seed variance + "lr_brier": 0.02, + "lr_top_decile_rate": 0.05, # small-k variance + } + for k, v in VALIDATION_REPORT_TARGETS.items(): + print(f" target {k:<24s} {v:.4f} (tol ±{TOLERANCES[k]:.2f})") + """ + ), + md( + """ + ## 3. Load the bundle + + We load the parquet task splits — the canonical format the + release ships in. The accompanying `lead_scoring.csv` is a + convenience export with the same rows but coerced dtypes; + sticking with parquet preserves nullable `Int64` / `Float64` / + `boolean` columns the way the validator sees them. + """ + ), + code( + """ + train = pd.read_parquet(BUNDLE / "tasks" / TASK / "train.parquet") + valid = pd.read_parquet(BUNDLE / "tasks" / TASK / "valid.parquet") + test = pd.read_parquet(BUNDLE / "tasks" / TASK / "test.parquet") + + with (BUNDLE / "manifest.json").open() as fh: + manifest = json.load(fh) + + assert manifest["exposure_mode"] == "student_public", ( + "this notebook expects the public bundle" + ) + assert manifest["relational_snapshot_safe"] is True + + print(f"Train: {len(train):,} rows") + print(f"Valid: {len(valid):,} rows (held out — not used here)") + print(f"Test: {len(test):,} rows") + print() + print(f"Bundle exposure_mode: {manifest['exposure_mode']}") + print(f"Bundle snapshot_day: {manifest['snapshot_day']}") + print(f"Bundle horizon_days: {manifest['horizon_days']}") + print() + print("Conversion rates:") + for name, df in [("train", train), ("valid", valid), ("test", test)]: + print(f" {name}: {df[TASK].mean():.1%}") + """ + ), + md( + """ + ## 4. 
Feature selection + + We use the **same feature set as `release/validation/validation_report.json`** + so the gate in section 7 is a real reproduction check rather + than a related-but-different number. That means we drop only + the IDs and the label — every other column in `train` (including + `total_touches_all`, the documented leakage trap) goes into the + pipeline. + + **About `total_touches_all`.** The feature dictionary flags it + with `leakage_risk = True`: it counts touches over the full + 90-day horizon, which is post-snapshot data. The validation + report keeps it in the panel anyway because (a) its standalone + AUC is barely above 0.55 (see the *post_snapshot_aggregates* + baseline column in the report) and (b) the report exists to + measure the v1 dataset's *as-shipped* difficulty, leakage trap + included. **Notebook 03** *(coming in PR 6.2)* walks through + what dropping the trap does to performance and how to detect + similar traps from feature audits alone. + """ + ), + code( + """ + feat_dict = pd.read_csv(BUNDLE / "feature_dictionary.csv") + trap_cols = feat_dict.loc[ + feat_dict["leakage_risk"].astype(bool), "name" + ].tolist() + ID_COLS = ["account_id", "contact_id", "lead_id", "lead_created_at"] + # Mirrors ``release_quality._partition_columns`` — IDs + label only. + EXCLUDE = set(ID_COLS + [TASK]) + + feature_cols = [c for c in train.columns if c not in EXCLUDE] + cat_cols = [ + c + for c in feature_cols + if not ( + pd.api.types.is_bool_dtype(train[c]) + or pd.api.types.is_numeric_dtype(train[c]) + ) + ] + num_cols = [c for c in feature_cols if c not in cat_cols] + + print(f"Leakage-trap columns kept (see narrative above): {trap_cols}") + print(f"Categorical features ({len(cat_cols)}): {cat_cols}") + print(f"Numeric features ({len(num_cols)}): {num_cols}") + """ + ), + md( + """ + ## 5. 
Preprocessing pipeline + + Mirrors `leadforge.validation.release_quality._build_pipeline` + so the notebook's metric panel and the validation report's + metric panel agree by construction: + + - numeric: median-impute, then `StandardScaler` + - categorical: most-frequent-impute, then dense `OneHotEncoder` + with `handle_unknown="ignore"` + """ + ), + code( + """ + def _sanitize_categoricals(df: pd.DataFrame) -> pd.DataFrame: + out = df.copy() + for c in cat_cols: + out[c] = out[c].astype(object).where(out[c].notna(), None) + return out + + x_train = _sanitize_categoricals(train[feature_cols]) + x_test = _sanitize_categoricals(test[feature_cols]) + y_train = train[TASK].astype("boolean").fillna(False).astype(int).to_numpy() + y_test = test[TASK].astype("boolean").fillna(False).astype(int).to_numpy() + + numeric_t = Pipeline( + [("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())] + ) + categorical_t = Pipeline( + [ + ("imputer", SimpleImputer(strategy="most_frequent")), + ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)), + ] + ) + preprocessor = ColumnTransformer( + transformers=[ + ("num", numeric_t, num_cols), + ("cat", categorical_t, cat_cols), + ], + remainder="drop", + ) + """ + ), + md("## 6. 
Train baselines and score the test split"), + code( + """ + lr_pipe = Pipeline( + [ + ("preprocessor", preprocessor), + ( + "classifier", + LogisticRegression( + max_iter=1000, solver="lbfgs", random_state=SEED + ), + ), + ] + ) + gbm_pipe = Pipeline( + [ + ("preprocessor", preprocessor), + ("classifier", HistGradientBoostingClassifier(random_state=SEED)), + ] + ) + + lr_pipe.fit(x_train, y_train) + gbm_pipe.fit(x_train, y_train) + + lr_probs = lr_pipe.predict_proba(x_test)[:, 1] + gbm_probs = gbm_pipe.predict_proba(x_test)[:, 1] + + metrics = { + "lr_auc": float(roc_auc_score(y_test, lr_probs)), + "gbm_auc": float(roc_auc_score(y_test, gbm_probs)), + "lr_average_precision": float(average_precision_score(y_test, lr_probs)), + "lr_brier": float(brier_score_loss(y_test, lr_probs)), + "lr_top_decile_rate": top_decile_rate(lr_probs, y_test), + # Print-only; not pinned (the validation report tracks + # ``top_decile_rate`` instead, which we gate above). + "lr_precision_at_50": precision_at_k(lr_probs, y_test, 50), + "lr_precision_at_100": precision_at_k(lr_probs, y_test, 100), + "lr_precision_at_200": precision_at_k(lr_probs, y_test, 200), + } + for k, v in metrics.items(): + print(f" {k:<24s} {v:.4f}") + """ + ), + md( + """ + ## 7. Tolerance check (G13.2) + + The notebook's printed metrics must match the cross-seed medians + in `validation_report.json` to within the per-metric tolerances + declared in section 2. If a future change breaks this, the + assertion below fails — and CI catches it, because the same + cell runs under `nbclient` in the `notebooks` job. + """ + ), + code( + """ + assert_within_tolerance( + observed=metrics, + target=VALIDATION_REPORT_TARGETS, + tolerances=TOLERANCES, + label="notebook 01 vs validation_report.json (intermediate tier)", + ) + print("OK — all gated metrics are within their per-metric tolerance.") + """ + ), + md( + """ + ## 8. 
Decile lift chart + + Standard sanity-check for ranking quality: sort the test set by + score, bucket into deciles, plot the per-decile conversion rate + vs the base rate. + """ + ), + code( + """ + import matplotlib.pyplot as plt + + order = np.argsort(-lr_probs, kind="stable") + y_sorted = y_test[order] + n = len(y_test) + edges = np.linspace(0, n, 11, dtype=int) + decile_rate = np.array( + [y_sorted[edges[i] : edges[i + 1]].mean() for i in range(10)] + ) + base_rate = y_test.mean() + + fig, ax = plt.subplots(figsize=(7, 4)) + ax.bar(range(1, 11), decile_rate, color="#3b82f6") + ax.axhline( + base_rate, + color="#ef4444", + linestyle="--", + label=f"base rate ({base_rate:.1%})", + ) + ax.set_xticks(range(1, 11)) + ax.set_xlabel("Score decile (1 = highest)") + ax.set_ylabel("Conversion rate") + ax.set_title("LR decile lift — intermediate tier (seed 42)") + ax.legend(loc="upper right") + plt.tight_layout() + plt.show() + """ + ), + md( + """ + ## 9. Calibration plot + + Reliability diagram: bin predicted probabilities into 10 equal- + width buckets, plot mean predicted vs mean observed. The + validation report's reference reliability plot for the + intermediate tier lives at + `release/validation/figures/calibration_intermediate.png`. 
+ """ + ), + code( + """ + edges = np.linspace(0.0, 1.0, 11) + mean_pred = [] + mean_actual = [] + for i in range(10): + lo, hi = edges[i], edges[i + 1] + mask = (lr_probs >= lo) & ((lr_probs <= hi) if i == 9 else (lr_probs < hi)) + if mask.sum() == 0: + continue + mean_pred.append(lr_probs[mask].mean()) + mean_actual.append(y_test[mask].mean()) + + fig, ax = plt.subplots(figsize=(5, 5)) + ax.plot([0, 1], [0, 1], color="#9ca3af", linestyle="--", label="perfect calibration") + ax.plot(mean_pred, mean_actual, marker="o", color="#3b82f6", label="LR") + ax.set_xlim(0, 1) + ax.set_ylim(0, 1) + ax.set_xlabel("Mean predicted probability") + ax.set_ylabel("Observed conversion rate") + ax.set_title("Calibration — LR, intermediate tier (seed 42)") + ax.legend(loc="upper left") + plt.tight_layout() + plt.show() + """ + ), + md( + """ + ## 10. Next + + - **Notebook 02** — engineer features by joining the snapshot- + safe relational tables under `release/intermediate/tables/`, + then measure the lift over the flat-CSV LR baseline above. + - **Notebook 03** *(coming in PR 6.2)* — leakage and time-window + walkthrough; works through what `total_touches_all` does to + your AUC if you forget to drop it. + - **Notebook 04** *(coming in PR 6.2)* — value-aware ranking + (`expected_acv` × P(convert)), threshold selection, and the + cohort-shift stress test. + """ + ), + ] + + +def main() -> None: + args = builder_arg_parser( + default_out=DEFAULT_OUT, + description="Build release/notebooks/01_baseline_lead_scoring.ipynb", + ).parse_args() + write_notebook(args.out, assemble_notebook(cells())) + + +if __name__ == "__main__": + main() diff --git a/scripts/build_release_notebook_02.py b/scripts/build_release_notebook_02.py new file mode 100644 index 0000000..ea2cf67 --- /dev/null +++ b/scripts/build_release_notebook_02.py @@ -0,0 +1,663 @@ +"""One-shot builder for ``release/notebooks/02_relational_feature_engineering.ipynb``. 
+ +Run from the repository root:: + + python scripts/build_release_notebook_02.py + +Cells are assigned deterministic IDs by ``_release_notebook_common`` so +re-running yields a byte-identical file — same audit-artifact-sync +pattern PR 4.1 / 5.1 / 5.2 use for ``release/`` artifacts. +""" + +from __future__ import annotations + +import sys +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).resolve().parent)) + +import nbformat as nbf # noqa: E402 — must follow sys.path insert +from _release_notebook_common import ( # noqa: E402 — must follow sys.path insert + assemble_notebook, + builder_arg_parser, + code, + md, + write_notebook, +) + +DEFAULT_OUT = ( + Path(__file__).resolve().parents[1] + / "release" + / "notebooks" + / "02_relational_feature_engineering.ipynb" +) + + +def cells() -> list[nbf.NotebookNode]: + return [ + md( + """ + # Notebook 02 — Relational Feature Engineering + + **Dataset:** `leadforge-lead-scoring-v1`, *intermediate* tier. + + The flat task split in notebook 01 is one snapshot view of a + richer relational world. The public bundle ships seven + **snapshot-safe** tables under `release/intermediate/tables/`: + every event row is filtered to `timestamp <= lead_created_at + + snapshot_day`, so any join you can write is **leakage-safe by + construction**. + + This notebook walks through: + + 1. Loading the seven public tables. + 2. Verifying the snapshot-safe contract inline (the teachable + moment — see the contract pass, don't just read about it). + 3. Engineering four relational features, with train-only + discipline for any aggregation that crosses splits. + 4. Training a HistGBM on `flat ∪ engineered` columns. + 5. Reporting the AUC / AP / Brier / P@K delta vs the flat-CSV + baseline, with a tolerance gate that fails CI if the + headline lift regresses. + + **Public path discipline (G13.3).** This notebook reads only + from `release/intermediate/`. 
The instructor companion + (`release/intermediate_instructor/`) is **not** loaded — + relational feature engineering must work from the public + artefact alone. Tables omitted from the public bundle on + purpose (`customers`, `subscriptions`) live only in the + instructor companion because their mere existence reconstructs + the label. + + **Leakage-trap discipline.** Unlike notebook 01 (which + reproduces the validation report's panel verbatim and + therefore keeps `total_touches_all`), notebook 02 **drops** + the trap from the flat baseline. We're teaching feature + engineering here; mixing a known-leaky column into the + "before" panel would muddy the relational lift attribution. + """ + ), + md("## 1. Setup"), + code( + """ + from __future__ import annotations + + import json + import sys + from pathlib import Path + + import numpy as np + import pandas as pd + from sklearn.compose import ColumnTransformer + from sklearn.ensemble import HistGradientBoostingClassifier + from sklearn.impute import SimpleImputer + from sklearn.linear_model import LogisticRegression + from sklearn.metrics import ( + average_precision_score, + brier_score_loss, + roc_auc_score, + ) + from sklearn.pipeline import Pipeline + from sklearn.preprocessing import OneHotEncoder, StandardScaler + + sys.path.insert(0, str(Path.cwd())) + from _notebook_utils import assert_within_tolerance, precision_at_k + + SEED = 42 + BUNDLE = Path("../intermediate") # public student bundle + TASK = "converted_within_90_days" + + with (BUNDLE / "manifest.json").open() as fh: + manifest = json.load(fh) + assert manifest["exposure_mode"] == "student_public" + assert manifest["relational_snapshot_safe"] is True + SNAPSHOT_DAY = int(manifest["snapshot_day"]) + print(f"snapshot_day = {SNAPSHOT_DAY}") + """ + ), + md( + """ + ## 2. Load the seven public tables + + These are the only tables present under `release/intermediate/ + tables/`. 
`customers` and `subscriptions` are deliberately + absent — they only exist for *converted* leads, so their + presence in a public bundle would reconstruct the label. + """ + ), + code( + """ + PUBLIC_TABLES = ( + "accounts", + "contacts", + "leads", + "touches", + "sessions", + "sales_activities", + "opportunities", + ) + tables = { + name: pd.read_parquet(BUNDLE / "tables" / f"{name}.parquet") + for name in PUBLIC_TABLES + } + for name in PUBLIC_TABLES: + print(f" {name:<20s} {tables[name].shape}") + + train = pd.read_parquet(BUNDLE / "tasks" / TASK / "train.parquet") + test = pd.read_parquet(BUNDLE / "tasks" / TASK / "test.parquet") + """ + ), + md( + """ + ## 3. Verify the snapshot-safe contract + + For every event-table row joined back to its lead, assert + `timestamp <= lead_created_at + snapshot_day`. The reported + **headroom** is the *minimum* gap any event row leaves between + its timestamp and the snapshot cutoff — a non-negative number + when the contract holds. Showing the actual minimum (rather + than just "all <= cutoff") makes the receipt honest: if a + future regeneration ever shaves the contract close, you'll see + the headroom shrink before the assertion fires. 
+ """ + ), + code( + """ + leads_index = tables["leads"][["lead_id", "lead_created_at"]].copy() + leads_index["lead_created_at"] = pd.to_datetime(leads_index["lead_created_at"]) + cutoff = leads_index.assign( + cutoff=leads_index["lead_created_at"] + pd.Timedelta(days=SNAPSHOT_DAY) + ) + + EVENT_TABLES = [ + ("touches", "touch_timestamp"), + ("sessions", "session_timestamp"), + ("sales_activities", "activity_timestamp"), + ("opportunities", "created_at"), + ] + + for tbl, ts_col in EVENT_TABLES: + df = tables[tbl][["lead_id", ts_col]].merge( + cutoff[["lead_id", "cutoff"]], on="lead_id", how="left" + ) + df[ts_col] = pd.to_datetime(df[ts_col]) + violations = df[df[ts_col] > df["cutoff"]] + assert len(violations) == 0, ( + f"{tbl}.{ts_col}: {len(violations)} rows past snapshot cutoff" + ) + min_headroom = (df["cutoff"] - df[ts_col]).min() + min_headroom_days = float(min_headroom.total_seconds()) / 86400.0 + print( + f" {tbl}.{ts_col}: {len(df):>6,} rows; " + f"min headroom under cutoff: {min_headroom_days:6.2f} days" + ) + + print() + print("OK — every event row in every public event table satisfies") + print(f" timestamp <= lead_created_at + {SNAPSHOT_DAY} days.") + """ + ), + md( + """ + ## 4. Engineered features + + We build four relational features. Each one starts as a + per-lead aggregate over a single snapshot-safe table, then + joins back into the per-lead snapshot. Aggregations that pool + across leads (account-level density, target encoding) are + **fit on the train split only** and applied to test via + join — same train-only discipline that prevents target leakage + in mean encoding. 
+ + | # | Feature | Source table(s) | Aggregation | + |---|---|---|---| + | 1 | `touches_ch_*` (3 cols) | `touches` | per-lead × `touch_channel` count | + | 2 | `account_avg_touches_per_lead` | `touches`, `leads`, train lead set | account-level rollup over **train leads only**, then merge back | + | 3 | `days_since_last_activity` | `sales_activities`, `leads` | per-lead recency at snapshot cutoff | + | 4 | `industry_target_encoding_train` | `accounts`, train | mean-target encoding **fit on train only** | + """ + ), + md("### 4.1 Touch-channel breakdown"), + code( + """ + touches = tables["touches"] + channel_counts = ( + touches.groupby(["lead_id", "touch_channel"]).size().unstack(fill_value=0) + ) + channel_counts.columns = [f"touches_ch_{c}" for c in channel_counts.columns] + print(f"channel feature columns: {list(channel_counts.columns)}") + channel_counts.head() + """ + ), + md( + """ + ### 4.2 Account-level touch density + + How active is the account this lead belongs to, on average? An + account with many leads and many touches per lead is a + structurally different prospect than a one-touch account. + + **Train-only.** We compute the account-level rollup using only + train leads' touches. Test leads whose account has no touches + from train leads fall back to 0 via `fillna`. This avoids letting test rows + influence the train feature distribution — same discipline + applied to mean-target encoding in 4.4.
+ """ + ), + code( + """ + train_lead_ids = set(train["lead_id"].tolist()) + tch_with_acct = touches.merge( + tables["leads"][["lead_id", "account_id"]], on="lead_id", how="left" + ) + tch_train = tch_with_acct[tch_with_acct["lead_id"].isin(train_lead_ids)] + account_density = ( + tch_train.groupby("account_id") + .agg( + account_total_touches=("touch_id", "count"), + account_lead_count=("lead_id", "nunique"), + ) + .assign( + account_avg_touches_per_lead=lambda d: ( + d["account_total_touches"] / d["account_lead_count"] + ) + ) + .reset_index() + ) + print( + f"account_density rows: {len(account_density):,} " + f"(accounts represented in train touches)" + ) + account_density.head() + """ + ), + md( + """ + ### 4.3 Sales-activity recency at snapshot + + Days between the lead's most recent sales activity and the + snapshot cutoff (`lead_created_at + snapshot_day`). Recency + is a classic engagement signal that's surprisingly hard to + recover from the flat snapshot directly. + """ + ), + code( + """ + sa = tables["sales_activities"][["lead_id", "activity_timestamp"]].copy() + sa["activity_timestamp"] = pd.to_datetime(sa["activity_timestamp"]) + last_activity = ( + sa.groupby("lead_id")["activity_timestamp"] + .max() + .reset_index() + .rename(columns={"activity_timestamp": "last_activity_at"}) + ) + last_activity = last_activity.merge(cutoff[["lead_id", "cutoff"]], on="lead_id") + last_activity["days_since_last_activity"] = ( + last_activity["cutoff"] - last_activity["last_activity_at"] + ).dt.total_seconds() / 86400 + last_activity[["lead_id", "days_since_last_activity"]].head() + """ + ), + md( + """ + ### 4.4 Industry target encoding (train-only, leakage-safe) + + Replace the `industry` string with the conversion rate observed + for that industry **on the training split only**. Computing + the encoding on test leaks the test labels into the features — + a textbook mistake; we avoid it explicitly. 
+ """ + ), + code( + """ + tgt_enc = train.groupby("industry")[TASK].mean().to_dict() + tgt_enc_global_mean = float(train[TASK].mean()) + print(f"per-industry train conversion rates ({len(tgt_enc)} industries):") + for k, v in sorted(tgt_enc.items()): + print(f" {k:<20s} {v:.3f}") + print(f" fallback global mean: {tgt_enc_global_mean:.3f}") + """ + ), + md("### 4.5 Stitch features onto train and test"), + code( + """ + ENGINEERED_NUMERIC = ( + list(channel_counts.columns) + + [ + "account_avg_touches_per_lead", + "days_since_last_activity", + "industry_target_encoding_train", + ] + ) + + def attach_engineered(df: pd.DataFrame) -> pd.DataFrame: + out = df.copy() + out = out.merge(channel_counts, on="lead_id", how="left") + for col in channel_counts.columns: + out[col] = out[col].fillna(0).astype(int) + out = out.merge( + account_density[["account_id", "account_avg_touches_per_lead"]], + on="account_id", + how="left", + ) + out["account_avg_touches_per_lead"] = ( + out["account_avg_touches_per_lead"].fillna(0).astype(float) + ) + out = out.merge( + last_activity[["lead_id", "days_since_last_activity"]], + on="lead_id", + how="left", + ) + out["industry_target_encoding_train"] = ( + out["industry"].map(tgt_enc).fillna(tgt_enc_global_mean) + ) + return out + + train_eng = attach_engineered(train) + test_eng = attach_engineered(test) + print( + f"train_eng shape: {train_eng.shape} " + f"({train.shape[1]} -> {train_eng.shape[1]} cols)" + ) + print(f"new columns: {ENGINEERED_NUMERIC}") + """ + ), + md( + """ + ## 5. Baseline + engineered models + + Same pipeline as notebook 01 (mirrors + `leadforge.validation.release_quality._build_pipeline`). 
We + train four models so the comparison is fair: + + | Model | Features | Compares against | + |---|---|---| + | LR-flat | flat snapshot, no trap | (validation report baseline) | + | GBM-flat | flat snapshot, no trap | LR-flat | + | LR-eng | flat ∪ engineered, no trap, no raw `industry` | LR-flat | + | GBM-eng | flat ∪ engineered, no trap, no raw `industry` | GBM-flat — the headline lift | + + The engineered (`+rel`) pipelines drop the raw `industry` categorical + because the train-only target encoding already represents it + as a numeric column — feeding both would encode the same + information twice. + """ + ), + code( + """ + feat_dict = pd.read_csv(BUNDLE / "feature_dictionary.csv") + trap_cols = feat_dict.loc[ + feat_dict["leakage_risk"].astype(bool), "name" + ].tolist() + ID_COLS = ["account_id", "contact_id", "lead_id", "lead_created_at"] + EXCLUDE = set(ID_COLS + trap_cols + [TASK]) + + base_cols = [c for c in train.columns if c not in EXCLUDE] + cat_base = [ + c + for c in base_cols + if not ( + pd.api.types.is_bool_dtype(train[c]) + or pd.api.types.is_numeric_dtype(train[c]) + ) + ] + num_base = [c for c in base_cols if c not in cat_base] + # Drop raw `industry` from the +rel categorical list so the +rel + # pipelines don't see it twice (one-hot + target-encoded).
+ cat_eng = [c for c in cat_base if c != "industry"] + print( + f"flat features: {len(base_cols)} " + f"(numeric={len(num_base)}, categorical={len(cat_base)})" + ) + print(f"engineered (numeric only): {len(ENGINEERED_NUMERIC)}") + print(f"+rel categorical list drops: {set(cat_base) - set(cat_eng)}") + """ + ), + code( + """ + def _sanitize(df: pd.DataFrame, cat_cols: list[str]) -> pd.DataFrame: + out = df.copy() + for c in cat_cols: + out[c] = out[c].astype(object).where(out[c].notna(), None) + return out + + def build_pipeline(num_cols: list[str], cat_cols: list[str], *, model: str) -> Pipeline: + numeric_t = Pipeline( + [ + ("imputer", SimpleImputer(strategy="median")), + ("scaler", StandardScaler()), + ] + ) + cat_t = Pipeline( + [ + ("imputer", SimpleImputer(strategy="most_frequent")), + ( + "encoder", + OneHotEncoder(handle_unknown="ignore", sparse_output=False), + ), + ] + ) + pre = ColumnTransformer( + [("num", numeric_t, num_cols), ("cat", cat_t, cat_cols)], + remainder="drop", + ) + if model == "lr": + clf = LogisticRegression(max_iter=1000, solver="lbfgs", random_state=SEED) + else: + clf = HistGradientBoostingClassifier(random_state=SEED) + return Pipeline([("preprocessor", pre), ("classifier", clf)]) + + def fit_and_score( + x_train_df: pd.DataFrame, + x_test_df: pd.DataFrame, + num_cols: list[str], + cat_cols: list[str], + *, + model: str, + ) -> np.ndarray: + pipe = build_pipeline(num_cols, cat_cols, model=model) + pipe.fit(_sanitize(x_train_df, cat_cols), y_train) + return pipe.predict_proba(_sanitize(x_test_df, cat_cols))[:, 1] + + y_train = train[TASK].astype("boolean").fillna(False).astype(int).to_numpy() + y_test = test[TASK].astype("boolean").fillna(False).astype(int).to_numpy() + base_rate = float(y_test.mean()) + """ + ), + code( + """ + num_eng = num_base + ENGINEERED_NUMERIC + + probs_lr_flat = fit_and_score( + train[base_cols], test[base_cols], num_base, cat_base, model="lr" + ) + probs_gbm_flat = fit_and_score( + train[base_cols], 
test[base_cols], num_base, cat_base, model="gbm" + ) + probs_lr_eng = fit_and_score( + train_eng[base_cols + ENGINEERED_NUMERIC], + test_eng[base_cols + ENGINEERED_NUMERIC], + num_eng, cat_eng, model="lr", + ) + probs_gbm_eng = fit_and_score( + train_eng[base_cols + ENGINEERED_NUMERIC], + test_eng[base_cols + ENGINEERED_NUMERIC], + num_eng, cat_eng, model="gbm", + ) + """ + ), + code( + """ + def panel(probs: np.ndarray, label: str) -> dict[str, float]: + return { + "label": label, + "auc": float(roc_auc_score(y_test, probs)), + "ap": float(average_precision_score(y_test, probs)), + "brier": float(brier_score_loss(y_test, probs)), + "p@100": precision_at_k(probs, y_test, 100), + } + + rows = [ + panel(probs_lr_flat, "LR flat "), + panel(probs_gbm_flat, "GBM flat "), + panel(probs_lr_eng, "LR flat+rel "), + panel(probs_gbm_eng, "GBM flat+rel "), + ] + print(f"base rate: {base_rate:.3f}") + print() + print(f"{'model':<18s} {'AUC':>7s} {'AP':>7s} {'Brier':>7s} {'P@100':>7s}") + for r in rows: + print( + f"{r['label']:<18s} {r['auc']:.4f} {r['ap']:.4f} " + f"{r['brier']:.4f} {r['p@100']:.4f}" + ) + """ + ), + md( + """ + ## 6. Lift over the flat baseline + + *Sign convention.* `ΔAUC`, `ΔAP`, `ΔP@100`: **higher is + better** (positive = engineered features helped). `ΔBrier`: + **lower is better** (Brier is a loss, so negative ΔBrier = + improved calibration). 
+ """ + ), + code( + """ + def delta(eng: np.ndarray, base: np.ndarray, name: str) -> dict[str, float]: + return { + "label": name, + "auc": float(roc_auc_score(y_test, eng) - roc_auc_score(y_test, base)), + "ap": float( + average_precision_score(y_test, eng) - average_precision_score(y_test, base) + ), + "brier": float( + brier_score_loss(y_test, eng) - brier_score_loss(y_test, base) + ), + "p@100": float( + precision_at_k(eng, y_test, 100) - precision_at_k(base, y_test, 100) + ), + } + + deltas = [ + delta(probs_gbm_eng, probs_gbm_flat, "GBM(eng) - GBM(flat)"), + delta(probs_gbm_eng, probs_lr_flat, "GBM(eng) - LR(flat) "), + delta(probs_lr_eng, probs_lr_flat, "LR(eng) - LR(flat) "), + ] + print(f"{'comparison':<22s} {'ΔAUC':>8s} {'ΔAP':>8s} {'ΔBrier':>8s} {'ΔP@100':>8s}") + for d in deltas: + print( + f"{d['label']:<22s} {d['auc']:+8.4f} {d['ap']:+8.4f} " + f"{d['brier']:+8.4f} {d['p@100']:+8.4f}" + ) + """ + ), + md( + """ + ## 7. Tolerance gate + + Pin the four model AUCs and the headline lift to per-metric + tolerances so a regression in any pipeline component (feature + engineering, leakage discipline, sklearn upgrade) breaks CI. + Targets and tolerances are seed-42-specific by design — this + notebook is reproducible-by-seed, not a cross-seed median. + """ + ), + code( + """ + # Single-seed (seed=42) AUCs observed on the as-shipped + # intermediate bundle. Tolerances allow ±0.02 around each + # baseline (well outside numerical jitter, well inside the + # band that would let GBM(eng) silently drop below GBM(flat)). 
+ NB02_TARGETS = { + "lr_flat_auc": 0.8737, + "gbm_flat_auc": 0.8432, + "lr_eng_auc": 0.8763, + "gbm_eng_auc": 0.8579, + "headline_lift_auc": 0.0147, # GBM(eng) - GBM(flat) + } + NB02_TOLERANCES = { + "lr_flat_auc": 0.02, + "gbm_flat_auc": 0.02, + "lr_eng_auc": 0.02, + "gbm_eng_auc": 0.02, + "headline_lift_auc": 0.015, # tighter — sign-aware below + } + + observed = { + "lr_flat_auc": rows[0]["auc"], + "gbm_flat_auc": rows[1]["auc"], + "lr_eng_auc": rows[2]["auc"], + "gbm_eng_auc": rows[3]["auc"], + "headline_lift_auc": deltas[0]["auc"], + } + assert_within_tolerance( + observed=observed, + target=NB02_TARGETS, + tolerances=NB02_TOLERANCES, + label="notebook 02 metric panel (seed 42, intermediate)", + ) + assert observed["headline_lift_auc"] > 0.0, ( + "GBM(eng) − GBM(flat) AUC went non-positive — relational " + "lift disappeared; investigate before merging." + ) + print("OK — all panel metrics within tolerance and headline lift is positive.") + """ + ), + md( + """ + ## 8. Honest takeaway + + On seed 42 the GBM(eng) − GBM(flat) AUC lift is small + (+0.0147). The cross-seed spread for `gbm_auc` on this bundle + is ~0.027 (see `release/validation/validation_report.json`, + `tiers.intermediate.spreads.gbm_auc`), so a single-seed lift + of this size is **suggestive, not conclusive**. Confirming a + real signal needs a seed sweep; see the cohort-shift / seed + harness coming in PR 6.2's notebook 04. + + The lift also does **not** flip the sign of the GBM-vs-LR + comparison: GBM(eng) is still slightly below LR(flat) + (AUC 0.8579 vs 0.8737 on this seed). This + is the same v1 finding documented in + `release/validation/validation_report.md` (gate **G7.4.4**) + and the dataset card: the v1 snapshot is dominated by + roughly-linear signal, and HistGBM doesn't consistently beat + LR on it. Engineered relational features narrow the gap; on + this seed they don't yet erase it. + + Two takeaways for downstream users: + + 1. 
**Joins on the public bundle are leakage-safe by + construction.** Section 3 above is the full proof. You can + aggregate any of the four event tables without policing the + horizon yourself. + 2. **Bring your own non-linearities.** If a feature + engineering choice (cross-table interactions, tree + kernels, learned embeddings, bigger seed sweeps) flips the + GBM-vs-LR sign reliably, that's a finding worth filing — + the *break_me_guide* template lands in PR 6.3. + + ## Next + + - **Notebook 03** *(coming in PR 6.2)* — leakage and + time-window walkthrough, including the deliberate + `total_touches_all` trap notebook 01 keeps and this notebook + drops. + - **Notebook 04** *(coming in PR 6.2)* — value-aware ranking, + calibration, and cohort-shift evaluation with a seed sweep. + """ + ), + ] + + +def main() -> None: + args = builder_arg_parser( + default_out=DEFAULT_OUT, + description="Build release/notebooks/02_relational_feature_engineering.ipynb", + ).parse_args() + write_notebook(args.out, assemble_notebook(cells())) + + +if __name__ == "__main__": + main() diff --git a/tests/release/__init__.py b/tests/release/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/tests/release/notebooks/__init__.py b/tests/release/notebooks/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/tests/release/notebooks/test_execute_notebooks.py b/tests/release/notebooks/test_execute_notebooks.py new file mode 100644 index 0000000..24beec3 --- /dev/null +++ b/tests/release/notebooks/test_execute_notebooks.py @@ -0,0 +1,67 @@ +"""End-to-end execution gate for the public release notebooks (G13.1). + +Each notebook under ``release/notebooks/*.ipynb`` is executed top to +bottom with ``nbclient`` against the committed public release bundles. +A clean run is the contract: + +* G13.1 — every cell executes from a clean kernel without raising. 
+* G13.2 — notebook 01's ``assert_within_tolerance`` cell pins notebook + metrics to the cross-seed-median targets in + ``release/notebooks/_release_targets.json`` (per-metric tolerances; + AUC/Brier ±0.02, AP / top-decile ±0.05). The targets file is itself + audit-synced against ``release/validation/validation_report.json`` + by ``test_release_targets_match_report.py``, so the gate + transitively pins to the validation report. +* G13.3 — neither notebook touches ``release/intermediate_instructor`` + or any other instructor artefact (enforced by the notebooks' own + ``BUNDLE = Path("../intermediate")`` path discipline; an instructor + load would fail the public-mode manifest assertion in cell 1). + +The test is gated on the public release bundles being on disk (matches +the HF/Kaggle smoke-test pattern in ``tests/scripts/test_package_*``). +``nbclient`` lives in the optional ``[notebooks]`` extra; ``importorskip`` +keeps the dev install lean while letting the dedicated CI job run the +gate against ``pip install -e ".[dev,scripts,notebooks]"``. +""" + +from __future__ import annotations + +from pathlib import Path + +import pytest + +_REPO_ROOT = Path(__file__).resolve().parents[3] +_NOTEBOOKS_DIR = _REPO_ROOT / "release" / "notebooks" +_RELEASE_BUNDLES_PRESENT = (_REPO_ROOT / "release" / "intermediate" / "manifest.json").exists() + +_NOTEBOOKS = [ + "01_baseline_lead_scoring.ipynb", + "02_relational_feature_engineering.ipynb", +] + + +@pytest.mark.skipif(not _RELEASE_BUNDLES_PRESENT, reason="release bundles not present") +@pytest.mark.parametrize("notebook_name", _NOTEBOOKS) +def test_notebook_executes_end_to_end(notebook_name: str) -> None: + """Execute the notebook with nbclient and surface any cell error. 
+ + Each notebook hard-codes ``BUNDLE = Path("../intermediate")`` and + ``sys.path.insert(0, str(Path.cwd()))`` to import the sibling + ``_notebook_utils`` module — both work iff the kernel cwd is the + notebook directory, so we set ``resources={"metadata": {"path": ...}}`` + accordingly. + """ + nbformat = pytest.importorskip("nbformat", reason="nbformat not installed") + nbclient = pytest.importorskip("nbclient", reason="nbclient not installed") + pytest.importorskip("sklearn", reason="scikit-learn not installed") + pytest.importorskip("matplotlib", reason="matplotlib not installed") + + notebook_path = _NOTEBOOKS_DIR / notebook_name + nb = nbformat.read(notebook_path, as_version=4) + client = nbclient.NotebookClient( + nb, + timeout=180, + kernel_name="python3", + resources={"metadata": {"path": str(_NOTEBOOKS_DIR)}}, + ) + client.execute() diff --git a/tests/release/notebooks/test_notebook_utils.py b/tests/release/notebooks/test_notebook_utils.py new file mode 100644 index 0000000..80e2f59 --- /dev/null +++ b/tests/release/notebooks/test_notebook_utils.py @@ -0,0 +1,247 @@ +"""Unit tests for ``release/notebooks/_notebook_utils.py``. + +The notebook helpers ship inside the public bundle (consumers download +them alongside the parquet tables), so they cannot live inside the +``leadforge`` package import tree. These tests load the module through +``importlib`` and exercise it the way a notebook cell would. 
+""" + +from __future__ import annotations + +import importlib.util +import sys +from pathlib import Path + +import numpy as np +import pytest + +_MODULE_PATH = Path(__file__).resolve().parents[3] / "release" / "notebooks" / "_notebook_utils.py" +_spec = importlib.util.spec_from_file_location("_notebook_utils", _MODULE_PATH) +assert _spec is not None +assert _spec.loader is not None +nbu = importlib.util.module_from_spec(_spec) +sys.modules["_notebook_utils"] = nbu +_spec.loader.exec_module(nbu) + + +# --------------------------------------------------------------------------- +# precision_at_k +# --------------------------------------------------------------------------- + + +def test_precision_at_k_simple() -> None: + scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5]) + y = np.array([1, 1, 0, 0, 0]) + assert nbu.precision_at_k(scores, y, 2) == pytest.approx(1.0) + assert nbu.precision_at_k(scores, y, 5) == pytest.approx(0.4) + + +def test_precision_at_k_handles_ties_via_stable_sort() -> None: + """Ties resolved by original order — same convention as + ``release_quality._precision_at_k`` so notebook and validation + report agree on tied-score rows. 
+ """ + scores = np.array([0.5, 0.5, 0.5, 0.5]) + y = np.array([1, 0, 1, 0]) + # Stable argsort of -scores preserves [0,1,2,3] order, so top-2 = y[:2] + assert nbu.precision_at_k(scores, y, 2) == pytest.approx(0.5) + + +def test_precision_at_k_invalid_k_returns_nan() -> None: + scores = np.array([0.9, 0.8]) + y = np.array([1, 0]) + assert np.isnan(nbu.precision_at_k(scores, y, 0)) + assert np.isnan(nbu.precision_at_k(scores, y, 3)) + + +def test_top_decile_rate_uses_10_percent_cut() -> None: + rng = np.random.default_rng(0) + scores = rng.random(100) + y = (scores > 0.9).astype(int) + # Every score above 0.9 is a positive, so the top 10 are all positives + # (rate 1.0) provided at least 10 of the 100 scores exceed 0.9 + assert nbu.top_decile_rate(scores, y) == pytest.approx(1.0) + + +def test_top_decile_rate_empty_returns_nan() -> None: + assert np.isnan(nbu.top_decile_rate(np.array([]), np.array([]))) + + +# --------------------------------------------------------------------------- +# assert_within_tolerance +# --------------------------------------------------------------------------- + + +def test_assert_within_tolerance_passes_when_inside_band() -> None: + nbu.assert_within_tolerance( + observed={"auc": 0.88, "ap": 0.57}, + target={"auc": 0.886, "ap": 0.575}, + tolerances=0.05, + ) + + +def test_assert_within_tolerance_fails_when_outside_band() -> None: + with pytest.raises(AssertionError) as exc: + nbu.assert_within_tolerance( + observed={"auc": 0.50}, + target={"auc": 0.886}, + tolerances=0.05, + ) + msg = str(exc.value) + assert "auc" in msg + assert "observed=0.5000" in msg + assert "target=0.8860" in msg + + +def test_assert_within_tolerance_per_metric_tolerances() -> None: + nbu.assert_within_tolerance( + observed={"auc": 0.83, "brier": 0.105}, + target={"auc": 0.886, "brier": 0.110}, + tolerances={"auc": 0.10, "brier": 0.05}, + ) + + +def test_assert_within_tolerance_reports_missing_key() -> None: + with pytest.raises(AssertionError, match="missing from observed metrics"): + nbu.assert_within_tolerance( 
observed={"auc": 0.88}, + target={"auc": 0.886, "brier": 0.110}, + tolerances=0.05, + ) + + +def test_assert_within_tolerance_label_appears_in_error() -> None: + with pytest.raises(AssertionError, match="notebook 01"): + nbu.assert_within_tolerance( + observed={"auc": 0.5}, + target={"auc": 0.886}, + tolerances=0.05, + label="notebook 01", + ) + + +def test_assert_within_tolerance_aggregates_multiple_failures() -> None: + with pytest.raises(AssertionError) as exc: + nbu.assert_within_tolerance( + observed={"auc": 0.50, "ap": 0.10}, + target={"auc": 0.886, "ap": 0.575}, + tolerances=0.05, + ) + msg = str(exc.value) + assert "auc" in msg + assert "ap" in msg + + +def test_assert_within_tolerance_ignores_extra_observed_keys() -> None: + """Observed metrics may carry extras (e.g. GBM AUC); the gate only + enforces the keys present in ``target``. + """ + nbu.assert_within_tolerance( + observed={"auc": 0.88, "ap": 0.57, "extra": 999.0}, + target={"auc": 0.886, "ap": 0.575}, + tolerances=0.05, + ) + + +# --------------------------------------------------------------------------- +# Silent-pass paths: NaN / inf observed and target, missing tolerance keys +# --------------------------------------------------------------------------- + + +def test_assert_within_tolerance_fails_on_nan_observed() -> None: + """``NaN > tol`` is ``False`` in IEEE 754; without an explicit + finiteness check a NaN-valued observed metric would slip through + the gate. 
+ """ + with pytest.raises(AssertionError, match="non-finite"): + nbu.assert_within_tolerance( + observed={"auc": float("nan")}, + target={"auc": 0.886}, + tolerances=0.05, + ) + + +def test_assert_within_tolerance_fails_on_inf_observed() -> None: + with pytest.raises(AssertionError, match="non-finite"): + nbu.assert_within_tolerance( + observed={"auc": float("inf")}, + target={"auc": 0.886}, + tolerances=0.05, + ) + + +def test_assert_within_tolerance_fails_on_nan_target() -> None: + """A NaN target would also produce a NaN diff and bypass the gate.""" + with pytest.raises(AssertionError, match="non-finite"): + nbu.assert_within_tolerance( + observed={"auc": 0.85}, + target={"auc": float("nan")}, + tolerances=0.05, + ) + + +def test_assert_within_tolerance_rejects_incomplete_tolerance_mapping() -> None: + """A per-metric tolerances dict missing keys present in ``target`` + used to default each missing key to ``+inf``, silently disabling the + gate for those metrics. The fix is to fail up front with the list + of missing keys, treating it as a configuration error. + """ + with pytest.raises(AssertionError, match="missing entries for target metrics"): + nbu.assert_within_tolerance( + observed={"auc": 0.88, "brier": 0.10}, + target={"auc": 0.886, "brier": 0.110}, + tolerances={"auc": 0.05}, # ``brier`` deliberately missing + ) + + +# --------------------------------------------------------------------------- +# precision_at_k must mirror release_quality._precision_at_k +# --------------------------------------------------------------------------- + + +def test_precision_at_k_mirrors_release_quality() -> None: + """The notebook helper's docstring claims byte-equivalence with + ``leadforge.validation.release_quality._precision_at_k`` (same + stable argsort, same tie-breaking). This test pins that claim: + if either implementation drifts, the notebook's reproduction gate + silently drifts with it. 
+ """ + from leadforge.validation import release_quality + + rng = np.random.default_rng(0) + scores = rng.random(1000) + y = (rng.random(1000) > 0.7).astype(int) + + for k in (1, 10, 50, 100, 250, 500, 999): + nbu_value = nbu.precision_at_k(scores, y, k) + rq_value = release_quality._precision_at_k(scores, y, k) + assert nbu_value == pytest.approx(rq_value), ( + f"divergence at k={k}: notebook helper {nbu_value}, release_quality {rq_value}" + ) + + # Tied scores — the convention drift this is most likely to surface. + tied_scores = np.array([0.5, 0.5, 0.5, 0.5, 0.4, 0.4, 0.3]) + tied_y = np.array([1, 0, 1, 0, 1, 1, 0]) + for k in (1, 2, 4, 6, 7): + assert nbu.precision_at_k(tied_scores, tied_y, k) == pytest.approx( + release_quality._precision_at_k(tied_scores, tied_y, k) + ) + + +def test_top_decile_rate_mirrors_release_quality() -> None: + """``release_quality._top_decile_rate`` and the notebook helper share + the same ``max(1, int(round(n * 0.1)))`` k-selection rule today. + Lock that in: if either side ever changes (e.g. switches to + ``ceil`` or ``floor`` on edge cases), the gate would silently + diverge from the validation report. Includes the exact `n` the + intermediate tier ships (``n_test = 750``) and a few small `n` + where banker's rounding bites. 
+ """ + from leadforge.validation import release_quality + + rng = np.random.default_rng(0) + for n in (5, 10, 25, 99, 100, 750, 1234): + scores = rng.random(n) + y = (rng.random(n) > 0.7).astype(int) + assert nbu.top_decile_rate(scores, y) == pytest.approx( + release_quality._top_decile_rate(scores, y) + ), f"divergence at n={n}" diff --git a/tests/release/notebooks/test_release_targets_match_report.py b/tests/release/notebooks/test_release_targets_match_report.py new file mode 100644 index 0000000..70f4edf --- /dev/null +++ b/tests/release/notebooks/test_release_targets_match_report.py @@ -0,0 +1,44 @@ +"""Audit-sync gate: ``release/notebooks/_release_targets.json`` must +mirror the cross-seed-median values in +``release/validation/validation_report.json``. + +Notebook 01 loads the targets file at runtime and pins its computed +metrics against it via ``assert_within_tolerance`` (G13.2). Without +this audit, the targets file could silently drift from the validation +report — at which point the notebook's "reproduction" gate stops +reproducing anything in particular. This test fails as soon as either +file moves out of step. 
+""" + +from __future__ import annotations + +import json +from pathlib import Path + +_REPO_ROOT = Path(__file__).resolve().parents[3] +_TARGETS_PATH = _REPO_ROOT / "release" / "notebooks" / "_release_targets.json" +_REPORT_PATH = _REPO_ROOT / "release" / "validation" / "validation_report.json" + + +def test_release_targets_match_validation_report() -> None: + targets = json.loads(_TARGETS_PATH.read_text()) + report = json.loads(_REPORT_PATH.read_text()) + + for tier_name, tier_targets in targets.items(): + if tier_name.startswith("_"): + continue # ``_doc`` and any other meta keys + assert tier_name in report["tiers"], ( + f"targets file mentions tier {tier_name!r} which is absent from " + f"validation_report.json (known tiers: {list(report['tiers'])})" + ) + report_medians = report["tiers"][tier_name]["medians"] + for metric_name, target_value in tier_targets.items(): + assert metric_name in report_medians, ( + f"{tier_name}.{metric_name}: pinned in targets file but absent " + f"from validation_report medians" + ) + assert target_value == report_medians[metric_name], ( + f"{tier_name}.{metric_name}: targets file has {target_value} " + f"but validation_report median is {report_medians[metric_name]} — " + "regenerate the report or update _release_targets.json" + ) diff --git a/tests/scripts/test_release_notebook_builders.py b/tests/scripts/test_release_notebook_builders.py new file mode 100644 index 0000000..c539a77 --- /dev/null +++ b/tests/scripts/test_release_notebook_builders.py @@ -0,0 +1,82 @@ +"""Byte-stability gate for the release-notebook builders. + +The builders advertise an audit-artifact-sync invariant (PR 4.1 / 5.1 / +5.2 pattern): re-running the builder must produce a byte-identical +``.ipynb``, and the committed file under ``release/notebooks/`` must +equal a fresh build. Without this test the invariant is wishful +thinking — ``nbformat.v4.new_*_cell`` randomises cell IDs by default, +so an unguarded builder silently diverges on every run. 
+ +The builders accept ``--out PATH`` so this test can build into +``tmp_path`` instead of mutating the committed ``release/notebooks/`` +file. That keeps the test parallel-safe (pytest-xdist running both +parametrised cases at once won't race), interrupt-safe (an interrupted +run can't leave the working tree dirty), and side-effect-free (the +committed notebook is never touched). +""" + +from __future__ import annotations + +import subprocess +import sys +from pathlib import Path + +import pytest + +_REPO_ROOT = Path(__file__).resolve().parents[2] +_SCRIPTS_DIR = _REPO_ROOT / "scripts" +_NOTEBOOKS_DIR = _REPO_ROOT / "release" / "notebooks" + +_BUILDERS: list[tuple[str, str]] = [ + ("build_release_notebook_01.py", "01_baseline_lead_scoring.ipynb"), + ("build_release_notebook_02.py", "02_relational_feature_engineering.ipynb"), +] + + +@pytest.mark.parametrize(("builder_name", "notebook_name"), _BUILDERS) +def test_builder_is_byte_stable_and_matches_committed( + tmp_path: Path, + builder_name: str, + notebook_name: str, +) -> None: + """Build twice into ``tmp_path`` via ``--out``; assert the two runs + produce byte-identical output and that the committed notebook + matches them. + """ + # ``nbformat`` lives in the optional ``[notebooks]`` extra; the + # main ``test`` CI job installs only ``[dev]`` and would otherwise + # see the subprocess-invoked builders crash with + # ``ModuleNotFoundError: nbformat``. The dedicated ``notebooks`` + # CI job installs ``[dev,scripts,notebooks]`` and runs this test + # alongside ``test_execute_notebooks.py``. 
+ pytest.importorskip("nbformat", reason="nbformat not installed (use [notebooks] extra)") + + builder_path = _SCRIPTS_DIR / builder_name + committed_path = _NOTEBOOKS_DIR / notebook_name + assert builder_path.exists(), f"missing builder: {builder_path}" + assert committed_path.exists(), f"missing committed notebook: {committed_path}" + + run_a = tmp_path / "run_a.ipynb" + run_b = tmp_path / "run_b.ipynb" + + subprocess.run( # noqa: S603 — sys.executable + repo-internal builder path + [sys.executable, str(builder_path), "--out", str(run_a)], + check=True, + cwd=_REPO_ROOT, + ) + subprocess.run( # noqa: S603 — sys.executable + repo-internal builder path + [sys.executable, str(builder_path), "--out", str(run_b)], + check=True, + cwd=_REPO_ROOT, + ) + + assert run_a.read_bytes() == run_b.read_bytes(), ( + f"{builder_name}: two runs produced different bytes — cell IDs are " + "non-deterministic; pass an explicit ``id=`` to nbformat cell " + "constructors (see scripts/_release_notebook_common.py)" + ) + assert committed_path.read_bytes() == run_a.read_bytes(), ( + f"{notebook_name}: committed file does not match a fresh build of " + f"{builder_name} — re-run the builder and commit the result " + "(audit-artifact-sync, same pattern as PR 4.1 / 5.1 / 5.2)" + )
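
The byte-stability assertion above tells builders to pass an explicit `id=` to the nbformat cell constructors. The real helper in `scripts/_release_notebook_common.py` is not shown in this diff, so the following is only a minimal sketch of one way such IDs could be derived. The names `stable_cell_id` and `assign_ids` are hypothetical, and hashing the cell's position and source is an assumption, not the project's confirmed scheme.

```python
import hashlib


def stable_cell_id(index: int, cell_type: str, source: str) -> str:
    """Hypothetical helper: derive a reproducible ID from position and content."""
    payload = f"{index}:{cell_type}:{source}".encode("utf-8")
    # A 12-character hex digest uses only [0-9a-f], which fits the
    # notebook-format constraint on cell IDs (1-64 chars of [a-zA-Z0-9-_]).
    return hashlib.sha256(payload).hexdigest()[:12]


def assign_ids(cells: list[tuple[str, str]]) -> list[str]:
    """Return one ID per (cell_type, source) pair, in notebook order."""
    return [stable_cell_id(i, kind, src) for i, (kind, src) in enumerate(cells)]


# Two identical code cells still get distinct IDs because the index is hashed in.
cells = [("markdown", "# Title"), ("code", "print('hi')"), ("code", "print('hi')")]
first = assign_ids(cells)
second = assign_ids(cells)
```

Under these assumptions, each ID would be handed to the cell constructor (for example `nbformat.v4.new_code_cell(source, id=...)`), so two builder runs serialise to the same bytes.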