PR 6.1: notebook 01 refresh + notebook 02 relational feature engineering + G13.1 CI gate#73
Merged
Merged
Conversation
…ing + G13.1 CI gate
Refreshes ``release/notebooks/01_baseline_lead_scoring.ipynb`` and adds
``release/notebooks/02_relational_feature_engineering.ipynb``, the first
two notebooks of the Phase 6 sequence. Both ship inside the public
bundle alongside the parquet tables (Kaggle/HF consumers download them
together), so they import a sibling ``_notebook_utils.py`` rather than
relying on the ``leadforge`` package being installed.
Notebook 01 (refresh) — Reproduces ``release/validation/validation_report.md``
intermediate-tier metrics within +/-0.05 (G13.2): trains LR + HistGBM on
the public ``intermediate`` bundle, reports AUC / AP / P@K(50/100/200) /
Brier, drops the documented ``total_touches_all`` leakage trap (still
mentions it for pedagogy + forward-points to notebook 03), then calls
``assert_within_tolerance`` to fail loud on drift before rendering the
decile-lift bar chart and 10-bin reliability diagram. Hard-codes
``BUNDLE = Path("../intermediate")`` and asserts
``manifest.exposure_mode == "student_public"`` so the public-only path
discipline (G13.3) is enforced inline.
Notebook 02 (new) — Loads the seven snapshot-safe public tables
(``customers`` / ``subscriptions`` are not in public bundles per
``BANNED_TABLES``), asserts every ``timestamp <= lead_created_at +
snapshot_day`` for all four event tables, demonstrates four legal joins
(touch-channel breakdown, account-level density, sales-activity recency,
train-only industry target encoding), trains LR + GBM on
flat-baseline-only and flat+relational features, and prints a
4-row metric panel + delta panel. The honest takeaway cell quotes
G7.4.4: GBM(eng)-GBM(flat) is the headline lift, GBM(eng)-LR(flat) is
still slightly negative, consistent with the v1 finding that flat-CSV
LR is hard to beat.
Helpers + tests:
- ``release/notebooks/_notebook_utils.py``: ``precision_at_k`` (stable
argsort, mirrors ``release_quality._precision_at_k``), ``top_decile_rate``,
``assert_within_tolerance`` (per-metric or scalar tolerance, aggregates
failures). Imported by both notebooks via
``sys.path.insert(0, str(Path.cwd())); from _notebook_utils import ...``.
- ``tests/release/notebooks/test_notebook_utils.py``: 12 unit tests on
the helpers (loaded via ``importlib.util`` since the module isn't on
the package import path).
- ``tests/release/notebooks/test_execute_notebooks.py``: parametrised
nbclient end-to-end execution of both notebooks, gated on the public
release bundles being on disk (matches ``test_package_*`` skip pattern).
CI wiring (G13.1):
- ``[notebooks]`` extra in ``pyproject.toml`` (nbclient, nbformat,
scikit-learn, matplotlib) -- keeps the dev install lean while letting
the dedicated CI job run the gate.
- New ``notebooks`` job in ``.github/workflows/ci.yml`` that installs
``[dev,scripts,notebooks]``, regenerates the public release bundles
via ``scripts/build_public_release.py`` (~6 s wall clock), then runs
the nbclient execution test.
Builders:
- ``scripts/_build_release_notebook_01.py`` and ``_build_release_notebook_02.py``
are nbformat-driven one-shot builders so notebook source stays
reviewable in the PR diff (instead of hand-edited JSON). Each
builder shells out to ``ruff format`` on its emitted file, so the
builder output and pre-commit hook agree byte-for-byte (preserves
the audit-artifact-sync invariant PR 4.1 / 5.1 / 5.2 established).
Per-file ruff ignore for E501 because dedented heredoc lines render
as notebook content (markdown tables, print-statement output) where
100c is the wrong yardstick.
Acceptance results on this branch:
- ruff check . / ruff format --check . -- clean
- mypy leadforge/ -- clean (82 source files)
- pytest -- 1246 passed, 5 skipped (publish-extra-gated)
- ``probe_relational_leakage`` on intro/intermediate/advanced -- 0/3 paths
flag rate on every tier
- ``verify_hash_determinism`` -- PASS 67/67
- ``validate_release_candidate --no-rebuild`` -- PASS, 0 leakage findings
- ``BUNDLE_SCHEMA_VERSION`` unchanged at 5
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Marks notebook 01 refresh and notebook 02 (relational feature engineering) done; leaves 03 (leakage + time windows) and 04 (lift / calibration / value ranking) plus the adversarial-framing artefacts (issue templates, break-me guide, v2 decision log) for later PRs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
…only joins, JSON-loaded targets
Folds in the brutal self-review I posted on this PR before it goes to
human review. Highlights:
HIGH severity
- **Builders are now byte-stable.** Cell IDs are assigned deterministically
in ``scripts/_release_notebook_common.assemble_notebook`` (was
``nbformat.v4.new_*_cell``'s random UUID), and a new
``tests/scripts/test_release_notebook_builders.py`` runs each builder
twice + diffs against the committed ``.ipynb``. The audit-artifact-sync
invariant the docstring already advertised is now actually enforced —
the test would have caught the previous revision's per-run diff.
- **Snapshot-safe verification cell prints a real number.** The previous
cell computed ``max_offset_days`` and threw it away, then printed a
hardcoded ``(max offset 0.0 days)`` literal in every row. Replaced
with ``min headroom under cutoff: {min_headroom_days:6.2f} days`` —
formatted from ``(cutoff - timestamp).min()``. On the as-shipped
intermediate bundle the receipt now reveals 735 events that land
exactly on the cutoff (headroom 0.00 days), which is a real and
teachable property of the data.
- **``account_density`` is fit on train leads only.** Cell 4.2 was
rolling up touches across train+test pooled — a leak path two cells
above the train-only target encoding that loudly avoided the same
mistake. Now filters ``tch_with_acct`` to ``train["lead_id"]`` before
the groupby; test rows in train-only-empty accounts fall back to 0
via fillna.
- **Notebook 02 has a tolerance gate.** Pinned the four model AUCs
(LR-flat / GBM-flat / LR-eng / GBM-eng) and the headline
``GBM(eng) - GBM(flat)`` lift to per-metric ±0.02 (±0.015 for the
lift), plus a sign-aware ``assert lift > 0``. Without it, a future
regression that flipped the lift negative would have passed the
nbclient gate silently.
- **Per-metric tolerances replace the flat ±0.05.** Notebook 01 now uses
±0.02 on AUC/Brier (cross-seed std is well under that) and ±0.05 on AP
/ top-decile (which have higher seed variance).
- **Notebook 01 actually reproduces the validation report.** The
previous revision dropped ``total_touches_all`` (the leakage trap)
while the report's panel keeps it — the metrics differed by ~0.03 on
GBM AUC and the ±0.05 tolerance was hiding the discrepancy. Notebook
01 now keeps the trap (matches ``release_quality._partition_columns``)
with explicit narrative + forward-pointer to notebook 03; notebook 02
drops the trap and explains why (relational lift attribution stays
clean).
MED severity
- **``release/notebooks/_release_targets.json``** is the new source of
truth for cross-seed-median targets. Notebook 01 loads it at runtime
(no more hardcoded constants) and a new
``tests/release/notebooks/test_release_targets_match_report.py`` asserts
byte equality against ``release/validation/validation_report.json``'s
``tiers.<tier>.medians``. Drift now fails CI. Discovered along the
way: the previous ``lr_precision_at_100: 0.59`` constant wasn't even
*in* the validation report — the report's ranking metric is
``top_decile_rate``. Notebook 01 now computes and gates
``top_decile_rate`` (matches the report).
- **``precision_at_k`` mirror test.** Asserts ``_notebook_utils.precision_at_k``
agrees with ``release_quality._precision_at_k`` over random + tied-score
fixtures, locking the docstring claim in code.
- **Builder duplication eliminated.** ``scripts/_release_notebook_common.py``
hosts ``md`` / ``code`` / ``assemble_notebook`` / ``write_notebook``
(incl. the deterministic-ID assignment and the ``ruff format``
subprocess); each per-notebook builder shrinks to imports + cell list
+ one ``main()``.
- **Industry redundancy fixed.** The ``+rel`` pipelines drop the raw
``industry`` categorical so LR no longer sees one-hot industry *and*
its train-only target encoding for the same column.
- **Honest takeaway softened.** +0.0147 AUC framed as
"suggestive, not conclusive" with a pointer to PR 6.2's seed-sweep
notebook 04 + the cross-seed gbm_auc spread (0.027) for context.
- **ΔBrier sign convention** spelled out under the delta panel.
LOW severity
- **Builder rename.** ``scripts/_build_release_notebook_*.py`` →
``scripts/build_release_notebook_*.py`` (drops the leading underscore
that's inconsistent with every other script in ``scripts/``).
- **CI builds only ``intermediate``.** ``build_public_release.py`` got
a ``--tier`` filter; the notebooks job uses ``--tier intermediate``,
saving 3 redundant bundle generations per CI run.
- **Whitespace + idiom cleanups.** Fixed ``[[ "lead_id"`` extra space;
replaced ``.values`` (pandas-deprecated) with ``.to_numpy()`` in both
notebooks.
- **Broken cross-references** to PR 6.2 / 6.3 / break_me_guide.md
rewritten as plain-text "(coming in PR 6.x)" caveats — no link
syntax to non-existent files.
Verification on this branch:
- ``ruff check . / ruff format --check .`` — clean
- ``mypy leadforge/`` — clean (82 source files)
- ``pytest`` — 1250 passed, 5 skipped (publish-extra-gated)
- ``probe_relational_leakage`` on intro/intermediate/advanced — 0/3
paths flag rate on every tier
- ``verify_hash_determinism`` — PASS 67/67
- ``validate_release_candidate --no-rebuild`` — PASS, 0 leakage findings
- ``BUNDLE_SCHEMA_VERSION`` unchanged at 5
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a CI “notebook execution” gate and supporting utilities/tests to keep the public release notebooks reproducible, byte-stable, and aligned with the published validation metrics.
Changes:
- Introduce notebook builders + shared builder scaffolding to generate deterministic, reviewable
.ipynbartifacts. - Add
release/notebooks/_notebook_utils.pyplus unit tests and audit-sync tests for pinned targets. - Add a GitHub Actions
notebooksjob, plus a--tieroption to build only the needed bundle for CI.
Reviewed changes
Copilot reviewed 15 out of 17 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/scripts/test_release_notebook_builders.py | Verifies notebook builders produce byte-identical notebooks and match committed artifacts. |
| tests/release/notebooks/test_release_targets_match_report.py | Ensures _release_targets.json stays in sync with validation_report.json medians. |
| tests/release/notebooks/test_notebook_utils.py | Unit tests for notebook helper utilities (precision@k, top-decile rate, tolerance gate). |
| tests/release/notebooks/test_execute_notebooks.py | End-to-end nbclient execution test for release notebooks. |
| scripts/build_release_notebook_01.py | Builder script emitting notebook 01 content deterministically. |
| scripts/build_release_notebook_02.py | Builder script emitting notebook 02 content deterministically. |
| scripts/build_public_release.py | Adds --tier flag to build only a selected bundle (used by CI notebooks job). |
| scripts/_release_notebook_common.py | Shared notebook builder helpers, deterministic cell IDs, and ruff-format step. |
| release/notebooks/_release_targets.json | Pins notebook 01 targets (audited against validation report JSON). |
| release/notebooks/_notebook_utils.py | Notebook-local metrics + tolerance assertion helpers (no package dependency). |
| release/notebooks/01_baseline_lead_scoring.ipynb | Refreshed committed notebook 01 artifact (repro gate + plots). |
| release/notebooks/02_relational_feature_engineering.ipynb | New committed notebook 02 artifact (relational FE + lift panel + gate). |
| pyproject.toml | Adds [notebooks] optional dependencies and per-file ruff ignores for notebooks/builders. |
| .github/workflows/ci.yml | Adds notebooks job to build bundle and execute notebooks. |
| .agent-plan.md | Updates project plan entry for PR 6.1 deliverables. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…rate mirror, doc/code alignment Folds in all six unresolved Copilot review threads on this PR. C1 — ``--out`` flag on the notebook builders (was: byte-stability test mutated the committed file in place). Both builders now expose ``--out PATH`` via a shared ``_release_notebook_common.builder_arg_parser``; ``tests/scripts/test_release_notebook_builders.py`` builds twice into ``tmp_path`` and diffs against the committed file. Side effect: the test is now safe under pytest-xdist parallel execution and an interrupted run can no longer leave a corrupted ``release/notebooks/<name>.ipynb``. C2 — ``assert_within_tolerance`` rejects silent-pass paths (was: NaN observed/target slipped through because ``NaN > tol`` is ``False``; per-metric tolerance maps missing keys defaulted to ``+inf``, disabling the gate for those metrics). Both paths now fail with clear messages; the docstring describes the contract. C3 — ``top_decile_rate`` gets the mirror test ``precision_at_k`` already had. Both already use the same ``max(1, int(round(n * 0.1)))`` rule as ``release_quality._top_decile_rate``, so no impl change — the test locks that in across small-``n`` (banker's rounding) and the exact ``n_test = 750`` the intermediate tier ships. Drift between ``_notebook_utils`` and ``release_quality`` would now break CI. C4 — ``.agent-plan.md`` and the GitHub PR body both rewritten to match reality after rev-2: notebook 01 *keeps* the trap (matches ``release_quality._partition_columns``, so the G13.2 reproduction gate is meaningful); notebook 02 *drops* the trap (clean lift attribution). The PR body also documents the rev-1 → rev-2 → rev-3 revision history. C5 — ``test_execute_notebooks.py`` docstring rewritten: the gate flow loads ``_release_targets.json`` (audit-synced against ``validation_report.json``), not ``validation_report.md`` directly, and tolerances are per-metric (not flat ±0.05). C6 — four new tests in ``test_notebook_utils.py`` cover the failure modes C2 plugs: NaN observed → fail, inf observed → fail, NaN target → fail, missing-tolerance-key → fail. Acceptance gate results on this branch: - ruff check . / ruff format --check . — clean - mypy leadforge/ — clean (82 source files) - pytest — 1255 passed, 5 skipped (publish-extra-gated) - probe_relational_leakage on intro/intermediate/advanced — 0/3 paths flag rate on every tier - verify_hash_determinism — PASS 67/67 - validate_release_candidate --no-rebuild — PASS, 0 leakage findings - BUNDLE_SCHEMA_VERSION unchanged at 5 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Two real CI failures on the rev-3 commit, both fixed here: 1. **Lint job** failed with ``Would reformat: release/notebooks/02_relational_feature_engineering.ipynb``. CI's ``pip install ruff`` resolved to 0.15.12; my local was 0.11.13. Newer ruff has slightly different ``.ipynb`` cell formatting rules, so the committed notebook (formatted by 0.11.13) didn't pass 0.15.12's check. Fix: pin ``ruff==0.15.12`` in the CI lint job and rebuild ``02_relational_feature_engineering.ipynb`` against 0.15.12 so committed bytes match. Pin lives only in CI (not in ``pyproject.toml`` ``[dev]``) to keep contributor laptops flexible; when CI bumps, re-run the builder locally with the matching ruff and commit the updated ``.ipynb``. 2. **Notebooks job** failed with ``jupyter_client.kernelspec.NoSuchKernel: No such kernel named python3``. GitHub-hosted runners don't ship with any Jupyter kernelspecs; ``[notebooks]`` only pulled ``nbclient`` / ``nbformat``, neither of which provides a kernelspec. The execution test passes ``kernel_name="python3"`` so it needs an actual ``python3`` spec on disk. Fix: add ``ipykernel>=6.0`` to the ``[notebooks]`` extra, plus a CI step ``python -m ipykernel install --user --name python3`` that registers the spec before nbclient boots a kernel. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Tests (Python 3.11/3.12) failed on the previous push with
``ModuleNotFoundError: No module named 'nbformat'``. The byte-
stability test in ``tests/scripts/test_release_notebook_builders.py``
shells out to the builders, which import ``nbformat`` — but the main
``test`` CI job installs ``[dev]`` only, not ``[notebooks]``.
Fix:
- ``pytest.importorskip("nbformat")`` at the top of the byte-stability
test, so the main ``test`` job skips it cleanly.
- Have the dedicated ``notebooks`` CI job (which already installs
``[dev,scripts,notebooks]``) run both ``test_execute_notebooks.py``
and ``test_release_notebook_builders.py``. That keeps the byte-
stability gate in CI without bloating the default dev install.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #73 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.Run metadata: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 6, first PR. Two notebooks land together with a new CI gate and helper test suite. Updated for the rev-2 + rev-3 self-review fixes (rev-3 = the Copilot pass; see
revisionsbelow).release/notebooks/01_baseline_lead_scoring.ipynbactually reproduces the intermediate-tier metrics inrelease/validation/validation_report.json(G13.2) by using the same feature set as the report — drops only IDs and the label, mirrorsrelease_quality._partition_columns. That means notebook 01 keepstotal_touches_all(the documented leakage trap); a narrative cell calls this out explicitly and forward-points to notebook 03 (PR 6.2), which dissects what dropping the trap does to performance. Reports AUC / AP / P@K(50/100/200) / Brier /top_decile_rate(the metric the report tracks);assert_within_tolerancefails loud on drift before rendering the decile-lift bar chart and 10-bin reliability diagram. Hard-codesBUNDLE = Path("../intermediate")and assertsmanifest.exposure_mode == "student_public"so the public-only path discipline (G13.3) is enforced inline. Targets are loaded at runtime fromrelease/notebooks/_release_targets.json(audit-synced againstvalidation_report.json); per-metric tolerances replace the original flat ±0.05 — AUC/Brier ±0.02, AP / top-decile ±0.05.release/notebooks/02_relational_feature_engineering.ipynbloads the seven snapshot-safe public tables (customers/subscriptionsare not in public bundles perBANNED_TABLES), asserts everytimestamp <= lead_created_at + snapshot_dayfor all four event tables (with real min-headroom-under-cutoff readings, not a hardcoded literal), demonstrates four legal joins (touch-channel breakdown, account-level density fit on train leads only, sales-activity recency, train-only industry target encoding), trains LR + GBM on flat-baseline-only and flat+relational features, and prints a 4-row metric panel + delta panel. Drops the leakage trap so the relational lift attribution stays clean (different goal from notebook 01: this notebook is teaching feature engineering, not reproducing the report). Pinned viaassert_within_toleranceon the four model AUCs and the headlineGBM(eng) − GBM(flat)lift, plus a sign-awareassert lift > 0. Honest takeaway cell frames the +0.0147 AUC lift as suggestive, not conclusive — the cross-seedgbm_aucspread on this bundle is ~0.027.release/notebooks/_notebook_utils.pyships inside the public bundle alongside the parquet tables — Kaggle/HF consumers download both together, so the notebooks can't depend onleadforgebeing installed. Providesprecision_at_kandtop_decile_rate(mirrorrelease_qualitybyte-for-byte; locked in by mirror tests) andassert_within_tolerance(hardened against silent-pass paths: non-finite observed/target fail loudly; per-metric tolerance maps must declare every key intarget— missing entries raise instead of defaulting to+inf). 18 unit tests undertests/release/notebooks/test_notebook_utils.pycover ties, invalid-k, per-metric vs scalar tolerances, missing keys, label propagation, multi-failure aggregation, extra-observed-key tolerance, NaN/inf failure modes, and the tworelease_qualitymirror tests.[notebooks]extra inpyproject.toml(nbclient,nbformat,scikit-learn,matplotlib) keeps the dev install lean. Newnotebooksjob in.github/workflows/ci.ymlinstalls[dev,scripts,notebooks], regenerates the intermediate public bundle viapython scripts/build_public_release.py release --tier intermediate(only tier the notebooks need; new flag), and runstests/release/notebooks/test_execute_notebooks.py— a parametrised nbclient end-to-end execution gated on bundles-present (matchestest_package_*skip pattern).scripts/build_release_notebook_{01,02}.py(sharingscripts/_release_notebook_common.py) are nbformat-driven one-shot builders so notebook source stays reviewable in the diff (instead of hand-edited JSON). Cell IDs are deterministic (cell_NNN) so re-running the builder produces a byte-identical.ipynb. Each builder accepts--out PATHso the byte-stability test can build intotmp_pathrather than mutate the committed file. Audit-artifact-sync is enforced bytests/scripts/test_release_notebook_builders.py, which builds twice intotmp_pathand diffs against the committed file (no working-tree mutation, parallel-safe under pytest-xdist). Each builder shells out toruff formaton its emitted file so builder output and the project's pre-commit hook agree byte-for-byte.Per-file ruff ignore for
S101onrelease/notebooks/**/*.ipynb(G13 contractasserts are deliberate) and forE501onscripts/build_release_notebook_*.py(dedented heredocs render as notebook content; 100c is the wrong yardstick).Revisions
account_densityfit on train leads only, notebook 02 metric gate, per-metric tolerances, JSON-loaded targets +top_decile_rateswap, audit-sync test,precision_at_kmirror test, builder de-duplication, industry redundancy fix,--tierfilter onbuild_public_release.py, builder rename (drop_prefix),.values→.to_numpy(), broken cross-references rewritten as plain text, the agent-plan + PR description rewritten to actually match the code.--out PATHso the byte-stability test never touches the committed file (race-safe and worktree-safe under interrupted runs);assert_within_tolerancerejects non-finite values and incomplete per-metric tolerance maps instead of silently passing; newtop_decile_ratemirror test (parallel toprecision_at_k); 4 new tests covering NaN/inf observed, NaN target, and missing-tolerance-key failure modes; agent-plan + PR body brought back into sync (this rewrite).Acceptance gate results on this branch (rev-3 head)
ruff check ./ruff format --check .— cleanmypy leadforge/— clean (82 source files)pytest— 1255 passed, 5 skipped (publish-extra-gated HF/Kaggle round-trips)python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65— 0/3 paths flag rate on every tierpython scripts/verify_hash_determinism.py— PASS 67/67python scripts/validate_release_candidate.py --no-rebuild— PASS, 0 leakage findingsBUNDLE_SCHEMA_VERSION— unchanged at 5 (no bundle-shape changes in this PR)Test plan
notebooksjob runs the new gate end-to-end (regenerate intermediate bundle via--tier intermediate→ execute both notebooks → tolerance assertions hold).pip install -e ".[notebooks]"works on a clean Python 3.11+ env.release/notebooks/cwd; "Run All" succeeds; output cells match committed metrics within tolerance.revisionsabove).🤖 Generated with Claude Code