PR 6.2: notebooks 03 (leakage + time windows) + 04 (lift / calibration / value / cohort) by shaypal5 · Pull Request #74 · leadforge-dev/leadforge

shaypal5 · 2026-05-07T09:08:25Z

Summary

Notebook 03 — leakage and time windows: turns the documented total_touches_all trap into a teaching moment. Reads the trap label off the feature dictionary; proves the trap by construction via a same-table comparison of total_touches_all (full-horizon) vs touch_count (snapshot-safe) — mean post-snapshot delta 3.2 touches/lead, 82 % of leads positive. Standalone-AUC probe (~0.53, looks innocuous) ⇒ full-panel ± trap ablation showing HistGBM extracts +0.032 AUC where LR only squeezes +0.009. Pedagogical headline: standalone AUC probes undersell tree-friendly leakage.
Notebook 04 — lift / calibration / value-aware ranking / cohort / bootstrap: four ranking lenses AUC alone misses. Reliability diagram, lift + cumulative gains, expected_acv × P(convert) value-aware ranking (top-50 ACV-capture jumps 0.16 → 0.40 vs P-only), threshold selection for fixed top-K capacity, cohort-shift evaluation that reproduces the validation report's cohort_shift.intermediate block exactly (0.8754 / 0.8908 / −0.0155), and a 200-iter bootstrap of the test set as the within-bundle confidence band — explicitly framed as the proxy for true cross-seed sweep, since public-bundle consumers can't rebuild bundles without leadforge installed.
Audit-sync wiring + builder pattern: new cohort_shift.intermediate block in release/notebooks/_release_targets.json, new test_cohort_shift_targets_match_validation_report extension to the existing audit-sync test, both new builders inherit PR 6.1's deterministic-cell-ID + --out byte-stability pattern, both new notebooks added to the byte-stability and end-to-end execution test parametrisations.

Design decisions worth flagging

Notebook 03 framing pivot. The brief proposed an "engagement-only / firmographic-only feature panel ± trap" pedagogy on the assumption that the trap would dominate a thin set. Empirical AUCs say firmographic-only is at chance even with the trap (0.49–0.53), so the pedagogy doesn't hold. The reframed lesson — GBM extracts more from the trap than its standalone AUC predicts; LR doesn't — is honest, comes from the data, and produces a meaningful sign-aware tolerance gate (gbm_lift > 0.015).

Notebook 04 trap posture. Headline LR/GBM panel drops the trap (matches notebook 02; gives honest production numbers). Cohort-shift section keeps the trap to reproduce the report's published cohort-shift numbers exactly. Both postures are explained inline in their respective sections.

Bootstrap, not seed sweep. Resamples test-set indices 200× and recomputes already-scored AUC/AP — sub-second runtime, captures sampling noise on a single generated world but not generation-process noise across seeds. PR 6.1 forward-pointed to a "seed-sweep harness" but a true cross-seed sweep would require regenerating bundles, which Kaggle/HF consumers can't do; bootstrap is the honest within-bundle substitute and the notebook says so.

Cohort-shift reproduction. First implementation used COHORT_TRAIN_FRAC=0.7 and random_state=42; got cohort_split_auc=0.8854 (close but not equal to the report's 0.8908). Self-review caught the validator uses COHORT_TRAIN_FRAC=0.85 and model_random_state=0; switching matched the report to four decimal places.

Acceptance gates (from `docs/release/v1_acceptance_gates.md`)

G13.1 — both notebooks execute top-to-bottom from a clean kernel in <10 s each (well under the 3-min budget).
G13.2 — per-metric tolerances (AUC/Brier ±0.02, AP / top-decile ±0.05); cohort-shift values audit-synced against the validation report; bootstrap medians pinned inline against seed-42 single-run values.
G13.3 — both notebooks assert manifest["exposure_mode"] == "student_public" and load only from release/intermediate/.

Test plan

ruff check . → all checks passed
ruff format --check . → 191 files already formatted
mypy leadforge/ → no issues found in 82 source files
pytest -x -q → 1260 passed, 5 publish-extra-gated skips, 1m58s
pytest tests/scripts/test_release_notebook_builders.py tests/release/notebooks/ → 28 passed (byte-stability + nbclient execution + targets-match-report audit-sync)
python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65 → 0/3 paths flag rate on every tier
python scripts/verify_hash_determinism.py → PASS 67/67
python scripts/validate_release_candidate.py --no-rebuild → exits 0; rewrote validation_report.{json,md} timestamps which were reverted before commit
BUNDLE_SCHEMA_VERSION unchanged at 5

Out of scope (PR 6.3)

docs/release/break_me_guide.md — forward-pointed to as plain backtick-wrapped text in both notebooks (no dead Markdown link).
.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml
docs/release/v2_decision_log.md
release/README.md link update

🤖 Generated with Claude Code

…n / value / cohort) + bootstrap robustness harness Adds the second pair of release notebooks: a leakage walkthrough that turns the documented `total_touches_all` trap into a teachable contrast, and a value-aware ranking notebook that covers calibration, lift, expected_acv-aware top-K policies, threshold selection, cohort shift, and a within-bundle bootstrap. Notebook 03 — leakage and time windows - Reads the trap label off `feature_dictionary.csv`; proves the trap by construction via a same-table comparison of `total_touches_all` (full-horizon) vs `touch_count` (snapshot-safe). Mean post-snapshot delta is 3.2 touches/lead; 82 % of leads have a positive delta. - Standalone-AUC probe on the trap (~0.53 — looks innocuous) ⇒ full-panel ± trap ablation showing HistGBM extracts +0.032 AUC where LR only squeezes +0.009. Pedagogical headline: *standalone AUC probes undersell tree-friendly leakage*. - Sign-aware tolerance gate pins each AUC ±0.02 and asserts `gbm_lift > 0.015` so a future regeneration that erases or amplifies the trap breaks CI. Notebook 04 — lift, calibration, value, cohort, bootstrap - Calibration / reliability diagram (LR max bin error ≈ 0.13). - Lift @ 1/5/10 % + cumulative gains curve. - `expected_acv × P(convert)` value-aware ranking: top-50 ACV-capture jumps from 0.16 to 0.40 vs P-only ranking. - Threshold selection for fixed top-K capacity (e.g. 50/week). - Cohort-shift evaluation that **reproduces the validation report's `cohort_shift.intermediate` block exactly** (0.8754 / 0.8908 / −0.0155). Mirrors `release_quality.measure_cohort_ shift_from_bundle` down to `COHORT_TRAIN_FRAC=0.85` and `model_random_state=0`. - 200-iter bootstrap of the test set as the within-bundle confidence band — explicitly framed as the proxy for true cross-seed sweep, since public-bundle consumers can't rebuild bundles without `leadforge` installed. - Headline LR/GBM panel drops the trap (matches notebook 02); cohort-shift section keeps the trap to reproduce the report. Audit-sync wiring - `release/notebooks/_release_targets.json` gains a `cohort_shift.intermediate` block sourced from `validation_report.cohort_shift.intermediate`. - `tests/release/notebooks/test_release_targets_match_report.py` adds `test_cohort_shift_targets_match_validation_report` to audit-sync the new block. Existing test now skips the `cohort_shift` key during the per-tier-medians loop. Builders + tests - `scripts/build_release_notebook_{03,04}.py` inherit the deterministic-cell-ID + `--out` byte-stability pattern from PR 6.1. - Added to `_BUILDERS` / `_NOTEBOOKS` in `tests/scripts/test_release_notebook_builders.py` and `tests/release/notebooks/test_execute_notebooks.py`. - Both notebooks execute end-to-end in <10s each (well under G13.1's 3-min budget), assert `manifest.exposure_mode == "student_public"` (G13.3), and load only from `release/intermediate/`. Forward-pointer to `docs/release/break_me_guide.md` left as plain backtick-wrapped text; file lands in PR 6.3. `.agent-plan.md` Phase 6 entry updated to mark PR 6.2 complete. Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Adds the next two release notebooks (03 and 04) to the v1 educational notebook sequence, along with builder scripts and CI gates to keep the notebooks deterministic, executable, and audit-synced to the validation report.

Changes:

Add Notebook 03 (leakage + time windows) and Notebook 04 (lift/calibration/value ranking/cohort shift/bootstrap), plus their deterministic builders.
Extend notebook CI coverage: execute the new notebooks end-to-end and include their builders in byte-stability checks.
Add a cohort_shift.intermediate targets block and a dedicated audit-sync test to ensure it matches validation_report.json.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`tests/scripts/test_release_notebook_builders.py`	Adds notebook 03/04 builders to the byte-stability + “matches committed” gate.
`tests/release/notebooks/test_release_targets_match_report.py`	Skips `cohort_shift` in the tier-median loop and adds a dedicated cohort-shift audit-sync test.
`tests/release/notebooks/test_execute_notebooks.py`	Adds notebook 03/04 to the nbclient execution matrix.
`scripts/build_release_notebook_03.py`	New deterministic builder that generates notebook 03 content.
`scripts/build_release_notebook_04.py`	New deterministic builder that generates notebook 04 content and pins metrics/tolerances.
`release/notebooks/03_leakage_and_time_windows.ipynb`	Generated notebook 03 committed artifact.
`release/notebooks/04_lift_calibration_value_ranking.ipynb`	Generated notebook 04 committed artifact.
`release/notebooks/_release_targets.json`	Adds `cohort_shift.intermediate` pinned values for notebook 04’s cohort-shift reproduction gate.
`.agent-plan.md`	Marks Phase 6.2 notebook work as complete and documents the delivered content.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…es, kill dead code Brutal-reviewer pass on PR 6.2 caught real issues. Fixes folded back in before requesting human review. Narrative contradictions / wrong numbers (must-fix) - NB04 §7 opened with "train on the first 70 % chronologically" but the code uses 85/15 (matching the validator's COHORT_TRAIN_FRAC=0.85). Whiplash inside one section. Rewrote the opening to use 85/15 from the start and tightened the explanation of why random_state=0 also matters. - NB04 §10 summary cited max bin error ≈ 0.19 and top-decile lift ≈ 2.6×. Both were copied from with-trap report values; the headline panel drops the trap, and the actual observations are 0.13 and 2.75×. Corrected, with a one-line note distinguishing the trap-dropped vs trap-kept numbers. - NB03 §4 cited the report's standalone-AUC of "0.55" and then observed 0.53 without explaining the gap. Added a sentence pinning the methodological difference: the report fits a full LR pipeline on the single column; the notebook uses the raw column as a score directly. Both fall in "barely above chance"; the difference is real and now owned. Tolerance gates that didn't gate the claim (should-fix) - NB03 trap_standalone_auc was pinned to a ±0.02 band around 0.531 — which silently allows the trap to drop to 0.49 (below chance) while the test still passes, breaking section 5's setup. Added a sign-aware ``assert standalone[TRAP] > 0.51``. - NB03 §3 ``mean_delta > 0`` was performative (passed if even one lead had a positive delta). Hardened to ``mean_delta > 1.0`` AND ``n_post / len(window) > 0.5``, sitting well below the observed 3.22 / 82 % but well above "barely positive." - NB04 §5 value-aware ranking guard was ``gain >= -0.01`` (allows ranking to *lose* ground by up to 0.01). Strengthened to ``gain > MIN_VALUE_GAIN = 0.05`` so a regression that erodes the P×ACV story actually fails CI. - Audit-sync test for the cohort_shift block had ``if not cohort_targets: return`` — silently passed if someone deleted or renamed the block. Made the block required; if notebook 04 ever stops needing it, the test should be deleted, not bypassed. Code quality (should-fix) - NB04 ``acv_capture(scores, k)`` re-sorted the score array on every call (60+ calls × O(N log N) per call). Pre-compute the argsort + cumsum once; every plot point is now a cheap cumulative-array lookup. Function signature changed from ``(scores, k)`` to ``(use_value: bool, k)`` and the §9 tolerance-gate call sites were updated to match. - NB04 §6 narrative promised "precision / recall / count above threshold" but the plots showed only count and precision. Added a third panel for recall above threshold so the prose matches the figure. - NB03 dead code: ``with_trap_p100`` / ``without_trap_p100`` keys were computed via ``precision_at_k(...)`` but never read downstream. Removed both keys and the now-unused ``precision_at_k`` import. - NB03 §4 ``standalone_window = window.dropna(...).copy()`` was a no-op (window was already dropped of NaN in §3). Reuse window directly with an inline comment about why. - NB03 §1 ``HORIZON_DAYS`` was read from the manifest and printed but never used in the narrative. Wove it into the setup print so the 60-day post-snapshot hunting window is explicit before §3 measures it. - NB04 §5 plot title ``"Value-aware ranking captures more revenue per outreach slot"`` was a thesis used as a label. Made it neutral ("ACV capture vs top-K ..."); the conclusion stays in the narrative. All 28 notebook builder + execution + audit-sync tests still pass; 1260/1260 full-suite tests pass; ruff + mypy clean; both notebooks execute end-to-end in <10 s each. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot's review of the PR 6.2 self-review surfaced four still-real findings (two others were marked outdated — already addressed by the self-review pass). Folded all four back in before requesting human review. - (COPILOT-1) Grammar: "reflect a honest production model" → "reflect an honest production model" in NB04 §2 narrative. - (COPILOT-3) The threshold-selection comment claimed the K-th highest probability + ``mask = lr_probs >= threshold`` admits *exactly* K leads, "ties resolved by score order". The inclusive comparison can admit more than K when leads share the threshold's probability — and there is no implicit tie-break. Rewrote the comment to be honest about the semantics ("admits AT LEAST K via probs >= threshold; ties at the threshold can inflate the slate; ``actually_above`` makes the realised count visible"). Kept the threshold-based selection rather than switching to a true top-K via ``argsort`` because the pedagogical point of section 6 is *threshold selection*, not *rank cutoff*. - (COPILOT-4) Bootstrap loop comment said "Degenerate resample — re-roll" but the implementation writes NaN and continues. Rewrote the comment to match what the code does (mark NaN, let ``_summary`` filter it out) and added the actual probability bound — with n_test=750 and base rate ~22 %, the all-positive or all-negative draw probability is ~10⁻¹⁰⁰, so the branch is dead in practice and exists only as a defensive safety net for tiny test sets. Implementing a real re-roll loop would never execute on this dataset. - (COPILOT-6) The top-level ``_doc`` in ``release/notebooks/_release_targets.json`` claimed the file contains only "cross-seed-median metric values", which became inaccurate after the PR 6.2 cohort_shift block (single-seed, seed 42). Rewrote the docstring to call out the mixed structure: per-tier blocks hold cross-seed medians; cohort_shift block holds single-seed values from ``validation_report.cohort_shift``. Audit-sync test continues to enforce both cases via separate loops. - (COPILOT-2, COPILOT-5) outdated; already addressed by the self-review fix-up commit (25b9ec1) — the 70/30-vs-85/15 narrative whiplash and the wrong summary numbers (0.19 / 2.6× vs the actual 0.13 / 2.75×) are both fixed there. Resolved as already-treated. Net: 28/28 notebook builder + execution + audit-sync tests pass; ruff + mypy clean; both notebooks still execute end-to-end in <10 s each. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-07T10:46:49Z

pr-agent-context report:

This run includes unresolved review comments on PR #74 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: scripts/build_release_notebook_04.py
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276155
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    Grammar: “reflect a honest production model” should be “reflect an honest production model”.

## COPILOT-2
Location: scripts/build_release_notebook_04.py
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276229
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The cohort-shift description says “train on the first 70% … score the last 30%”, but immediately below (and in the code) the split is 85%/15% (COHORT_TRAIN_FRAC=0.85). Please update the prose so it matches the implemented/validated split to avoid confusing readers and to keep the “mirrors the validator” claim precise.

## COPILOT-3
Location: scripts/build_release_notebook_04.py:433
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276269
Root author: copilot-pull-request-reviewer

Comment:
    The comment says the K-th highest probability threshold “admits exactly K leads (ties resolved by score order)”, but using `mask = lr_probs >= threshold` can admit more than K when there are ties at the threshold, and there’s no explicit tie-break. Either adjust the comment/prose, or implement a true top-K selection (e.g., via argsort indices) if you want exactly CAPACITY leads.

## COPILOT-4
Location: scripts/build_release_notebook_04.py:664
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276306
Root author: copilot-pull-request-reviewer

Comment:
    Bootstrap loop: the code comment says “Degenerate resample — re-roll”, but the implementation writes NaNs and continues. Since the arrays are later used for quantiles/histograms, it’d be more robust to actually re-draw indices until non-degenerate (or explicitly filter finite values everywhere and handle the all-NaN case).

## COPILOT-5
Location: scripts/build_release_notebook_04.py
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276331
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    Summary numbers appear inconsistent with the pinned/printed metrics above (e.g., max bin error is pinned at 0.1344 in NB04_TARGETS and lift@10% is ~2.75, but the summary says ≈0.19 and ~2.6×). Please update the summary text so it reflects the actual computed values for this notebook run.

## COPILOT-6
Location: release/notebooks/_release_targets.json
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276360
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The top-level `_doc` says this file contains only “Cross-seed-median metric values”, but the new `cohort_shift` block is explicitly single-seed (seed 42). Consider updating the top-level docstring to reflect that the file now contains a mix of cross-seed medians (per-tier metrics) and single-seed cohort-shift metrics, so downstream readers don’t assume everything here is a median.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25491133072 attempt 1
Comment timestamp: 2026-05-07T10:46:00.887507+00:00
PR head commit: 397feb544b3314f235a58cd5bb73bed480ba652f

Copilot AI review requested due to automatic review settings May 7, 2026 09:08

shaypal5 added this to the dataset: leadforge-lead-scoring-v1 milestone May 7, 2026

shaypal5 added type: feature New capability layer: recipes recipes/ recipe assets and registry labels May 7, 2026

Copilot started reviewing on behalf of shaypal5 May 7, 2026 09:09 View session

This comment has been minimized.

Sign in to view

Copilot AI reviewed May 7, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

shaypal5 merged commit 43b75b4 into main May 7, 2026
10 of 17 checks passed

shaypal5 deleted the feat/notebooks-leakage-and-value-ranking branch May 7, 2026 12:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PR 6.2: notebooks 03 (leakage + time windows) + 04 (lift / calibration / value / cohort)#74

PR 6.2: notebooks 03 (leakage + time windows) + 04 (lift / calibration / value / cohort)#74
shaypal5 merged 3 commits into
mainfrom
feat/notebooks-leakage-and-value-ranking

shaypal5 commented May 7, 2026

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

github-actions Bot commented May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shaypal5 commented May 7, 2026

Summary

Design decisions worth flagging

Acceptance gates (from docs/release/v1_acceptance_gates.md)

Test plan

Out of scope (PR 6.3)

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

github-actions Bot commented May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Acceptance gates (from `docs/release/v1_acceptance_gates.md`)