Skip to content

PR 6.2: notebooks 03 (leakage + time windows) + 04 (lift / calibration / value / cohort)#74

Merged
shaypal5 merged 3 commits into
mainfrom
feat/notebooks-leakage-and-value-ranking
May 7, 2026
Merged

PR 6.2: notebooks 03 (leakage + time windows) + 04 (lift / calibration / value / cohort)#74
shaypal5 merged 3 commits into
mainfrom
feat/notebooks-leakage-and-value-ranking

Conversation

@shaypal5

@shaypal5 shaypal5 commented May 7, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Notebook 03 — leakage and time windows: turns the documented total_touches_all trap into a teaching moment. Reads the trap label off the feature dictionary; proves the trap by construction via a same-table comparison of total_touches_all (full-horizon) vs touch_count (snapshot-safe) — mean post-snapshot delta 3.2 touches/lead, 82 % of leads positive. Standalone-AUC probe (~0.53, looks innocuous) ⇒ full-panel ± trap ablation showing HistGBM extracts +0.032 AUC where LR only squeezes +0.009. Pedagogical headline: standalone AUC probes undersell tree-friendly leakage.
  • Notebook 04 — lift / calibration / value-aware ranking / cohort / bootstrap: four ranking lenses AUC alone misses. Reliability diagram, lift + cumulative gains, expected_acv × P(convert) value-aware ranking (top-50 ACV-capture jumps 0.16 → 0.40 vs P-only), threshold selection for fixed top-K capacity, cohort-shift evaluation that reproduces the validation report's cohort_shift.intermediate block exactly (0.8754 / 0.8908 / −0.0155), and a 200-iter bootstrap of the test set as the within-bundle confidence band — explicitly framed as the proxy for true cross-seed sweep, since public-bundle consumers can't rebuild bundles without leadforge installed.
  • Audit-sync wiring + builder pattern: new cohort_shift.intermediate block in release/notebooks/_release_targets.json, new test_cohort_shift_targets_match_validation_report extension to the existing audit-sync test, both new builders inherit PR 6.1's deterministic-cell-ID + --out byte-stability pattern, both new notebooks added to the byte-stability and end-to-end execution test parametrisations.

Design decisions worth flagging

Notebook 03 framing pivot. The brief proposed an "engagement-only / firmographic-only feature panel ± trap" pedagogy on the assumption that the trap would dominate a thin set. Empirical AUCs say firmographic-only is at chance even with the trap (0.49–0.53), so the pedagogy doesn't hold. The reframed lesson — GBM extracts more from the trap than its standalone AUC predicts; LR doesn't — is honest, comes from the data, and produces a meaningful sign-aware tolerance gate (gbm_lift > 0.015).

Notebook 04 trap posture. Headline LR/GBM panel drops the trap (matches notebook 02; gives honest production numbers). Cohort-shift section keeps the trap to reproduce the report's published cohort-shift numbers exactly. Both postures are explained inline in their respective sections.

Bootstrap, not seed sweep. Resamples test-set indices 200× and recomputes already-scored AUC/AP — sub-second runtime, captures sampling noise on a single generated world but not generation-process noise across seeds. PR 6.1 forward-pointed to a "seed-sweep harness" but a true cross-seed sweep would require regenerating bundles, which Kaggle/HF consumers can't do; bootstrap is the honest within-bundle substitute and the notebook says so.

Cohort-shift reproduction. First implementation used COHORT_TRAIN_FRAC=0.7 and random_state=42; got cohort_split_auc=0.8854 (close but not equal to the report's 0.8908). Self-review caught the validator uses COHORT_TRAIN_FRAC=0.85 and model_random_state=0; switching matched the report to four decimal places.

Acceptance gates (from docs/release/v1_acceptance_gates.md)

  • G13.1 — both notebooks execute top-to-bottom from a clean kernel in <10 s each (well under the 3-min budget).
  • G13.2 — per-metric tolerances (AUC/Brier ±0.02, AP / top-decile ±0.05); cohort-shift values audit-synced against the validation report; bootstrap medians pinned inline against seed-42 single-run values.
  • G13.3 — both notebooks assert manifest["exposure_mode"] == "student_public" and load only from release/intermediate/.

Test plan

  • ruff check . → all checks passed
  • ruff format --check . → 191 files already formatted
  • mypy leadforge/ → no issues found in 82 source files
  • pytest -x -q → 1260 passed, 5 publish-extra-gated skips, 1m58s
  • pytest tests/scripts/test_release_notebook_builders.py tests/release/notebooks/ → 28 passed (byte-stability + nbclient execution + targets-match-report audit-sync)
  • python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65 → 0/3 paths flag rate on every tier
  • python scripts/verify_hash_determinism.py → PASS 67/67
  • python scripts/validate_release_candidate.py --no-rebuild → exits 0; rewrote validation_report.{json,md} timestamps which were reverted before commit
  • BUNDLE_SCHEMA_VERSION unchanged at 5

Out of scope (PR 6.3)

  • docs/release/break_me_guide.md — forward-pointed to as plain backtick-wrapped text in both notebooks (no dead Markdown link).
  • .github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml
  • docs/release/v2_decision_log.md
  • release/README.md link update

🤖 Generated with Claude Code

…n / value / cohort) + bootstrap robustness harness

Adds the second pair of release notebooks: a leakage walkthrough that
turns the documented `total_touches_all` trap into a teachable
contrast, and a value-aware ranking notebook that covers calibration,
lift, expected_acv-aware top-K policies, threshold selection, cohort
shift, and a within-bundle bootstrap.

Notebook 03 — leakage and time windows
  - Reads the trap label off `feature_dictionary.csv`; proves the
    trap by construction via a same-table comparison of
    `total_touches_all` (full-horizon) vs `touch_count`
    (snapshot-safe). Mean post-snapshot delta is 3.2 touches/lead;
    82 % of leads have a positive delta.
  - Standalone-AUC probe on the trap (~0.53 — looks innocuous) ⇒
    full-panel ± trap ablation showing HistGBM extracts +0.032 AUC
    where LR only squeezes +0.009. Pedagogical headline:
    *standalone AUC probes undersell tree-friendly leakage*.
  - Sign-aware tolerance gate pins each AUC ±0.02 and asserts
    `gbm_lift > 0.015` so a future regeneration that erases or
    amplifies the trap breaks CI.

Notebook 04 — lift, calibration, value, cohort, bootstrap
  - Calibration / reliability diagram (LR max bin error ≈ 0.13).
  - Lift @ 1/5/10 % + cumulative gains curve.
  - `expected_acv × P(convert)` value-aware ranking: top-50
    ACV-capture jumps from 0.16 to 0.40 vs P-only ranking.
  - Threshold selection for fixed top-K capacity (e.g. 50/week).
  - Cohort-shift evaluation that **reproduces the validation
    report's `cohort_shift.intermediate` block exactly** (0.8754 /
    0.8908 / −0.0155). Mirrors `release_quality.measure_cohort_
    shift_from_bundle` down to `COHORT_TRAIN_FRAC=0.85` and
    `model_random_state=0`.
  - 200-iter bootstrap of the test set as the within-bundle
    confidence band — explicitly framed as the proxy for true
    cross-seed sweep, since public-bundle consumers can't rebuild
    bundles without `leadforge` installed.
  - Headline LR/GBM panel drops the trap (matches notebook 02);
    cohort-shift section keeps the trap to reproduce the report.

Audit-sync wiring
  - `release/notebooks/_release_targets.json` gains a
    `cohort_shift.intermediate` block sourced from
    `validation_report.cohort_shift.intermediate`.
  - `tests/release/notebooks/test_release_targets_match_report.py`
    adds `test_cohort_shift_targets_match_validation_report` to
    audit-sync the new block. Existing test now skips the
    `cohort_shift` key during the per-tier-medians loop.

Builders + tests
  - `scripts/build_release_notebook_{03,04}.py` inherit the
    deterministic-cell-ID + `--out` byte-stability pattern from
    PR 6.1.
  - Added to `_BUILDERS` / `_NOTEBOOKS` in
    `tests/scripts/test_release_notebook_builders.py` and
    `tests/release/notebooks/test_execute_notebooks.py`.
  - Both notebooks execute end-to-end in <10s each (well under
    G13.1's 3-min budget), assert
    `manifest.exposure_mode == "student_public"` (G13.3), and
    load only from `release/intermediate/`.

Forward-pointer to `docs/release/break_me_guide.md` left as plain
backtick-wrapped text; file lands in PR 6.3.

`.agent-plan.md` Phase 6 entry updated to mark PR 6.2 complete.

Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff +
mypy clean; leakage probes 0/3 on every tier; hash determinism
PASS 67/67; `validate_release_candidate --no-rebuild` exits 0;
`BUNDLE_SCHEMA_VERSION` unchanged at 5.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 7, 2026 09:08
@shaypal5 shaypal5 added type: feature New capability layer: recipes recipes/ recipe assets and registry labels May 7, 2026
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the next two release notebooks (03 and 04) to the v1 educational notebook sequence, along with builder scripts and CI gates to keep the notebooks deterministic, executable, and audit-synced to the validation report.

Changes:

  • Add Notebook 03 (leakage + time windows) and Notebook 04 (lift/calibration/value ranking/cohort shift/bootstrap), plus their deterministic builders.
  • Extend notebook CI coverage: execute the new notebooks end-to-end and include their builders in byte-stability checks.
  • Add a cohort_shift.intermediate targets block and a dedicated audit-sync test to ensure it matches validation_report.json.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/scripts/test_release_notebook_builders.py Adds notebook 03/04 builders to the byte-stability + “matches committed” gate.
tests/release/notebooks/test_release_targets_match_report.py Skips cohort_shift in the tier-median loop and adds a dedicated cohort-shift audit-sync test.
tests/release/notebooks/test_execute_notebooks.py Adds notebook 03/04 to the nbclient execution matrix.
scripts/build_release_notebook_03.py New deterministic builder that generates notebook 03 content.
scripts/build_release_notebook_04.py New deterministic builder that generates notebook 04 content and pins metrics/tolerances.
release/notebooks/03_leakage_and_time_windows.ipynb Generated notebook 03 committed artifact.
release/notebooks/04_lift_calibration_value_ranking.ipynb Generated notebook 04 committed artifact.
release/notebooks/_release_targets.json Adds cohort_shift.intermediate pinned values for notebook 04’s cohort-shift reproduction gate.
.agent-plan.md Marks Phase 6.2 notebook work as complete and documents the delivered content.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/build_release_notebook_04.py Outdated
Comment thread scripts/build_release_notebook_04.py Outdated
Comment thread scripts/build_release_notebook_04.py
Comment thread scripts/build_release_notebook_04.py
Comment thread scripts/build_release_notebook_04.py Outdated
Comment thread release/notebooks/_release_targets.json Outdated
…es, kill dead code

Brutal-reviewer pass on PR 6.2 caught real issues. Fixes folded back
in before requesting human review.

Narrative contradictions / wrong numbers (must-fix)
  - NB04 §7 opened with "train on the first 70 % chronologically"
    but the code uses 85/15 (matching the validator's
    COHORT_TRAIN_FRAC=0.85). Whiplash inside one section. Rewrote
    the opening to use 85/15 from the start and tightened the
    explanation of why random_state=0 also matters.
  - NB04 §10 summary cited max bin error ≈ 0.19 and top-decile
    lift ≈ 2.6×. Both were copied from with-trap report values; the
    headline panel drops the trap, and the actual observations are
    0.13 and 2.75×. Corrected, with a one-line note distinguishing
    the trap-dropped vs trap-kept numbers.
  - NB03 §4 cited the report's standalone-AUC of "0.55" and then
    observed 0.53 without explaining the gap. Added a sentence
    pinning the methodological difference: the report fits a full
    LR pipeline on the single column; the notebook uses the raw
    column as a score directly. Both fall in "barely above
    chance"; the difference is real and now owned.

Tolerance gates that didn't gate the claim (should-fix)
  - NB03 trap_standalone_auc was pinned to a ±0.02 band around
    0.531 — which silently allows the trap to drop to 0.49 (below
    chance) while the test still passes, breaking section 5's
    setup. Added a sign-aware ``assert standalone[TRAP] > 0.51``.
  - NB03 §3 ``mean_delta > 0`` was performative (passed if even
    one lead had a positive delta). Hardened to
    ``mean_delta > 1.0`` AND ``n_post / len(window) > 0.5``,
    sitting well below the observed 3.22 / 82 % but well above
    "barely positive."
  - NB04 §5 value-aware ranking guard was ``gain >= -0.01``
    (allows ranking to *lose* ground by up to 0.01). Strengthened
    to ``gain > MIN_VALUE_GAIN = 0.05`` so a regression that
    erodes the P×ACV story actually fails CI.
  - Audit-sync test for the cohort_shift block had
    ``if not cohort_targets: return`` — silently passed if
    someone deleted or renamed the block. Made the block
    required; if notebook 04 ever stops needing it, the test
    should be deleted, not bypassed.

Code quality (should-fix)
  - NB04 ``acv_capture(scores, k)`` re-sorted the score array on
    every call (60+ calls × O(N log N) per call). Pre-compute the
    argsort + cumsum once; every plot point is now a cheap
    cumulative-array lookup. Function signature changed from
    ``(scores, k)`` to ``(use_value: bool, k)`` and the §9
    tolerance-gate call sites were updated to match.
  - NB04 §6 narrative promised "precision / recall / count above
    threshold" but the plots showed only count and precision.
    Added a third panel for recall above threshold so the prose
    matches the figure.
  - NB03 dead code: ``with_trap_p100`` / ``without_trap_p100``
    keys were computed via ``precision_at_k(...)`` but never
    read downstream. Removed both keys and the now-unused
    ``precision_at_k`` import.
  - NB03 §4 ``standalone_window = window.dropna(...).copy()``
    was a no-op (window was already dropped of NaN in §3).
    Reuse window directly with an inline comment about why.
  - NB03 §1 ``HORIZON_DAYS`` was read from the manifest and
    printed but never used in the narrative. Wove it into the
    setup print so the 60-day post-snapshot hunting window is
    explicit before §3 measures it.
  - NB04 §5 plot title ``"Value-aware ranking captures more
    revenue per outreach slot"`` was a thesis used as a label.
    Made it neutral ("ACV capture vs top-K ..."); the conclusion
    stays in the narrative.

All 28 notebook builder + execution + audit-sync tests still pass;
1260/1260 full-suite tests pass; ruff + mypy clean; both notebooks
execute end-to-end in <10 s each.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

Copilot's review of the PR 6.2 self-review surfaced four still-real
findings (two others were marked outdated — already addressed by the
self-review pass).  Folded all four back in before requesting human
review.

  - (COPILOT-1) Grammar: "reflect a honest production model" →
    "reflect an honest production model" in NB04 §2 narrative.
  - (COPILOT-3) The threshold-selection comment claimed the K-th
    highest probability + ``mask = lr_probs >= threshold`` admits
    *exactly* K leads, "ties resolved by score order".  The
    inclusive comparison can admit more than K when leads share the
    threshold's probability — and there is no implicit tie-break.
    Rewrote the comment to be honest about the semantics ("admits
    AT LEAST K via probs >= threshold; ties at the threshold can
    inflate the slate; ``actually_above`` makes the realised count
    visible").  Kept the threshold-based selection rather than
    switching to a true top-K via ``argsort`` because the
    pedagogical point of section 6 is *threshold selection*, not
    *rank cutoff*.
  - (COPILOT-4) Bootstrap loop comment said "Degenerate resample —
    re-roll" but the implementation writes NaN and continues.
    Rewrote the comment to match what the code does (mark NaN, let
    ``_summary`` filter it out) and added the actual probability
    bound — with n_test=750 and base rate ~22 %, the all-positive
    or all-negative draw probability is ~10⁻¹⁰⁰, so the branch is
    dead in practice and exists only as a defensive safety net for
    tiny test sets.  Implementing a real re-roll loop would never
    execute on this dataset.
  - (COPILOT-6) The top-level ``_doc`` in
    ``release/notebooks/_release_targets.json`` claimed the file
    contains only "cross-seed-median metric values", which became
    inaccurate after the PR 6.2 cohort_shift block (single-seed,
    seed 42).  Rewrote the docstring to call out the mixed
    structure: per-tier blocks hold cross-seed medians; cohort_shift
    block holds single-seed values from
    ``validation_report.cohort_shift``.  Audit-sync test continues
    to enforce both cases via separate loops.

  - (COPILOT-2, COPILOT-5) outdated; already addressed by the
    self-review fix-up commit (25b9ec1) — the 70/30-vs-85/15
    narrative whiplash and the wrong summary numbers (0.19 / 2.6×
    vs the actual 0.13 / 2.75×) are both fixed there.  Resolved as
    already-treated.

Net: 28/28 notebook builder + execution + audit-sync tests pass;
ruff + mypy clean; both notebooks still execute end-to-end in
<10 s each.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 7, 2026

Copy link
Copy Markdown

pr-agent-context report:

This run includes unresolved review comments on PR #74 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: scripts/build_release_notebook_04.py
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276155
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    Grammar: “reflect a honest production model” should be “reflect an honest production model”.

## COPILOT-2
Location: scripts/build_release_notebook_04.py
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276229
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The cohort-shift description says “train on the first 70% … score the last 30%”, but immediately below (and in the code) the split is 85%/15% (COHORT_TRAIN_FRAC=0.85). Please update the prose so it matches the implemented/validated split to avoid confusing readers and to keep the “mirrors the validator” claim precise.

## COPILOT-3
Location: scripts/build_release_notebook_04.py:433
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276269
Root author: copilot-pull-request-reviewer

Comment:
    The comment says the K-th highest probability threshold “admits exactly K leads (ties resolved by score order)”, but using `mask = lr_probs >= threshold` can admit more than K when there are ties at the threshold, and there’s no explicit tie-break. Either adjust the comment/prose, or implement a true top-K selection (e.g., via argsort indices) if you want exactly CAPACITY leads.

## COPILOT-4
Location: scripts/build_release_notebook_04.py:664
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276306
Root author: copilot-pull-request-reviewer

Comment:
    Bootstrap loop: the code comment says “Degenerate resample — re-roll”, but the implementation writes NaNs and continues. Since the arrays are later used for quantiles/histograms, it’d be more robust to actually re-draw indices until non-degenerate (or explicitly filter finite values everywhere and handle the all-NaN case).

## COPILOT-5
Location: scripts/build_release_notebook_04.py
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276331
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    Summary numbers appear inconsistent with the pinned/printed metrics above (e.g., max bin error is pinned at 0.1344 in NB04_TARGETS and lift@10% is ~2.75, but the summary says ≈0.19 and ~2.6×). Please update the summary text so it reflects the actual computed values for this notebook run.

## COPILOT-6
Location: release/notebooks/_release_targets.json
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276360
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The top-level `_doc` says this file contains only “Cross-seed-median metric values”, but the new `cohort_shift` block is explicitly single-seed (seed 42). Consider updating the top-level docstring to reflect that the file now contains a mix of cross-seed medians (per-tier metrics) and single-seed cohort-shift metrics, so downstream readers don’t assume everything here is a median.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25491133072 attempt 1
Comment timestamp: 2026-05-07T10:46:00.887507+00:00
PR head commit: 397feb544b3314f235a58cd5bb73bed480ba652f

@shaypal5 shaypal5 merged commit 43b75b4 into main May 7, 2026
10 of 17 checks passed
@shaypal5 shaypal5 deleted the feat/notebooks-leakage-and-value-ranking branch May 7, 2026 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: recipes recipes/ recipe assets and registry type: feature New capability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants