PR 6.2: notebooks 03 (leakage + time windows) + 04 (lift / calibration / value / cohort)#74
Merged
Merged
Conversation
…n / value / cohort) + bootstrap robustness harness
Adds the second pair of release notebooks: a leakage walkthrough that
turns the documented `total_touches_all` trap into a teachable
contrast, and a value-aware ranking notebook that covers calibration,
lift, expected_acv-aware top-K policies, threshold selection, cohort
shift, and a within-bundle bootstrap.
Notebook 03 — leakage and time windows
- Reads the trap label off `feature_dictionary.csv`; proves the
trap by construction via a same-table comparison of
`total_touches_all` (full-horizon) vs `touch_count`
(snapshot-safe). Mean post-snapshot delta is 3.2 touches/lead;
82 % of leads have a positive delta.
- Standalone-AUC probe on the trap (~0.53 — looks innocuous) ⇒
full-panel ± trap ablation showing HistGBM extracts +0.032 AUC
where LR only squeezes +0.009. Pedagogical headline:
*standalone AUC probes undersell tree-friendly leakage*.
- Sign-aware tolerance gate pins each AUC ±0.02 and asserts
`gbm_lift > 0.015` so a future regeneration that erases or
amplifies the trap breaks CI.
Notebook 04 — lift, calibration, value, cohort, bootstrap
- Calibration / reliability diagram (LR max bin error ≈ 0.13).
- Lift @ 1/5/10 % + cumulative gains curve.
- `expected_acv × P(convert)` value-aware ranking: top-50
ACV-capture jumps from 0.16 to 0.40 vs P-only ranking.
- Threshold selection for fixed top-K capacity (e.g. 50/week).
- Cohort-shift evaluation that **reproduces the validation
report's `cohort_shift.intermediate` block exactly** (0.8754 /
0.8908 / −0.0155). Mirrors `release_quality.measure_cohort_
shift_from_bundle` down to `COHORT_TRAIN_FRAC=0.85` and
`model_random_state=0`.
- 200-iter bootstrap of the test set as the within-bundle
confidence band — explicitly framed as the proxy for true
cross-seed sweep, since public-bundle consumers can't rebuild
bundles without `leadforge` installed.
- Headline LR/GBM panel drops the trap (matches notebook 02);
cohort-shift section keeps the trap to reproduce the report.
Audit-sync wiring
- `release/notebooks/_release_targets.json` gains a
`cohort_shift.intermediate` block sourced from
`validation_report.cohort_shift.intermediate`.
- `tests/release/notebooks/test_release_targets_match_report.py`
adds `test_cohort_shift_targets_match_validation_report` to
audit-sync the new block. Existing test now skips the
`cohort_shift` key during the per-tier-medians loop.
Builders + tests
- `scripts/build_release_notebook_{03,04}.py` inherit the
deterministic-cell-ID + `--out` byte-stability pattern from
PR 6.1.
- Added to `_BUILDERS` / `_NOTEBOOKS` in
`tests/scripts/test_release_notebook_builders.py` and
`tests/release/notebooks/test_execute_notebooks.py`.
- Both notebooks execute end-to-end in <10s each (well under
G13.1's 3-min budget), assert
`manifest.exposure_mode == "student_public"` (G13.3), and
load only from `release/intermediate/`.
Forward-pointer to `docs/release/break_me_guide.md` left as plain
backtick-wrapped text; file lands in PR 6.3.
`.agent-plan.md` Phase 6 entry updated to mark PR 6.2 complete.
Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff +
mypy clean; leakage probes 0/3 on every tier; hash determinism
PASS 67/67; `validate_release_candidate --no-rebuild` exits 0;
`BUNDLE_SCHEMA_VERSION` unchanged at 5.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
Adds the next two release notebooks (03 and 04) to the v1 educational notebook sequence, along with builder scripts and CI gates to keep the notebooks deterministic, executable, and audit-synced to the validation report.
Changes:
- Add Notebook 03 (leakage + time windows) and Notebook 04 (lift/calibration/value ranking/cohort shift/bootstrap), plus their deterministic builders.
- Extend notebook CI coverage: execute the new notebooks end-to-end and include their builders in byte-stability checks.
- Add a
cohort_shift.intermediatetargets block and a dedicated audit-sync test to ensure it matchesvalidation_report.json.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
tests/scripts/test_release_notebook_builders.py |
Adds notebook 03/04 builders to the byte-stability + “matches committed” gate. |
tests/release/notebooks/test_release_targets_match_report.py |
Skips cohort_shift in the tier-median loop and adds a dedicated cohort-shift audit-sync test. |
tests/release/notebooks/test_execute_notebooks.py |
Adds notebook 03/04 to the nbclient execution matrix. |
scripts/build_release_notebook_03.py |
New deterministic builder that generates notebook 03 content. |
scripts/build_release_notebook_04.py |
New deterministic builder that generates notebook 04 content and pins metrics/tolerances. |
release/notebooks/03_leakage_and_time_windows.ipynb |
Generated notebook 03 committed artifact. |
release/notebooks/04_lift_calibration_value_ranking.ipynb |
Generated notebook 04 committed artifact. |
release/notebooks/_release_targets.json |
Adds cohort_shift.intermediate pinned values for notebook 04’s cohort-shift reproduction gate. |
.agent-plan.md |
Marks Phase 6.2 notebook work as complete and documents the delivered content. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…es, kill dead code
Brutal-reviewer pass on PR 6.2 caught real issues. Fixes folded back
in before requesting human review.
Narrative contradictions / wrong numbers (must-fix)
- NB04 §7 opened with "train on the first 70 % chronologically"
but the code uses 85/15 (matching the validator's
COHORT_TRAIN_FRAC=0.85). Whiplash inside one section. Rewrote
the opening to use 85/15 from the start and tightened the
explanation of why random_state=0 also matters.
- NB04 §10 summary cited max bin error ≈ 0.19 and top-decile
lift ≈ 2.6×. Both were copied from with-trap report values; the
headline panel drops the trap, and the actual observations are
0.13 and 2.75×. Corrected, with a one-line note distinguishing
the trap-dropped vs trap-kept numbers.
- NB03 §4 cited the report's standalone-AUC of "0.55" and then
observed 0.53 without explaining the gap. Added a sentence
pinning the methodological difference: the report fits a full
LR pipeline on the single column; the notebook uses the raw
column as a score directly. Both fall in "barely above
chance"; the difference is real and now owned.
Tolerance gates that didn't gate the claim (should-fix)
- NB03 trap_standalone_auc was pinned to a ±0.02 band around
0.531 — which silently allows the trap to drop to 0.49 (below
chance) while the test still passes, breaking section 5's
setup. Added a sign-aware ``assert standalone[TRAP] > 0.51``.
- NB03 §3 ``mean_delta > 0`` was performative (passed if even
one lead had a positive delta). Hardened to
``mean_delta > 1.0`` AND ``n_post / len(window) > 0.5``,
sitting well below the observed 3.22 / 82 % but well above
"barely positive."
- NB04 §5 value-aware ranking guard was ``gain >= -0.01``
(allows ranking to *lose* ground by up to 0.01). Strengthened
to ``gain > MIN_VALUE_GAIN = 0.05`` so a regression that
erodes the P×ACV story actually fails CI.
- Audit-sync test for the cohort_shift block had
``if not cohort_targets: return`` — silently passed if
someone deleted or renamed the block. Made the block
required; if notebook 04 ever stops needing it, the test
should be deleted, not bypassed.
Code quality (should-fix)
- NB04 ``acv_capture(scores, k)`` re-sorted the score array on
every call (60+ calls × O(N log N) per call). Pre-compute the
argsort + cumsum once; every plot point is now a cheap
cumulative-array lookup. Function signature changed from
``(scores, k)`` to ``(use_value: bool, k)`` and the §9
tolerance-gate call sites were updated to match.
- NB04 §6 narrative promised "precision / recall / count above
threshold" but the plots showed only count and precision.
Added a third panel for recall above threshold so the prose
matches the figure.
- NB03 dead code: ``with_trap_p100`` / ``without_trap_p100``
keys were computed via ``precision_at_k(...)`` but never
read downstream. Removed both keys and the now-unused
``precision_at_k`` import.
- NB03 §4 ``standalone_window = window.dropna(...).copy()``
was a no-op (window was already dropped of NaN in §3).
Reuse window directly with an inline comment about why.
- NB03 §1 ``HORIZON_DAYS`` was read from the manifest and
printed but never used in the narrative. Wove it into the
setup print so the 60-day post-snapshot hunting window is
explicit before §3 measures it.
- NB04 §5 plot title ``"Value-aware ranking captures more
revenue per outreach slot"`` was a thesis used as a label.
Made it neutral ("ACV capture vs top-K ..."); the conclusion
stays in the narrative.
All 28 notebook builder + execution + audit-sync tests still pass;
1260/1260 full-suite tests pass; ruff + mypy clean; both notebooks
execute end-to-end in <10 s each.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Copilot's review of the PR 6.2 self-review surfaced four still-real
findings (two others were marked outdated — already addressed by the
self-review pass). Folded all four back in before requesting human
review.
- (COPILOT-1) Grammar: "reflect a honest production model" →
"reflect an honest production model" in NB04 §2 narrative.
- (COPILOT-3) The threshold-selection comment claimed the K-th
highest probability + ``mask = lr_probs >= threshold`` admits
*exactly* K leads, "ties resolved by score order". The
inclusive comparison can admit more than K when leads share the
threshold's probability — and there is no implicit tie-break.
Rewrote the comment to be honest about the semantics ("admits
AT LEAST K via probs >= threshold; ties at the threshold can
inflate the slate; ``actually_above`` makes the realised count
visible"). Kept the threshold-based selection rather than
switching to a true top-K via ``argsort`` because the
pedagogical point of section 6 is *threshold selection*, not
*rank cutoff*.
- (COPILOT-4) Bootstrap loop comment said "Degenerate resample —
re-roll" but the implementation writes NaN and continues.
Rewrote the comment to match what the code does (mark NaN, let
``_summary`` filter it out) and added the actual probability
bound — with n_test=750 and base rate ~22 %, the all-positive
or all-negative draw probability is ~10⁻¹⁰⁰, so the branch is
dead in practice and exists only as a defensive safety net for
tiny test sets. Implementing a real re-roll loop would never
execute on this dataset.
- (COPILOT-6) The top-level ``_doc`` in
``release/notebooks/_release_targets.json`` claimed the file
contains only "cross-seed-median metric values", which became
inaccurate after the PR 6.2 cohort_shift block (single-seed,
seed 42). Rewrote the docstring to call out the mixed
structure: per-tier blocks hold cross-seed medians; cohort_shift
block holds single-seed values from
``validation_report.cohort_shift``. Audit-sync test continues
to enforce both cases via separate loops.
- (COPILOT-2, COPILOT-5) outdated; already addressed by the
self-review fix-up commit (25b9ec1) — the 70/30-vs-85/15
narrative whiplash and the wrong summary numbers (0.19 / 2.6×
vs the actual 0.13 / 2.75×) are both fixed there. Resolved as
already-treated.
Net: 28/28 notebook builder + execution + audit-sync tests pass;
ruff + mypy clean; both notebooks still execute end-to-end in
<10 s each.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
pr-agent-context report: This run includes unresolved review comments on PR #74 in repository https://github.com/leadforge-dev/leadforge
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: scripts/build_release_notebook_04.py
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276155
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
Grammar: “reflect a honest production model” should be “reflect an honest production model”.
## COPILOT-2
Location: scripts/build_release_notebook_04.py
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276229
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
The cohort-shift description says “train on the first 70% … score the last 30%”, but immediately below (and in the code) the split is 85%/15% (COHORT_TRAIN_FRAC=0.85). Please update the prose so it matches the implemented/validated split to avoid confusing readers and to keep the “mirrors the validator” claim precise.
## COPILOT-3
Location: scripts/build_release_notebook_04.py:433
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276269
Root author: copilot-pull-request-reviewer
Comment:
The comment says the K-th highest probability threshold “admits exactly K leads (ties resolved by score order)”, but using `mask = lr_probs >= threshold` can admit more than K when there are ties at the threshold, and there’s no explicit tie-break. Either adjust the comment/prose, or implement a true top-K selection (e.g., via argsort indices) if you want exactly CAPACITY leads.
## COPILOT-4
Location: scripts/build_release_notebook_04.py:664
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276306
Root author: copilot-pull-request-reviewer
Comment:
Bootstrap loop: the code comment says “Degenerate resample — re-roll”, but the implementation writes NaNs and continues. Since the arrays are later used for quantiles/histograms, it’d be more robust to actually re-draw indices until non-degenerate (or explicitly filter finite values everywhere and handle the all-NaN case).
## COPILOT-5
Location: scripts/build_release_notebook_04.py
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276331
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
Summary numbers appear inconsistent with the pinned/printed metrics above (e.g., max bin error is pinned at 0.1344 in NB04_TARGETS and lift@10% is ~2.75, but the summary says ≈0.19 and ~2.6×). Please update the summary text so it reflects the actual computed values for this notebook run.
## COPILOT-6
Location: release/notebooks/_release_targets.json
URL: https://github.com/leadforge-dev/leadforge/pull/74#discussion_r3200276360
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
The top-level `_doc` says this file contains only “Cross-seed-median metric values”, but the new `cohort_shift` block is explicitly single-seed (seed 42). Consider updating the top-level docstring to reflect that the file now contains a mix of cross-seed medians (per-tier metrics) and single-seed cohort-shift metrics, so downstream readers don’t assume everything here is a median.Run metadata: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
total_touches_alltrap into a teaching moment. Reads the trap label off the feature dictionary; proves the trap by construction via a same-table comparison oftotal_touches_all(full-horizon) vstouch_count(snapshot-safe) — mean post-snapshot delta 3.2 touches/lead, 82 % of leads positive. Standalone-AUC probe (~0.53, looks innocuous) ⇒ full-panel ± trap ablation showing HistGBM extracts +0.032 AUC where LR only squeezes +0.009. Pedagogical headline: standalone AUC probes undersell tree-friendly leakage.expected_acv × P(convert)value-aware ranking (top-50 ACV-capture jumps 0.16 → 0.40 vs P-only), threshold selection for fixed top-K capacity, cohort-shift evaluation that reproduces the validation report'scohort_shift.intermediateblock exactly (0.8754 / 0.8908 / −0.0155), and a 200-iter bootstrap of the test set as the within-bundle confidence band — explicitly framed as the proxy for true cross-seed sweep, since public-bundle consumers can't rebuild bundles withoutleadforgeinstalled.cohort_shift.intermediateblock inrelease/notebooks/_release_targets.json, newtest_cohort_shift_targets_match_validation_reportextension to the existing audit-sync test, both new builders inherit PR 6.1's deterministic-cell-ID +--outbyte-stability pattern, both new notebooks added to the byte-stability and end-to-end execution test parametrisations.Design decisions worth flagging
Notebook 03 framing pivot. The brief proposed an "engagement-only / firmographic-only feature panel ± trap" pedagogy on the assumption that the trap would dominate a thin set. Empirical AUCs say firmographic-only is at chance even with the trap (0.49–0.53), so the pedagogy doesn't hold. The reframed lesson — GBM extracts more from the trap than its standalone AUC predicts; LR doesn't — is honest, comes from the data, and produces a meaningful sign-aware tolerance gate (
gbm_lift > 0.015).Notebook 04 trap posture. Headline LR/GBM panel drops the trap (matches notebook 02; gives honest production numbers). Cohort-shift section keeps the trap to reproduce the report's published cohort-shift numbers exactly. Both postures are explained inline in their respective sections.
Bootstrap, not seed sweep. Resamples test-set indices 200× and recomputes already-scored AUC/AP — sub-second runtime, captures sampling noise on a single generated world but not generation-process noise across seeds. PR 6.1 forward-pointed to a "seed-sweep harness" but a true cross-seed sweep would require regenerating bundles, which Kaggle/HF consumers can't do; bootstrap is the honest within-bundle substitute and the notebook says so.
Cohort-shift reproduction. First implementation used
COHORT_TRAIN_FRAC=0.7andrandom_state=42; got cohort_split_auc=0.8854 (close but not equal to the report's 0.8908). Self-review caught the validator usesCOHORT_TRAIN_FRAC=0.85andmodel_random_state=0; switching matched the report to four decimal places.Acceptance gates (from
docs/release/v1_acceptance_gates.md)assert manifest["exposure_mode"] == "student_public"and load only fromrelease/intermediate/.Test plan
ruff check .→ all checks passedruff format --check .→ 191 files already formattedmypy leadforge/→ no issues found in 82 source filespytest -x -q→ 1260 passed, 5 publish-extra-gated skips, 1m58spytest tests/scripts/test_release_notebook_builders.py tests/release/notebooks/→ 28 passed (byte-stability + nbclient execution + targets-match-report audit-sync)python scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65→ 0/3 paths flag rate on every tierpython scripts/verify_hash_determinism.py→ PASS 67/67python scripts/validate_release_candidate.py --no-rebuild→ exits 0; rewrotevalidation_report.{json,md}timestamps which were reverted before commitBUNDLE_SCHEMA_VERSIONunchanged at 5Out of scope (PR 6.3)
docs/release/break_me_guide.md— forward-pointed to as plain backtick-wrapped text in both notebooks (no dead Markdown link)..github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.ymldocs/release/v2_decision_log.mdrelease/README.mdlink update🤖 Generated with Claude Code