Merged
13 changes: 7 additions & 6 deletions .agent-plan.md
@@ -95,13 +95,14 @@ _Source: `docs/external_review/summaries/v1_release_review_synthesis.md` — cro
- Labels: `type: docs`, `layer: render`, `layer: validation`
- Size: M (~250 lines across multiple docs)

- [ ] **PR 8.3** — `docs(notebooks): teaching improvements`
- **Fix stale internal forward-references** (MEDIUM): Notebooks 01 and 02 still say "Notebook 03 *(coming in PR 6.2)*" and "Notebook 04 *(coming in PR 6.2)*." Internal PR/phase numbers should not appear in published teaching material.
- **Add prominent banner to Notebook 01** (MEDIUM): nb01 deliberately keeps `total_touches_all` to reproduce the validation panel; a beginner lifting the feature selection block inherits the trap. Add a two-cell banner: "⚠️ This notebook reproduces the published validation panel and intentionally includes the leakage trap. Start at Notebook 02 for clean modelling."
- **Add "switch to Advanced, watch calibration break" cell to Notebook 04** (MEDIUM): nb04 teaches calibration on Intermediate (max-bin error ~0.13, looks good). Advanced is at 0.52 and students are never shown it. A single `BUNDLE = Path("../advanced")` swap with commentary closes the gap.
- **Add `GroupKFold(account_id)` section to Notebook 02 or 04** (MEDIUM): 93% account overlap is the README's top disclosed limitation but no notebook demonstrates it. Add: train on account-split train set, evaluate on unseen accounts, show metric delta vs. random split.
- [x] **PR 8.3** — `docs(notebooks): teaching improvements`
- **Fix stale internal forward-references** (MEDIUM): All "*(coming in PR 6.2)*" refs removed from nb01 (§4 prose + §10 Next), nb02 (§8 Honest takeaway + Next). Notebooks 03 and 04 are now shipped; internal PR numbers removed from published teaching material.
- **Add warning banner to Notebook 01** (MEDIUM): `build_release_notebook_01.py` inserts a callout block after the title cell: "⚠️ Validation-panel notebook — leakage trap retained intentionally. Start at Notebook 02 for clean modelling."
- **Add Advanced-tier calibration demo to Notebook 04** (MEDIUM): §3a added — loads `../advanced`, runs same LR pipeline, shows side-by-side reliability diagram (intermediate max-bin err ≈0.13 vs advanced ≈0.52). Confirms AUC barely moves across tiers; calibration is the discriminating metric. Implemented in `build_release_notebook_04.py`.
- **Add `GroupKFold(account_id)` section to Notebook 02** (MEDIUM): §9 added — pools train+test, runs 5-fold account-grouped CV with LR, prints per-fold AUC, reports optimism in the headline random-split AUC. Demonstrates the 93% overlap limitation concretely. Implemented in `build_release_notebook_02.py`.
- Changes applied to builder scripts (canonical source); notebooks regenerated and verified byte-stable by builder tests.
- Labels: `type: docs`, `layer: render`
- Size: S (~200 lines across 4 notebooks)
- Size: S (~200 lines across 3 builder scripts)

- [ ] **PR 8.4** — `feat(scripts): integration script + preview hardening`
- **Regenerate lockfile + bump to v1.0.1** (HIGH): delete `package-lock.json`, update `package.json` pin to `github:ShmuggingFace/ShmuggingFaceCore#v1.0.1`, regenerate via HTTPS. Fixes SSH lockfile and gets the socks/laundry copy fix in one step.
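
The PR 8.3 entry above describes an account-grouped 5-fold CV in Notebook 02 §9. The sketch below illustrates only the splitting idea, using a hand-rolled stdlib stand-in for scikit-learn's `GroupKFold`; it is not the notebook's code. The function names `group_kfold_indices` and `account_overlap` and the sample account IDs are hypothetical.

```python
from collections import defaultdict


def group_kfold_indices(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides.

    Stdlib stand-in for grouped k-fold splitting (illustrative only).
    """
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    if len(members) < n_splits:
        raise ValueError("need at least n_splits distinct groups")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest rows.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(members[g])

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j in range(n_splits) if j != k for i in folds[j])
        yield train_idx, test_idx


def account_overlap(train_accounts, test_accounts):
    """Fraction of distinct test accounts that also occur in train."""
    test_set = set(test_accounts)
    if not test_set:
        return 0.0
    return len(test_set & set(train_accounts)) / len(test_set)


if __name__ == "__main__":
    # Hypothetical account IDs, one per lead row.
    account_id = ["a", "a", "b", "c", "c", "c", "d", "e", "f", "f"]
    for fold, (tr, te) in enumerate(group_kfold_indices(account_id)):
        overlap = account_overlap([account_id[i] for i in tr],
                                  [account_id[i] for i in te])
        print(fold, len(tr), len(te), overlap)
```

With a random row-level split, `account_overlap` would typically be well above zero (the plan cites 93% for the shipped split); with grouped folds it is zero by construction, which is what makes the per-fold AUC an estimate for unseen accounts.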
6 changes: 4 additions & 2 deletions .github/workflows/ci.yml
@@ -177,8 +177,10 @@ jobs:
- run: pip install -e ".[dev,scripts,notebooks]"
- name: Register python3 kernelspec for nbclient
run: python -m ipykernel install --user --name python3
- name: Build the intermediate public bundle (only tier the notebooks need)
run: python scripts/build_public_release.py release --tier intermediate
- name: Build intermediate and advanced public bundles (needed by nb04 §4)
run: |
python scripts/build_public_release.py release --tier intermediate
python scripts/build_public_release.py release --tier advanced
- name: Execute release notebooks end-to-end + builder byte-stability
run: |
pytest tests/release/notebooks/test_execute_notebooks.py \
7 changes: 7 additions & 0 deletions .gitignore
@@ -239,3 +239,10 @@ release/huggingface-instructor/*
# under release/_preview_committed/ is the audit-artefact-sync gate
# and is checked into git separately.
release/_preview/

# ShmuggingFace mock-review site (PR 7.2 tooling) — Node.js install +
# generated static site + Cloudflare Pages cache. None of these are
# repo artifacts; they are rebuilt on demand.
node_modules/
.wrangler/
release/_shmuggingface/
10 changes: 5 additions & 5 deletions release/claims_register.json
@@ -45,31 +45,31 @@
"backing_path": "$.tiers.<tier>.medians.lr_auc",
"category": "calibration",
"id": "c06",
"text": "Cross-seed median LR AUC: intro 0.879, intermediate 0.886, advanced 0.886.",
"text": "Cross-seed median LR AUC: intro 0.671, intermediate 0.663, advanced 0.624.",
"verifier": "scripts/validate_release_candidate.py"
},
{
"backing_artifact": "release/metrics.json",
"backing_path": "$.tiers.<tier>.medians.lr_average_precision",
"category": "calibration",
"id": "c07",
"text": "Cross-seed median LR Average Precision: intro 0.761, intermediate 0.575, advanced 0.351.",
"text": "Cross-seed median LR Average Precision: intro 0.555, intermediate 0.332, advanced 0.122.",
"verifier": "scripts/validate_release_candidate.py"
},
{
"backing_artifact": "release/metrics.json",
"backing_path": "$.tiers.<tier>.medians.precision_at_100",
"category": "calibration",
"id": "c08",
"text": "Cross-seed median P@100: intro 0.80, intermediate 0.59, advanced 0.34.",
"text": "Cross-seed median P@100: intro 0.60, intermediate 0.33, advanced 0.11.",
"verifier": "scripts/validate_release_candidate.py"
},
{
"backing_artifact": "release/metrics.json",
"backing_path": "$.tiers.<tier>.medians.brier_score",
"category": "calibration",
"id": "c09",
"text": "Cross-seed median Brier score: intro 0.130, intermediate 0.110, advanced 0.061.",
"text": "Cross-seed median Brier score: intro 0.220, intermediate 0.160, advanced 0.076.",
"verifier": "scripts/validate_release_candidate.py"
},
{
@@ -93,7 +93,7 @@
"backing_path": "$.tiers.<tier>.medians.gbm_minus_lr_auc",
"category": "limitations",
"id": "c12",
"text": "GBM-LR AUC delta is slightly negative in every tier (-0.0045 / -0.0072 / -0.0133); v1's snapshot is dominated by linear features.",
"text": "GBM-LR AUC delta is negative in every tier (-0.011 / -0.018 / -0.024); v1's snapshot is dominated by linear features.",
"verifier": "scripts/validate_release_candidate.py"
},
{
10 changes: 5 additions & 5 deletions release/claims_register.md
@@ -14,10 +14,10 @@ twin of this document with the same data plus a schema block.
| ID | Claim | Backing artifact | Path | Verifier |
|---|---|---|---|---|
| `c05` | Conversion rate (cross-seed median, seeds 42-46): intro 42.67%, intermediate 21.60%, advanced 8.40%. | `release/metrics.json` | `$.tiers.<tier>.medians.conversion_rate_test` | `scripts/validate_release_candidate.py` |
| `c06` | Cross-seed median LR AUC: intro 0.879, intermediate 0.886, advanced 0.886. | `release/metrics.json` | `$.tiers.<tier>.medians.lr_auc` | `scripts/validate_release_candidate.py` |
| `c07` | Cross-seed median LR Average Precision: intro 0.761, intermediate 0.575, advanced 0.351. | `release/metrics.json` | `$.tiers.<tier>.medians.lr_average_precision` | `scripts/validate_release_candidate.py` |
| `c08` | Cross-seed median P@100: intro 0.80, intermediate 0.59, advanced 0.34. | `release/metrics.json` | `$.tiers.<tier>.medians.precision_at_100` | `scripts/validate_release_candidate.py` |
| `c09` | Cross-seed median Brier score: intro 0.130, intermediate 0.110, advanced 0.061. | `release/metrics.json` | `$.tiers.<tier>.medians.brier_score` | `scripts/validate_release_candidate.py` |
| `c06` | Cross-seed median LR AUC: intro 0.671, intermediate 0.663, advanced 0.624. | `release/metrics.json` | `$.tiers.<tier>.medians.lr_auc` | `scripts/validate_release_candidate.py` |
| `c07` | Cross-seed median LR Average Precision: intro 0.555, intermediate 0.332, advanced 0.122. | `release/metrics.json` | `$.tiers.<tier>.medians.lr_average_precision` | `scripts/validate_release_candidate.py` |
| `c08` | Cross-seed median P@100: intro 0.60, intermediate 0.33, advanced 0.11. | `release/metrics.json` | `$.tiers.<tier>.medians.precision_at_100` | `scripts/validate_release_candidate.py` |
| `c09` | Cross-seed median Brier score: intro 0.220, intermediate 0.160, advanced 0.076. | `release/metrics.json` | `$.tiers.<tier>.medians.brier_score` | `scripts/validate_release_candidate.py` |

## composition

@@ -45,7 +45,7 @@ twin of this document with the same data plus a schema block.

| ID | Claim | Backing artifact | Path | Verifier |
|---|---|---|---|---|
| `c12` | GBM-LR AUC delta is slightly negative in every tier (-0.0045 / -0.0072 / -0.0133); v1's snapshot is dominated by linear features. | `release/metrics.json` | `$.tiers.<tier>.medians.gbm_minus_lr_auc` | `scripts/validate_release_candidate.py` |
| `c12` | GBM-LR AUC delta is negative in every tier (-0.011 / -0.018 / -0.024); v1's snapshot is dominated by linear features. | `release/metrics.json` | `$.tiers.<tier>.medians.gbm_minus_lr_auc` | `scripts/validate_release_candidate.py` |
| `c13` | lead_source is weakly informative — out-of-sample univariate AUC ~0.50-0.52 across tiers, per-channel rate spread <=0.05. | `release/docs/channel_signal_audit.md` | `n/a (prose)` | `scripts/audit_channel_signal.py` |
| `c14` | Cohort-shift AUC degradation is small (v1 has no time-of-year drift baked in). | `release/metrics.json` | `$.cohort_shift.<tier>.auc_degradation` | `scripts/validate_release_candidate.py` |

10 changes: 5 additions & 5 deletions release/claims_register_source.yaml
@@ -57,28 +57,28 @@ claims:
verifier: scripts/validate_release_candidate.py

- id: c06
text: "Cross-seed median LR AUC: intro 0.879, intermediate 0.886, advanced 0.886."
text: "Cross-seed median LR AUC: intro 0.671, intermediate 0.663, advanced 0.624."
category: calibration
backing_artifact: release/metrics.json
backing_path: $.tiers.<tier>.medians.lr_auc
verifier: scripts/validate_release_candidate.py

- id: c07
text: "Cross-seed median LR Average Precision: intro 0.761, intermediate 0.575, advanced 0.351."
text: "Cross-seed median LR Average Precision: intro 0.555, intermediate 0.332, advanced 0.122."
category: calibration
backing_artifact: release/metrics.json
backing_path: $.tiers.<tier>.medians.lr_average_precision
verifier: scripts/validate_release_candidate.py

- id: c08
text: "Cross-seed median P@100: intro 0.80, intermediate 0.59, advanced 0.34."
text: "Cross-seed median P@100: intro 0.60, intermediate 0.33, advanced 0.11."
category: calibration
backing_artifact: release/metrics.json
backing_path: $.tiers.<tier>.medians.precision_at_100
verifier: scripts/validate_release_candidate.py

- id: c09
text: "Cross-seed median Brier score: intro 0.130, intermediate 0.110, advanced 0.061."
text: "Cross-seed median Brier score: intro 0.220, intermediate 0.160, advanced 0.076."
category: calibration
backing_artifact: release/metrics.json
backing_path: $.tiers.<tier>.medians.brier_score
@@ -99,7 +99,7 @@ claims:
verifier: leadforge inspect

- id: c12
text: "GBM-LR AUC delta is slightly negative in every tier (-0.0045 / -0.0072 / -0.0133); v1's snapshot is dominated by linear features."
text: "GBM-LR AUC delta is negative in every tier (-0.011 / -0.018 / -0.024); v1's snapshot is dominated by linear features."
category: limitations
backing_artifact: release/metrics.json
backing_path: $.tiers.<tier>.medians.gbm_minus_lr_auc
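
Each claim text in the three register files restates rounded medians from `release/metrics.json`. As a rough sketch of how that agreement could be checked, the snippet below parses the per-tier numbers out of a claim sentence and compares them with the backing values, allowing half a unit in the last printed digit. It is not `scripts/validate_release_candidate.py`, whose logic is not shown in this diff; `claimed_values` and `matches` are hypothetical helpers. The numbers are copied from claim `c06` and the `lr_auc` medians in this change.

```python
import re

TIERS = ("intro", "intermediate", "advanced")


def claimed_values(text):
    """Extract 'tier value' pairs such as 'intro 0.671' from a claim sentence.

    Values stay as strings so the printed precision is preserved.
    """
    found = {}
    for tier in TIERS:
        m = re.search(rf"\b{tier}\s+(-?\d+\.\d+)", text)
        if m:
            found[tier] = m.group(1)
    return found


def matches(claim_str, metric, slack=1e-9):
    """True if metric rounds to claim_str at the claim's printed precision."""
    decimals = len(claim_str.split(".")[1])
    return abs(float(claim_str) - metric) <= 0.5 * 10 ** -decimals + slack


# Claim c06 as updated in this diff, and the lr_auc medians it is backed by.
C06 = "Cross-seed median LR AUC: intro 0.671, intermediate 0.663, advanced 0.624."
LR_AUC = {"intro": 0.6708, "intermediate": 0.6625, "advanced": 0.6236}

if __name__ == "__main__":
    for tier, value in claimed_values(C06).items():
        print(tier, value, matches(value, LR_AUC[tier]))
```

Comparing with a tolerance rather than calling `round()` avoids depending on how a borderline value such as 0.6625 happens to round in binary floating point.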
134 changes: 67 additions & 67 deletions release/metrics.json
@@ -5,21 +5,21 @@
},
"cohort_shift": {
"advanced": {
"auc_degradation": 0.0098,
"cohort_split_auc": 0.8628,
"random_split_auc": 0.8726,
"auc_degradation": -0.0448,
"cohort_split_auc": 0.578,
"random_split_auc": 0.5331,
"seed": 42
},
"intermediate": {
"auc_degradation": -0.0155,
"cohort_split_auc": 0.8908,
"random_split_auc": 0.8754,
"auc_degradation": 0.0592,
"cohort_split_auc": 0.5933,
"random_split_auc": 0.6524,
"seed": 42
},
"intro": {
"auc_degradation": 0.0156,
"cohort_split_auc": 0.8573,
"random_split_auc": 0.8729,
"auc_degradation": -0.0076,
"cohort_split_auc": 0.656,
"random_split_auc": 0.6485,
"seed": 42
}
},
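
The `cohort_shift` values in this hunk are consistent with `auc_degradation` being `random_split_auc` minus `cohort_split_auc`, so a positive number means the cohort split scored lower. That reading is inferred from the numbers here, not from the generating script. A small sketch that checks it against the updated values, with a tolerance that allows for all three fields having been rounded to four decimals independently:

```python
# Updated cohort_shift values copied from this hunk (seed 42).
COHORT_SHIFT = {
    "advanced": {"auc_degradation": -0.0448,
                 "cohort_split_auc": 0.578, "random_split_auc": 0.5331},
    "intermediate": {"auc_degradation": 0.0592,
                     "cohort_split_auc": 0.5933, "random_split_auc": 0.6524},
    "intro": {"auc_degradation": -0.0076,
              "cohort_split_auc": 0.656, "random_split_auc": 0.6485},
}


def degradation_consistent(entry, tol=1.5e-4):
    """Check auc_degradation ~= random_split_auc - cohort_split_auc.

    The formula is an assumption inferred from the data; tol covers
    independent 4-decimal rounding of the three stored fields.
    """
    expected = entry["random_split_auc"] - entry["cohort_split_auc"]
    return abs(expected - entry["auc_degradation"]) <= tol


def cohort_split_scored_lower(entry):
    """Positive degradation means the cohort split had the lower AUC."""
    return entry["auc_degradation"] > 0


if __name__ == "__main__":
    for tier, entry in COHORT_SHIFT.items():
        print(tier, degradation_consistent(entry), cohort_split_scored_lower(entry))
```

Under this reading, only the intermediate tier shows a lower AUC on the cohort split in the updated numbers; advanced and intro score slightly higher on the cohort split than on the random split.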
@@ -52,7 +52,7 @@
"precision_at_100_intermediate_gt_advanced": true,
"precision_at_100_intro_gt_intermediate": true
},
"generation_timestamp": "2026-05-06T07:38:31+00:00",
"generation_timestamp": "2026-05-26T21:23:32+00:00",
"notes": "Headline metrics surfaced in the README are cross-seed medians over the canonical N=5 sweep (seeds 42-46). Per-seed values live under tiers.<tier>.per_seed in validation_report.json.",
"package_version": "1.0.0",
"release_id": "leadforge-lead-scoring-v1",
@@ -83,17 +83,17 @@
"yaml_path": "advanced"
},
"medians": {
"brier_score": 0.0611,
"calibration_max_bin_error": 0.5234,
"brier_score": 0.0758,
"calibration_max_bin_error": 0.221,
"conversion_rate_test": 0.084,
"gbm_auc": 0.8726,
"gbm_average_precision": 0.3239,
"gbm_minus_lr_auc": -0.0133,
"log_loss": 0.1947,
"lr_auc": 0.8861,
"lr_average_precision": 0.3514,
"precision_at_100": 0.34,
"top_decile_rate": 0.3333
"gbm_auc": 0.6003,
"gbm_average_precision": 0.1225,
"gbm_minus_lr_auc": -0.0242,
"log_loss": 0.2802,
"lr_auc": 0.6236,
"lr_average_precision": 0.1218,
"precision_at_100": 0.11,
"top_decile_rate": 0.1067
},
"n_seeds": 5,
"seeds": [
@@ -108,16 +108,16 @@
"json_path": "$.tiers.advanced"
},
"spreads_max_minus_min": {
"brier_score": 0.0152,
"calibration_max_bin_error": 0.4828,
"brier_score": 0.0156,
"calibration_max_bin_error": 0.5634,
"conversion_rate_test": 0.02,
"gbm_auc": 0.0171,
"gbm_average_precision": 0.0324,
"gbm_minus_lr_auc": 0.0251,
"log_loss": 0.0535,
"lr_auc": 0.0401,
"lr_average_precision": 0.0814,
"top_decile_rate": 0.0533
"gbm_auc": 0.1056,
"gbm_average_precision": 0.0605,
"gbm_minus_lr_auc": 0.0202,
"log_loss": 0.056,
"lr_auc": 0.1,
"lr_average_precision": 0.056,
"top_decile_rate": 0.0667
},
"tier": "advanced"
},
@@ -136,17 +136,17 @@
"yaml_path": "intermediate"
},
"medians": {
"brier_score": 0.1096,
"calibration_max_bin_error": 0.249,
"brier_score": 0.1604,
"calibration_max_bin_error": 0.2785,
"conversion_rate_test": 0.216,
"gbm_auc": 0.8755,
"gbm_average_precision": 0.5621,
"gbm_minus_lr_auc": -0.0072,
"log_loss": 0.33,
"lr_auc": 0.8859,
"lr_average_precision": 0.5752,
"precision_at_100": 0.59,
"top_decile_rate": 0.5867
"gbm_auc": 0.6339,
"gbm_average_precision": 0.2912,
"gbm_minus_lr_auc": -0.0179,
"log_loss": 0.4891,
"lr_auc": 0.6625,
"lr_average_precision": 0.3318,
"precision_at_100": 0.33,
"top_decile_rate": 0.32
},
"n_seeds": 5,
"seeds": [
@@ -161,16 +161,16 @@
"json_path": "$.tiers.intermediate"
},
"spreads_max_minus_min": {
"brier_score": 0.0161,
"calibration_max_bin_error": 0.3215,
"brier_score": 0.0202,
"calibration_max_bin_error": 0.3632,
"conversion_rate_test": 0.0467,
"gbm_auc": 0.027,
"gbm_average_precision": 0.0593,
"gbm_minus_lr_auc": 0.0152,
"log_loss": 0.035,
"lr_auc": 0.023,
"lr_average_precision": 0.0863,
"top_decile_rate": 0.12
"gbm_auc": 0.0517,
"gbm_average_precision": 0.1004,
"gbm_minus_lr_auc": 0.0384,
"log_loss": 0.0503,
"lr_auc": 0.0594,
"lr_average_precision": 0.1237,
"top_decile_rate": 0.1333
},
"tier": "intermediate"
},
@@ -189,17 +189,17 @@
"yaml_path": "intro"
},
"medians": {
"brier_score": 0.1301,
"calibration_max_bin_error": 0.2497,
"brier_score": 0.2197,
"calibration_max_bin_error": 0.1761,
"conversion_rate_test": 0.4267,
"gbm_auc": 0.8729,
"gbm_average_precision": 0.7527,
"gbm_minus_lr_auc": -0.0045,
"log_loss": 0.4008,
"lr_auc": 0.8788,
"lr_average_precision": 0.7608,
"precision_at_100": 0.8,
"top_decile_rate": 0.7733
"gbm_auc": 0.6838,
"gbm_average_precision": 0.548,
"gbm_minus_lr_auc": -0.0105,
"log_loss": 0.6273,
"lr_auc": 0.6708,
"lr_average_precision": 0.5547,
"precision_at_100": 0.6,
"top_decile_rate": 0.6133
},
"n_seeds": 5,
"seeds": [
@@ -214,16 +214,16 @@
"json_path": "$.tiers.intro"
},
"spreads_max_minus_min": {
"brier_score": 0.0184,
"calibration_max_bin_error": 0.196,
"brier_score": 0.0293,
"calibration_max_bin_error": 0.1288,
"conversion_rate_test": 0.092,
"gbm_auc": 0.0232,
"gbm_average_precision": 0.06,
"gbm_minus_lr_auc": 0.0225,
"log_loss": 0.0557,
"lr_auc": 0.0272,
"lr_average_precision": 0.067,
"top_decile_rate": 0.08
"gbm_auc": 0.1214,
"gbm_average_precision": 0.1207,
"gbm_minus_lr_auc": 0.054,
"log_loss": 0.0655,
"lr_auc": 0.0871,
"lr_average_precision": 0.1041,
"top_decile_rate": 0.12
},
"tier": "intro"
}
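
The `calibration_max_bin_error` field is not defined anywhere in this diff. One common reading, assumed here, is the largest gap between mean predicted probability and observed positive rate over equal-width probability bins. The sketch below implements that reading with the stdlib only; the function name, the bin count of 10, and the toy inputs are illustrative assumptions, and the project's own computation may differ.

```python
def max_bin_calibration_error(y_true, y_prob, n_bins=10):
    """Largest |mean predicted - observed rate| over non-empty bins.

    Assumed definition (not confirmed by the source): equal-width bins
    on [0, 1], with a probability of exactly 1.0 placed in the last bin.
    """
    # Per bin: [sum of predictions, sum of labels, row count].
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        k = min(int(p * n_bins), n_bins - 1)
        bins[k][0] += p
        bins[k][1] += y
        bins[k][2] += 1
    gaps = [abs(sp / n - sy / n) for sp, sy, n in bins if n]
    return max(gaps) if gaps else 0.0


if __name__ == "__main__":
    # Toy inputs: one well-calibrated bin and one badly overconfident bin.
    labels = [1, 0, 0, 0, 0, 0]
    probs = [0.25, 0.25, 0.25, 0.25, 0.9, 0.9]
    print(max_bin_calibration_error(labels, probs))
```

Because the statistic is a maximum over bins, a single sparsely populated bin can dominate it, which would be consistent with the large cross-seed spreads reported for this field under `spreads_max_minus_min`.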