leadforge-dev · shaypal5 · May 25, 2026 · May 25, 2026 · May 25, 2026
diff --git a/.agent-plan.md b/.agent-plan.md
@@ -81,16 +81,17 @@ _Source: `docs/external_review/summaries/v1_release_review_synthesis.md` — cro
   - **Label window: `<` → `<=`** (MEDIUM): Fixed in `engine.py`; test updated to use inclusive assertion.
   - **Regenerated all three public tier bundles**: intro 41.5% conv rate, intermediate 20.1%, advanced 7.9%. Validation: PASS — 3 tiers, 5 seeds, 0 leakage findings. 1439 tests pass.
 
-- [ ] **PR 8.2** — `docs(release): difficulty-axis reframe + disclosure hardening`
-  - **Reframe difficulty axis throughout all copy** (HIGH): README, dataset card, Kaggle/HF metadata, tier table, notebook headers. Change from "Intro / Intermediate / Advanced" framing of modelling difficulty to explicit prevalence/noise tier framing. Recommended: "Intro = high-prevalence classroom warm-up; Intermediate = default benchmark; Advanced = low-prevalence, calibration, and noise-handling exercise." AUC is flat across tiers; the "three difficulty tiers" framing is misleading to anyone who reads it as model complexity.
-  - **Add `calibration_max_bin_error` to README calibration table** (HIGH): the advanced tier is at 0.52 max-bin error; the current table shows only Brier, which *improves* with prevalence and actively misleads. One row added.
-  - **Clarify acceptance bands are descriptive regression fences, not realism thresholds** (HIGH): `docs/release/v1_acceptance_gates.md` — the YAML inline comments already say this; the README does not. Small doc edit, large trust impact.
-  - **Fix `isPrivate: true`** (HIGH): `release/kaggle/dataset-metadata.json` — one character; absolute publish blocker.
-  - **Change HF default config to `intro`** (MEDIUM): `release/huggingface/README.md` YAML — `default: true` on the `intermediate` config means `load_dataset("leadforge/...")` with no arguments skips the intro tier. Students should land in the easiest tier by default.
-  - **Remove `intermediate_instructor/` from public README tree** (MEDIUM): the instructor bundle reconstructs the label by construction; listing it in the public-facing "what's inside" tree is a redaction bypass risk. Verify gating is correct and scrub from public copy.
-  - **Elevate 93% account overlap to primary evaluation warning** (MEDIUM): move above the tier table in README; add: "headline metrics are random-split; for production-representative evaluation use `GroupKFold(account_id)` — see Notebook 02."
-  - **Add "non-physical values" to known limitations** (MEDIUM): one bullet: "Advanced-tier noise can make some bounded/time/count-like proxies non-physical (e.g. negative duration values); treat these as synthetic distortion artifacts."
-  - **Reconcile CLAUDE.md canonical package layout** (LOW): delete or annotate aspirational modules that don't exist; add modules that do but are missing from the layout doc.
+- [x] **PR 8.2** — `docs(release): difficulty-axis reframe + disclosure hardening`
+  - **Reframe difficulty axis throughout all copy** (HIGH): README, dataset card, Kaggle/HF metadata, tier table, notebook headers. Tiers now explicitly framed as prevalence/noise axes (Intro=high-prevalence, Intermediate=default benchmark, Advanced=low-prevalence + calibration challenge). Added "Reading this table" note explaining flat AUC. AUC is flat across tiers by design; the "difficulty" framing is gone.
+  - **Add `calibration_max_bin_error` to README calibration table** (HIGH): Advanced tier at 0.52 max-bin error now visible alongside Brier score.
+  - **Clarify acceptance bands are descriptive regression fences, not realism thresholds** (HIGH): added blockquote to `docs/release/v1_acceptance_gates.md` under Performance gates heading.
+  - **Fix `isPrivate: true`** (HIGH): `release/kaggle/dataset-metadata.json` — `isPrivate: false` now; absolute publish blocker resolved.
+  - **Change HF default config to `intro`** (MEDIUM): `DEFAULT_DEFAULT_CONFIG = "intro"` in `scripts/package_hf_release.py`; `release/huggingface/README.md` regenerated with `intro` as default. Students landing on `load_dataset("leadforge/...")` with no args now get the easiest tier.
+  - **Remove `intermediate_instructor/` from public README tree** (MEDIUM): scrubbed from `release/README.md`, `release/huggingface/README.md`, and `SOURCE_TREE_BLOCK` in `scripts/_release_common.py`.
+  - **Elevate 93% account overlap to primary evaluation warning** (MEDIUM): "Evaluation note — account overlap" section added BEFORE tier table in both READMEs.
+  - **Add "non-physical values" to known limitations** (MEDIUM): bullet added to Known limitations section.
+  - **Reconcile CLAUDE.md canonical package layout** (LOW): deferred — not blocking publish.
+  - Preview committed samples updated: `release/_preview_committed/{kaggle,huggingface_public,huggingface_instructor}.html`
   - Labels: `type: docs`, `layer: render`, `layer: validation`
   - Size: M (~250 lines across multiple docs)
 

diff --git a/docs/release/v1_acceptance_gates.md b/docs/release/v1_acceptance_gates.md
@@ -66,6 +66,15 @@ Bands fitted to the PR 3.3 N=5 sweep on `release/{intro,intermediate,advanced}/`
 All numeric bands live in `v1_acceptance_gates_bands.yaml`; medians and
 rationale follow.
 
+> **These bands are regression fences, not realism thresholds.**
+> They are calibrated to the observed five-seed spread for this DGP and
+> recipe configuration. A band being "wide" does not mean any value within
+> it is equally realistic — it means the validator will not flag a new
+> bundle as broken unless a metric drifts *outside* that window. The medians
+> in each gate note are the meaningful targets; bands only fire on
+> substantial unintended regressions. Tightening the bands is expected work
+> when the DGP is redesigned for v2.
+
 ### Intro tier
 - **G7.1.1** Conversion rate within **[0.24, 0.61]**.  Median 0.4267.
 - **G7.1.2** LR AUC within **[0.82, 0.94]**.  Median 0.8788.

diff --git a/release/README.md b/release/README.md
@@ -35,7 +35,6 @@ release/
 │   ├── lead_scoring.csv              # flat convenience CSV (all splits)
 │   ├── tables/*.parquet              # 7 snapshot-safe relational tables
 │   └── tasks/converted_within_90_days/{train,valid,test}.parquet
-├── intermediate_instructor/          # research companion: full-horizon tables + metadata/
 ├── docs/                             # vendored DGP / leakage / break-me docs (agent-readable)
 ├── notebooks/                        # 01 baseline · 02 relational · 03 leakage · 04 calibration
 ├── metrics.json                      # top-level cross-tier metrics summary
@@ -109,14 +108,33 @@ exception is `total_touches_all`, the leakage trap — flagged
 `leakage_risk=True` in `feature_dictionary.csv`. Drop it from your
 feature set unless you're demonstrating leakage detection.
 
+## Evaluation note — account and contact overlap
+
+**518 of 557 test accounts (≈93 %) appear in train** on the intermediate
+bundle; the other tiers are similar. Contact-level overlap is comparable
+in magnitude: most test contacts also have activity in the training set.
+The random-split headline metrics therefore ride both account-level and
+contact-level signal across the split boundary and over-estimate
+generalisation to unseen accounts and contacts. For a faithful
+out-of-sample number, retrain with `GroupKFold(account_id)` and report
+both metrics. Notebook 02 demonstrates the detection recipe;
+[`break_me_guide.md`](../docs/release/break_me_guide.md) §5 gives
+the worked example.
+
 ## Dataset summary
 
+**Tiers are prevalence and noise axes, not modelling-complexity axes.**
+LR AUC is ~0.88 in every tier by design. The tiers differ in conversion
+rate, missingness, and noise — not rank discrimination. Choose a tier
+based on the teaching exercise, not on expected AUC:
+
 | | Intro | Intermediate | Advanced |
 |---|---|---|---|
+| **Tier purpose** | High-prevalence warm-up | Default benchmark | Low-prevalence · calibration · noise exercise |
 | Leads | 5,000 | 5,000 | 5,000 |
 | Accounts | 1,500 | 1,500 | 1,500 |
 | Contacts | 4,200 | 4,200 | 4,200 |
-| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |
+| Snapshot columns | 31 / 34* | 31 / 34* | 31 / 34* |
 | Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |
 | Conversion rate (acceptance band, gate G7.\*) | 24–61% | 12–31% | 4–12% |
 | Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |
@@ -178,14 +196,22 @@ with bands declared in
 [`docs/release/v1_acceptance_gates_bands.yaml`](../docs/release/v1_acceptance_gates_bands.yaml).
 Headline cross-seed medians (seeds 42–46):
 
-| Tier | LR AUC | AP | P@100 | Brier |
-|---|---|---|---|---|
-| intro | 0.879 | 0.761 | 0.80 | 0.130 |
-| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |
-| advanced | 0.886 | 0.351 | 0.34 | 0.061 |
+| Tier | LR AUC | AP | P@100 | Brier | `calibration_max_bin_error` |
+|---|---|---|---|---|---|
+| intro | 0.879 | 0.761 | 0.80 | 0.130 | 0.25 |
+| intermediate | 0.886 | 0.575 | 0.59 | 0.110 | 0.25 |
+| advanced | 0.886 | 0.351 | 0.34 | 0.061 | **0.52** |
+
+**Reading this table:** LR AUC is flat across tiers by design — the
+tiers are a prevalence / noise axis, not a rank-discrimination axis.
+Brier score *improves* as prevalence falls (a prevalence effect, not
+better calibration); use `calibration_max_bin_error` to assess
+calibration quality. Advanced's 0.52 max-bin error means the model's
+predicted probabilities are materially mis-scaled against actual
+conversion rates — a realistic miscalibration exercise.
 
 AP, P@100, conversion-rate, and lift orderings hold across the
-intended difficulty axis (intro > intermediate > advanced).
+intended prevalence axis (intro > intermediate > advanced).
 
 ## Intended uses
 
@@ -211,9 +237,16 @@ intended difficulty axis (intro > intermediate > advanced).
 
 ## Known limitations
 
-- **Difficulty signal on raw AUC is flat.** LR AUC is ~0.88 across
-  every tier. Difficulty is visible in AP, P@K, Brier, and value
-  capture. Treat AUC as a sanity check, not a difficulty signal.
+- **Tiers are a prevalence / noise axis, not a modelling-complexity
+  axis.** LR AUC is ~0.88 in every tier; the three tiers differ in
+  conversion rate (43% / 22% / 8%), noise scale, and missingness —
+  not in rank discrimination. Use AP, P@K, and calibration metrics
+  to see the difficulty gradient; AUC alone will not show it.
+- **93% account and contact overlap across train / test splits.** Random
+  splits are keyed on lead ID; most test accounts and contacts also
+  appear in train. Headline metrics over-state generalisation to unseen
+  accounts and contacts. Use `GroupKFold(account_id)` for a faithful
+  estimate.
 - **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta
   is slightly negative in every tier (intro −0.0045, intermediate
   −0.0072, advanced −0.0133); v1's snapshot is dominated by linear
@@ -227,30 +260,31 @@ intended difficulty axis (intro > intermediate > advanced).
 - **Cohort-shift degradation is small.** v1 has no time-of-year drift
   baked in; the cohort-shift gate (G6.4) is informational and will
   bite in v2.
+- **Advanced-tier noise can produce artifact zeros in count and duration
+  columns.** Gaussian noise is applied before MCAR missingness; the
+  snapshot builder clamps results below zero to zero. What users observe
+  is therefore not negative values but zeros that may be noise artifacts
+  rather than true zero values — e.g. `days_since_last_touch = 0` might
+  mean "noised below zero, clamped" rather than "touched today". Treat
+  suspicious zero clusters in the Advanced tier as intentional
+  data-cleaning exercise material.
 
 ## Composition
 
 - **Entities.** Accounts, contacts, leads, touches, sessions,
   sales_activities, opportunities (public); plus customers and
   subscriptions (instructor only). Per-row counts per bundle live in
   `manifest.json`.
-- **Features.** 32 public columns grouped by analytical role in
+- **Features.** 31 public columns grouped by analytical role in
   [`docs/release/feature_dictionary.md`](../docs/release/feature_dictionary.md);
   the per-bundle `feature_dictionary.csv` is the authoritative
   machine-readable spec.
 - **Label.** `converted_within_90_days` (boolean), event-derived from
   the simulator. Never sampled directly.
 - **Splits.** 70/15/15 train/valid/test, deterministic given seed;
   recorded in `tasks/converted_within_90_days/task_manifest.json`.
-  **Group-leakage warning:** the splitter is keyed on `lead_id` only,
-  not on `account_id` or `contact_id`. On the as-shipped intermediate
-  bundle, **518 of 557 test accounts (≈93 %) also appear in train**;
-  the contact-level overlap is similar in magnitude. A flat baseline
-  trained on the random split rides account-level signal across the
-  split boundary. For a generalisation-faithful number, retrain with
-  `GroupKFold(account_id)` (or `contact_id`) and report both — see
-  [`break_me_guide.md`](../docs/release/break_me_guide.md) §5 for the
-  detection recipe.
+  Splits are keyed on `lead_id`; see the *Evaluation note* above for
+  the account-overlap caveat.
 - **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package
   version stamped in `manifest.json`.