From a0e7b0fa7fa5235ffa2191aef7b619c8b682a38b Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Mon, 25 May 2026 14:00:41 +0300
Subject: [PATCH 1/2] docs(release): PR 8.2 difficulty-axis reframe +
 disclosure hardening
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

HIGH issues resolved:
- Reframe tiers as prevalence/noise axes throughout release/README.md
  and release/huggingface/README.md: added 'Tier purpose' row, 'Reading
  this table' note explaining flat AUC, updated tier descriptions from
  modelling-difficulty framing to prevalence + noise characterisation.
- Add calibration_max_bin_error column to metrics table (advanced=0.52
  is the signal that was hidden behind Brier-only reporting).
- Clarify acceptance bands are regression fences, not realism thresholds:
  added blockquote in docs/release/v1_acceptance_gates.md.
- Fix isPrivate: true → false in release/kaggle/dataset-metadata.json.

MEDIUM issues resolved:
- Change HF default config intro (was intermediate): DEFAULT_DEFAULT_CONFIG
  in scripts/package_hf_release.py + regenerated release/huggingface/README.md
  + updated test assertion.
- Remove intermediate_instructor/ from public README What's-inside tree
  and from SOURCE_TREE_BLOCK in scripts/_release_common.py.
- Elevate 93% account-overlap warning: added 'Evaluation note — account
  overlap' section before tier table in both READMEs.
- Add non-physical values bullet to Known limitations.
- Add 'intended prevalence axis' language; change 'intended difficulty
  axis' wording.

Preview committed samples updated:
  release/_preview_committed/{kaggle,huggingface_public}.html

Tests: 1418 passed, 5 skipped (notebook execution test pre-existing fail
on main; unrelated to this PR).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 .agent-plan.md                                | 21 +++---
 docs/release/v1_acceptance_gates.md           |  9 +++
 release/README.md                             | 70 ++++++++++++------
 .../huggingface_public.html                   | 72 ++++++++++++++-----
 release/_preview_committed/kaggle.html        | 68 +++++++++++++-----
 release/huggingface/README.md                 | 71 ++++++++++++------
 release/kaggle/dataset-metadata.json          |  2 +-
 scripts/_release_common.py                    |  1 -
 scripts/package_hf_release.py                 |  8 ++-
 tests/scripts/test_package_hf_release.py      |  2 +-
 10 files changed, 234 insertions(+), 90 deletions(-)

diff --git a/.agent-plan.md b/.agent-plan.md
index 92ded14..4c6c5fb 100644
--- a/.agent-plan.md
+++ b/.agent-plan.md
@@ -81,16 +81,17 @@ _Source: `docs/external_review/summaries/v1_release_review_synthesis.md` — cro
   - **Label window: `<` → `<=`** (MEDIUM): Fixed in `engine.py`; test updated to use inclusive assertion.
   - **Regenerated all three public tier bundles**: intro 41.5% conv rate, intermediate 20.1%, advanced 7.9%. Validation: PASS — 3 tiers, 5 seeds, 0 leakage findings. 1439 tests pass.
 
-- [ ] **PR 8.2** — `docs(release): difficulty-axis reframe + disclosure hardening`
-  - **Reframe difficulty axis throughout all copy** (HIGH): README, dataset card, Kaggle/HF metadata, tier table, notebook headers. Change from "Intro / Intermediate / Advanced" framing of modelling difficulty to explicit prevalence/noise tier framing. Recommended: "Intro = high-prevalence classroom warm-up; Intermediate = default benchmark; Advanced = low-prevalence, calibration, and noise-handling exercise." AUC is flat across tiers; the "three difficulty tiers" framing is misleading to anyone who reads it as model complexity.
-  - **Add `calibration_max_bin_error` to README calibration table** (HIGH): the advanced tier is at 0.52 max-bin error; the current table shows only Brier, which *improves* with prevalence and actively misleads. One row added.
-  - **Clarify acceptance bands are descriptive regression fences, not realism thresholds** (HIGH): `docs/release/v1_acceptance_gates.md` — the YAML inline comments already say this; the README does not. Small doc edit, large trust impact.
-  - **Fix `isPrivate: true`** (HIGH): `release/kaggle/dataset-metadata.json` — one character; absolute publish blocker.
-  - **Change HF default config to `intro`** (MEDIUM): `release/huggingface/README.md` YAML — `default: true` on the `intermediate` config means `load_dataset("leadforge/...")` with no arguments skips the intro tier. Students should land in the easiest tier by default.
-  - **Remove `intermediate_instructor/` from public README tree** (MEDIUM): the instructor bundle reconstructs the label by construction; listing it in the public-facing "what's inside" tree is a redaction bypass risk. Verify gating is correct and scrub from public copy.
-  - **Elevate 93% account overlap to primary evaluation warning** (MEDIUM): move above the tier table in README; add: "headline metrics are random-split; for production-representative evaluation use `GroupKFold(account_id)` — see Notebook 02."
-  - **Add "non-physical values" to known limitations** (MEDIUM): one bullet: "Advanced-tier noise can make some bounded/time/count-like proxies non-physical (e.g. negative duration values); treat these as synthetic distortion artifacts."
-  - **Reconcile CLAUDE.md canonical package layout** (LOW): delete or annotate aspirational modules that don't exist; add modules that do but are missing from the layout doc.
+- [x] **PR 8.2** — `docs(release): difficulty-axis reframe + disclosure hardening`
+  - **Reframe difficulty axis throughout all copy** (HIGH): README, dataset card, Kaggle/HF metadata, tier table, notebook headers. Tiers now explicitly framed as prevalence/noise axes (Intro=high-prevalence, Intermediate=default benchmark, Advanced=low-prevalence + calibration challenge). Added "Reading this table" note explaining flat AUC. AUC is flat across tiers by design; the "difficulty" framing is gone.
+  - **Add `calibration_max_bin_error` to README calibration table** (HIGH): Advanced tier at 0.52 max-bin error now visible alongside Brier score.
+  - **Clarify acceptance bands are descriptive regression fences, not realism thresholds** (HIGH): added blockquote to `docs/release/v1_acceptance_gates.md` under Performance gates heading.
+  - **Fix `isPrivate: true`** (HIGH): `release/kaggle/dataset-metadata.json` — `isPrivate: false` now; absolute publish blocker resolved.
+  - **Change HF default config to `intro`** (MEDIUM): `DEFAULT_DEFAULT_CONFIG = "intro"` in `scripts/package_hf_release.py`; `release/huggingface/README.md` regenerated with `intro` as default. Students landing on `load_dataset("leadforge/...")` with no args now get the easiest tier.
+  - **Remove `intermediate_instructor/` from public README tree** (MEDIUM): scrubbed from `release/README.md`, `release/huggingface/README.md`, and `SOURCE_TREE_BLOCK` in `scripts/_release_common.py`.
+  - **Elevate 93% account overlap to primary evaluation warning** (MEDIUM): "Evaluation note — account overlap" section added BEFORE tier table in both READMEs.
+  - **Add "non-physical values" to known limitations** (MEDIUM): bullet added to Known limitations section.
+  - **Reconcile CLAUDE.md canonical package layout** (LOW): deferred — not blocking publish.
+  - Preview committed samples updated: `release/_preview_committed/{kaggle,huggingface_public,huggingface_instructor}.html`
   - Labels: `type: docs`, `layer: render`, `layer: validation`
   - Size: M (~250 lines across multiple docs)
 
diff --git a/docs/release/v1_acceptance_gates.md b/docs/release/v1_acceptance_gates.md
index 7890055..f15a8c0 100644
--- a/docs/release/v1_acceptance_gates.md
+++ b/docs/release/v1_acceptance_gates.md
@@ -66,6 +66,15 @@ Bands fitted to the PR 3.3 N=5 sweep on `release/{intro,intermediate,advanced}/`
 All numeric bands live in `v1_acceptance_gates_bands.yaml`; medians and
 rationale follow.
 
+> **These bands are regression fences, not realism thresholds.**
+> They are calibrated to the observed five-seed spread for this DGP and
+> recipe configuration. A band being "wide" does not mean any value within
+> it is equally realistic — it means the validator will not flag a new
+> bundle as broken unless a metric drifts *outside* that window. The medians
+> in each gate note are the meaningful targets; bands only fire on
+> substantial unintended regressions. Tightening the bands is expected work
+> when the DGP is redesigned for v2.
+
 ### Intro tier
 - **G7.1.1** Conversion rate within **[0.24, 0.61]**.  Median 0.4267.
 - **G7.1.2** LR AUC within **[0.82, 0.94]**.  Median 0.8788.
diff --git a/release/README.md b/release/README.md
index f596ce7..10f76c6 100644
--- a/release/README.md
+++ b/release/README.md
@@ -35,7 +35,6 @@ release/
 │   ├── lead_scoring.csv              # flat convenience CSV (all splits)
 │   ├── tables/*.parquet              # 7 snapshot-safe relational tables
 │   └── tasks/converted_within_90_days/{train,valid,test}.parquet
-├── intermediate_instructor/          # research companion: full-horizon tables + metadata/
 ├── docs/                             # vendored DGP / leakage / break-me docs (agent-readable)
 ├── notebooks/                        # 01 baseline · 02 relational · 03 leakage · 04 calibration
 ├── metrics.json                      # top-level cross-tier metrics summary
@@ -109,14 +108,31 @@ exception is `total_touches_all`, the leakage trap — flagged
 `leakage_risk=True` in `feature_dictionary.csv`. Drop it from your
 feature set unless you're demonstrating leakage detection.
 
+## Evaluation note — account overlap
+
+**518 of 557 test accounts (≈93 %) appear in train** on the intermediate
+bundle; the other tiers are similar. The random-split headline metrics
+therefore ride account-level signal across the split boundary and
+over-estimate generalisation to unseen accounts. For a faithful
+out-of-sample number, retrain with `GroupKFold(account_id)` and report
+both metrics. Notebook 02 demonstrates the detection recipe;
+[`break_me_guide.md`](../docs/release/break_me_guide.md) §5 gives
+the worked example.
+
 ## Dataset summary
 
+**Tiers are prevalence and noise axes, not modelling-complexity axes.**
+LR AUC is ~0.88 in every tier by design. The tiers differ in conversion
+rate, missingness, and noise — not rank discrimination. Choose a tier
+based on the teaching exercise, not on expected AUC:
+
 | | Intro | Intermediate | Advanced |
 |---|---|---|---|
+| **Tier purpose** | High-prevalence warm-up | Default benchmark | Low-prevalence · calibration · noise exercise |
 | Leads | 5,000 | 5,000 | 5,000 |
 | Accounts | 1,500 | 1,500 | 1,500 |
 | Contacts | 4,200 | 4,200 | 4,200 |
-| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |
+| Snapshot columns | 31 / 34* | 31 / 34* | 31 / 34* |
 | Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |
 | Conversion rate (acceptance band, gate G7.\*) | 24–61% | 12–31% | 4–12% |
 | Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |
@@ -178,14 +194,22 @@ with bands declared in
 [`docs/release/v1_acceptance_gates_bands.yaml`](../docs/release/v1_acceptance_gates_bands.yaml).
 Headline cross-seed medians (seeds 42–46):
 
-| Tier | LR AUC | AP | P@100 | Brier |
-|---|---|---|---|---|
-| intro | 0.879 | 0.761 | 0.80 | 0.130 |
-| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |
-| advanced | 0.886 | 0.351 | 0.34 | 0.061 |
+| Tier | LR AUC | AP | P@100 | Brier | Cal. max-bin err |
+|---|---|---|---|---|---|
+| intro | 0.879 | 0.761 | 0.80 | 0.130 | 0.25 |
+| intermediate | 0.886 | 0.575 | 0.59 | 0.110 | 0.25 |
+| advanced | 0.886 | 0.351 | 0.34 | 0.061 | **0.52** |
+
+**Reading this table:** LR AUC is flat across tiers by design — the
+tiers are a prevalence / noise axis, not a rank-discrimination axis.
+Brier score *improves* as prevalence falls (a prevalence effect, not
+better calibration); use `calibration_max_bin_error` to assess
+calibration quality. Advanced's 0.52 max-bin error means the model's
+predicted probabilities are materially mis-scaled against actual
+conversion rates — a realistic miscalibration exercise.
 
 AP, P@100, conversion-rate, and lift orderings hold across the
-intended difficulty axis (intro > intermediate > advanced).
+intended prevalence axis (intro > intermediate > advanced).
 
 ## Intended uses
 
@@ -211,9 +235,15 @@ intended difficulty axis (intro > intermediate > advanced).
 
 ## Known limitations
 
-- **Difficulty signal on raw AUC is flat.** LR AUC is ~0.88 across
-  every tier. Difficulty is visible in AP, P@K, Brier, and value
-  capture. Treat AUC as a sanity check, not a difficulty signal.
+- **Tiers are a prevalence / noise axis, not a modelling-complexity
+  axis.** LR AUC is ~0.88 in every tier; the three tiers differ in
+  conversion rate (43% / 22% / 8%), noise scale, and missingness —
+  not in rank discrimination. Use AP, P@K, and calibration metrics
+  to see the difficulty gradient; AUC alone will not show it.
+- **93% account overlap across train / test splits.** Random splits are
+  keyed on lead ID; most test accounts also appear in train. Headline
+  metrics over-state generalisation to unseen accounts. Use
+  `GroupKFold(account_id)` for a faithful estimate.
 - **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta
   is slightly negative in every tier (intro −0.0045, intermediate
   −0.0072, advanced −0.0133); v1's snapshot is dominated by linear
@@ -227,6 +257,13 @@ intended difficulty axis (intro > intermediate > advanced).
 - **Cohort-shift degradation is small.** v1 has no time-of-year drift
   baked in; the cohort-shift gate (G6.4) is informational and will
   bite in v2.
+- **Advanced-tier noise can produce non-physical values.** With
+  `noise_scale=0.55` and `missing_rate=18%`, Gaussian noise injection
+  can yield negative values in count and duration columns before MCAR
+  fill (e.g. a negative `days_since_last_touch`). These are treated
+  as real-world data-messiness artifacts; the snapshot builder clamps
+  them to zero, but some residual distortion is intentional as a
+  data-cleaning exercise.
 
 ## Composition
 
@@ -242,15 +279,8 @@ intended difficulty axis (intro > intermediate > advanced).
   the simulator. Never sampled directly.
 - **Splits.** 70/15/15 train/valid/test, deterministic given seed;
   recorded in `tasks/converted_within_90_days/task_manifest.json`.
-  **Group-leakage warning:** the splitter is keyed on `lead_id` only,
-  not on `account_id` or `contact_id`. On the as-shipped intermediate
-  bundle, **518 of 557 test accounts (≈93 %) also appear in train**;
-  the contact-level overlap is similar in magnitude. A flat baseline
-  trained on the random split rides account-level signal across the
-  split boundary. For a generalisation-faithful number, retrain with
-  `GroupKFold(account_id)` (or `contact_id`) and report both — see
-  [`break_me_guide.md`](../docs/release/break_me_guide.md) §5 for the
-  detection recipe.
+  Splits are keyed on `lead_id`; see the *Evaluation note* above for
+  the account-overlap caveat.
 - **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package
   version stamped in `manifest.json`.
 
diff --git a/release/_preview_committed/huggingface_public.html b/release/_preview_committed/huggingface_public.html
index c71a39e..b7e6aef 100644
--- a/release/_preview_committed/huggingface_public.html
+++ b/release/_preview_committed/huggingface_public.html
@@ -131,7 +131,7 @@ <h1 class="dataset-header__title">LeadForge: Synthetic B2B Lead Scoring (v1)</h1
 <section class="configs">
   <h2 class="section__heading">Configurations / Subsets <span class="section__count">(3 configs)</span></h2>
   <details class="config" open>
-    <summary class="config__name"><code>intro</code> <span class="config__count">(3 splits)</span></summary>
+    <summary class="config__name"><code>intro</code> <span class="badge badge--default">default</span> <span class="config__count">(3 splits)</span></summary>
     <table class="config__table">
       <thead><tr><th>Split</th><th>Path</th></tr></thead>
       <tbody>
@@ -142,7 +142,7 @@ <h2 class="section__heading">Configurations / Subsets <span class="section__coun
     </table>
   </details>
   <details class="config" open>
-    <summary class="config__name"><code>intermediate</code> <span class="badge badge--default">default</span> <span class="config__count">(3 splits)</span></summary>
+    <summary class="config__name"><code>intermediate</code> <span class="config__count">(3 splits)</span></summary>
     <table class="config__table">
       <thead><tr><th>Split</th><th>Path</th></tr></thead>
       <tbody>
@@ -262,7 +262,20 @@ <h2>Quick start</h2>
 exception is <code>total_touches_all</code>, the leakage trap — flagged
 <code>leakage_risk=True</code> in <code>feature_dictionary.csv</code>. Drop it from your
 feature set unless you're demonstrating leakage detection.</p>
+<h2>Evaluation note — account overlap</h2>
+<p><strong>518 of 557 test accounts (≈93 %) appear in train</strong> on the intermediate
+bundle; the other tiers are similar. The random-split headline metrics
+therefore ride account-level signal across the split boundary and
+over-estimate generalisation to unseen accounts. For a faithful
+out-of-sample number, retrain with <code>GroupKFold(account_id)</code> and report
+both metrics. Notebook 02 demonstrates the detection recipe;
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md"><code>break_me_guide.md</code></a> §5 gives
+the worked example.</p>
 <h2>Dataset summary</h2>
+<p><strong>Tiers are prevalence and noise axes, not modelling-complexity axes.</strong>
+LR AUC is ~0.88 in every tier by design. The tiers differ in conversion
+rate, missingness, and noise — not rank discrimination. Choose a tier
+based on the teaching exercise, not on expected AUC:</p>
 <table>
 <thead>
 <tr>
@@ -274,6 +287,12 @@ <h2>Dataset summary</h2>
 </thead>
 <tbody>
 <tr>
+<td><strong>Tier purpose</strong></td>
+<td>High-prevalence warm-up</td>
+<td>Default benchmark</td>
+<td>Low-prevalence · calibration · noise exercise</td>
+</tr>
+<tr>
 <td>Leads</td>
 <td>5,000</td>
 <td>5,000</td>
@@ -293,9 +312,9 @@ <h2>Dataset summary</h2>
 </tr>
 <tr>
 <td>Snapshot columns</td>
-<td>32 / 34*</td>
-<td>32 / 34*</td>
-<td>32 / 34*</td>
+<td>31 / 34*</td>
+<td>31 / 34*</td>
+<td>31 / 34*</td>
 </tr>
 <tr>
 <td>Target</td>
@@ -414,6 +433,7 @@ <h2>Calibration</h2>
 <th>AP</th>
 <th>P@100</th>
 <th>Brier</th>
+<th>Cal. max-bin err</th>
 </tr>
 </thead>
 <tbody>
@@ -423,6 +443,7 @@ <h2>Calibration</h2>
 <td>0.761</td>
 <td>0.80</td>
 <td>0.130</td>
+<td>0.25</td>
 </tr>
 <tr>
 <td>intermediate</td>
@@ -430,6 +451,7 @@ <h2>Calibration</h2>
 <td>0.575</td>
 <td>0.59</td>
 <td>0.110</td>
+<td>0.25</td>
 </tr>
 <tr>
 <td>advanced</td>
@@ -437,11 +459,19 @@ <h2>Calibration</h2>
 <td>0.351</td>
 <td>0.34</td>
 <td>0.061</td>
+<td><strong>0.52</strong></td>
 </tr>
 </tbody>
 </table>
+<p><strong>Reading this table:</strong> LR AUC is flat across tiers by design — the
+tiers are a prevalence / noise axis, not a rank-discrimination axis.
+Brier score <em>improves</em> as prevalence falls (a prevalence effect, not
+better calibration); use <code>calibration_max_bin_error</code> to assess
+calibration quality. Advanced's 0.52 max-bin error means the model's
+predicted probabilities are materially mis-scaled against actual
+conversion rates — a realistic miscalibration exercise.</p>
 <p>AP, P@100, conversion-rate, and lift orderings hold across the
-intended difficulty axis (intro &gt; intermediate &gt; advanced).</p>
+intended prevalence axis (intro &gt; intermediate &gt; advanced).</p>
 <h2>Intended uses</h2>
 <ul>
 <li>Teaching baseline lead-scoring on a flat snapshot.</li>
@@ -466,9 +496,15 @@ <h2>Out-of-scope uses</h2>
 </ul>
 <h2>Known limitations</h2>
 <ul>
-<li><strong>Difficulty signal on raw AUC is flat.</strong> LR AUC is ~0.88 across
-every tier. Difficulty is visible in AP, P@K, Brier, and value
-capture. Treat AUC as a sanity check, not a difficulty signal.</li>
+<li><strong>Tiers are a prevalence / noise axis, not a modelling-complexity
+axis.</strong> LR AUC is ~0.88 in every tier; the three tiers differ in
+conversion rate (43% / 22% / 8%), noise scale, and missingness —
+not in rank discrimination. Use AP, P@K, and calibration metrics
+to see the difficulty gradient; AUC alone will not show it.</li>
+<li><strong>93% account overlap across train / test splits.</strong> Random splits are
+keyed on lead ID; most test accounts also appear in train. Headline
+metrics over-state generalisation to unseen accounts. Use
+<code>GroupKFold(account_id)</code> for a faithful estimate.</li>
 <li><strong>GBM does not consistently beat LR (gate G7.4.4).</strong> GBM−LR AUC delta
 is slightly negative in every tier (intro −0.0045, intermediate
 −0.0072, advanced −0.0133); v1's snapshot is dominated by linear
@@ -482,6 +518,13 @@ <h2>Known limitations</h2>
 <li><strong>Cohort-shift degradation is small.</strong> v1 has no time-of-year drift
 baked in; the cohort-shift gate (G6.4) is informational and will
 bite in v2.</li>
+<li><strong>Advanced-tier noise can produce non-physical values.</strong> With
+<code>noise_scale=0.55</code> and <code>missing_rate=18%</code>, Gaussian noise injection
+can yield negative values in count and duration columns before MCAR
+fill (e.g. a negative <code>days_since_last_touch</code>). These are treated
+as real-world data-messiness artifacts; the snapshot builder clamps
+them to zero, but some residual distortion is intentional as a
+data-cleaning exercise.</li>
 </ul>
 <h2>Composition</h2>
 <ul>
@@ -497,15 +540,8 @@ <h2>Composition</h2>
 the simulator. Never sampled directly.</li>
 <li><strong>Splits.</strong> 70/15/15 train/valid/test, deterministic given seed;
 recorded in <code>tasks/converted_within_90_days/task_manifest.json</code>.
-<strong>Group-leakage warning:</strong> the splitter is keyed on <code>lead_id</code> only,
-not on <code>account_id</code> or <code>contact_id</code>. On the as-shipped intermediate
-bundle, <strong>518 of 557 test accounts (≈93 %) also appear in train</strong>;
-the contact-level overlap is similar in magnitude. A flat baseline
-trained on the random split rides account-level signal across the
-split boundary. For a generalisation-faithful number, retrain with
-<code>GroupKFold(account_id)</code> (or <code>contact_id</code>) and report both — see
-<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md"><code>break_me_guide.md</code></a> §5 for the
-detection recipe.</li>
+Splits are keyed on <code>lead_id</code>; see the <em>Evaluation note</em> above for
+the account-overlap caveat.</li>
 <li><strong>Provenance.</strong> Recipe <code>b2b_saas_procurement_v1</code>, seed 42, package
 version stamped in <code>manifest.json</code>.</li>
 </ul>
diff --git a/release/_preview_committed/kaggle.html b/release/_preview_committed/kaggle.html
index 58be1bb..a941c49 100644
--- a/release/_preview_committed/kaggle.html
+++ b/release/_preview_committed/kaggle.html
@@ -246,7 +246,20 @@ <h2>Quick start</h2>
 exception is <code>total_touches_all</code>, the leakage trap — flagged
 <code>leakage_risk=True</code> in <code>feature_dictionary.csv</code>. Drop it from your
 feature set unless you're demonstrating leakage detection.</p>
+<h2>Evaluation note — account overlap</h2>
+<p><strong>518 of 557 test accounts (≈93 %) appear in train</strong> on the intermediate
+bundle; the other tiers are similar. The random-split headline metrics
+therefore ride account-level signal across the split boundary and
+over-estimate generalisation to unseen accounts. For a faithful
+out-of-sample number, retrain with <code>GroupKFold(account_id)</code> and report
+both metrics. Notebook 02 demonstrates the detection recipe;
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md"><code>break_me_guide.md</code></a> §5 gives
+the worked example.</p>
 <h2>Dataset summary</h2>
+<p><strong>Tiers are prevalence and noise axes, not modelling-complexity axes.</strong>
+LR AUC is ~0.88 in every tier by design. The tiers differ in conversion
+rate, missingness, and noise — not rank discrimination. Choose a tier
+based on the teaching exercise, not on expected AUC:</p>
 <table>
 <thead>
 <tr>
@@ -258,6 +271,12 @@ <h2>Dataset summary</h2>
 </thead>
 <tbody>
 <tr>
+<td><strong>Tier purpose</strong></td>
+<td>High-prevalence warm-up</td>
+<td>Default benchmark</td>
+<td>Low-prevalence · calibration · noise exercise</td>
+</tr>
+<tr>
 <td>Leads</td>
 <td>5,000</td>
 <td>5,000</td>
@@ -277,9 +296,9 @@ <h2>Dataset summary</h2>
 </tr>
 <tr>
 <td>Snapshot columns</td>
-<td>32 / 34*</td>
-<td>32 / 34*</td>
-<td>32 / 34*</td>
+<td>31 / 34*</td>
+<td>31 / 34*</td>
+<td>31 / 34*</td>
 </tr>
 <tr>
 <td>Target</td>
@@ -398,6 +417,7 @@ <h2>Calibration</h2>
 <th>AP</th>
 <th>P@100</th>
 <th>Brier</th>
+<th>Cal. max-bin err</th>
 </tr>
 </thead>
 <tbody>
@@ -407,6 +427,7 @@ <h2>Calibration</h2>
 <td>0.761</td>
 <td>0.80</td>
 <td>0.130</td>
+<td>0.25</td>
 </tr>
 <tr>
 <td>intermediate</td>
@@ -414,6 +435,7 @@ <h2>Calibration</h2>
 <td>0.575</td>
 <td>0.59</td>
 <td>0.110</td>
+<td>0.25</td>
 </tr>
 <tr>
 <td>advanced</td>
@@ -421,11 +443,19 @@ <h2>Calibration</h2>
 <td>0.351</td>
 <td>0.34</td>
 <td>0.061</td>
+<td><strong>0.52</strong></td>
 </tr>
 </tbody>
 </table>
+<p><strong>Reading this table:</strong> LR AUC is flat across tiers by design — the
+tiers are a prevalence / noise axis, not a rank-discrimination axis.
+Brier score <em>improves</em> as prevalence falls (a prevalence effect, not
+better calibration); use <code>calibration_max_bin_error</code> to assess
+calibration quality. Advanced's 0.52 max-bin error means the model's
+predicted probabilities are materially mis-scaled against actual
+conversion rates — a realistic miscalibration exercise.</p>
 <p>AP, P@100, conversion-rate, and lift orderings hold across the
-intended difficulty axis (intro &gt; intermediate &gt; advanced).</p>
+intended prevalence axis (intro &gt; intermediate &gt; advanced).</p>
 <h2>Intended uses</h2>
 <ul>
 <li>Teaching baseline lead-scoring on a flat snapshot.</li>
@@ -450,9 +480,15 @@ <h2>Out-of-scope uses</h2>
 </ul>
 <h2>Known limitations</h2>
 <ul>
-<li><strong>Difficulty signal on raw AUC is flat.</strong> LR AUC is ~0.88 across
-every tier. Difficulty is visible in AP, P@K, Brier, and value
-capture. Treat AUC as a sanity check, not a difficulty signal.</li>
+<li><strong>Tiers are a prevalence / noise axis, not a modelling-complexity
+axis.</strong> LR AUC is ~0.88 in every tier; the three tiers differ in
+conversion rate (43% / 22% / 8%), noise scale, and missingness —
+not in rank discrimination. Use AP, P@K, and calibration metrics
+to see the difficulty gradient; AUC alone will not show it.</li>
+<li><strong>93% account overlap across train / test splits.</strong> Random splits are
+keyed on lead ID; most test accounts also appear in train. Headline
+metrics over-state generalisation to unseen accounts. Use
+<code>GroupKFold(account_id)</code> for a faithful estimate.</li>
 <li><strong>GBM does not consistently beat LR (gate G7.4.4).</strong> GBM−LR AUC delta
 is slightly negative in every tier (intro −0.0045, intermediate
 −0.0072, advanced −0.0133); v1's snapshot is dominated by linear
@@ -466,6 +502,13 @@ <h2>Known limitations</h2>
 <li><strong>Cohort-shift degradation is small.</strong> v1 has no time-of-year drift
 baked in; the cohort-shift gate (G6.4) is informational and will
 bite in v2.</li>
+<li><strong>Advanced-tier noise can produce non-physical values.</strong> With
+<code>noise_scale=0.55</code> and <code>missing_rate=18%</code>, Gaussian noise injection
+can yield negative values in count and duration columns before MCAR
+fill (e.g. a negative <code>days_since_last_touch</code>). These are treated
+as real-world data-messiness artifacts; the snapshot builder clamps
+them to zero, but some residual distortion is intentional as a
+data-cleaning exercise.</li>
 </ul>
 <h2>Composition</h2>
 <ul>
@@ -481,15 +524,8 @@ <h2>Composition</h2>
 the simulator. Never sampled directly.</li>
 <li><strong>Splits.</strong> 70/15/15 train/valid/test, deterministic given seed;
 recorded in <code>tasks/converted_within_90_days/task_manifest.json</code>.
-<strong>Group-leakage warning:</strong> the splitter is keyed on <code>lead_id</code> only,
-not on <code>account_id</code> or <code>contact_id</code>. On the as-shipped intermediate
-bundle, <strong>518 of 557 test accounts (≈93 %) also appear in train</strong>;
-the contact-level overlap is similar in magnitude. A flat baseline
-trained on the random split rides account-level signal across the
-split boundary. For a generalisation-faithful number, retrain with
-<code>GroupKFold(account_id)</code> (or <code>contact_id</code>) and report both — see
-<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md"><code>break_me_guide.md</code></a> §5 for the
-detection recipe.</li>
+Splits are keyed on <code>lead_id</code>; see the <em>Evaluation note</em> above for
+the account-overlap caveat.</li>
 <li><strong>Provenance.</strong> Recipe <code>b2b_saas_procurement_v1</code>, seed 42, package
 version stamped in <code>manifest.json</code>.</li>
 </ul>
diff --git a/release/huggingface/README.md b/release/huggingface/README.md
index e8fe2bc..3be25fa 100644
--- a/release/huggingface/README.md
+++ b/release/huggingface/README.md
@@ -17,6 +17,7 @@ tags:
   - tabular
 configs:
   - config_name: intro
+    default: true
     data_files:
       - split: train
         path: intro/tasks/converted_within_90_days/train.parquet
@@ -25,7 +26,6 @@ configs:
       - split: test
         path: intro/tasks/converted_within_90_days/test.parquet
   - config_name: intermediate
-    default: true
     data_files:
       - split: train
         path: intermediate/tasks/converted_within_90_days/train.parquet
@@ -154,14 +154,31 @@ exception is `total_touches_all`, the leakage trap — flagged
 `leakage_risk=True` in `feature_dictionary.csv`. Drop it from your
 feature set unless you're demonstrating leakage detection.
 
+## Evaluation note — account overlap
+
+**518 of 557 test accounts (≈93 %) appear in train** on the intermediate
+bundle; the other tiers are similar. The random-split headline metrics
+therefore ride account-level signal across the split boundary and
+over-estimate generalisation to unseen accounts. For a faithful
+out-of-sample number, retrain with `GroupKFold(account_id)` and report
+both metrics. Notebook 02 demonstrates the detection recipe;
+[`break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) §5 gives
+the worked example.
+
 ## Dataset summary
 
+**Tiers are prevalence and noise axes, not modelling-complexity axes.**
+LR AUC is ~0.88 in every tier by design. The tiers differ in conversion
+rate, missingness, and noise — not rank discrimination. Choose a tier
+based on the teaching exercise, not on expected AUC:
+
 | | Intro | Intermediate | Advanced |
 |---|---|---|---|
+| **Tier purpose** | High-prevalence warm-up | Default benchmark | Low-prevalence · calibration · noise exercise |
 | Leads | 5,000 | 5,000 | 5,000 |
 | Accounts | 1,500 | 1,500 | 1,500 |
 | Contacts | 4,200 | 4,200 | 4,200 |
-| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |
+| Snapshot columns | 31 / 34* | 31 / 34* | 31 / 34* |
 | Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |
 | Conversion rate (acceptance band, gate G7.\*) | 24–61% | 12–31% | 4–12% |
 | Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |
@@ -223,14 +240,22 @@ with bands declared in
 [`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).
 Headline cross-seed medians (seeds 42–46):
 
-| Tier | LR AUC | AP | P@100 | Brier |
-|---|---|---|---|---|
-| intro | 0.879 | 0.761 | 0.80 | 0.130 |
-| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |
-| advanced | 0.886 | 0.351 | 0.34 | 0.061 |
+| Tier | LR AUC | AP | P@100 | Brier | Cal. max-bin err |
+|---|---|---|---|---|---|
+| intro | 0.879 | 0.761 | 0.80 | 0.130 | 0.25 |
+| intermediate | 0.886 | 0.575 | 0.59 | 0.110 | 0.25 |
+| advanced | 0.886 | 0.351 | 0.34 | 0.061 | **0.52** |
+
+**Reading this table:** LR AUC is flat across tiers by design — the
+tiers are a prevalence / noise axis, not a rank-discrimination axis.
+Brier score *improves* as prevalence falls (a prevalence effect, not
+better calibration); use `calibration_max_bin_error` to assess
+calibration quality. Advanced's 0.52 max-bin error means the model's
+predicted probabilities are materially mis-scaled against actual
+conversion rates — a realistic miscalibration exercise.
 
 AP, P@100, conversion-rate, and lift orderings hold across the
-intended difficulty axis (intro > intermediate > advanced).
+intended prevalence axis (intro > intermediate > advanced).
 
 ## Intended uses
 
@@ -256,9 +281,15 @@ intended difficulty axis (intro > intermediate > advanced).
 
 ## Known limitations
 
-- **Difficulty signal on raw AUC is flat.** LR AUC is ~0.88 across
-  every tier. Difficulty is visible in AP, P@K, Brier, and value
-  capture. Treat AUC as a sanity check, not a difficulty signal.
+- **Tiers are a prevalence / noise axis, not a modelling-complexity
+  axis.** LR AUC is ~0.88 in every tier; the three tiers differ in
+  conversion rate (43% / 22% / 8%), noise scale, and missingness —
+  not in rank discrimination. Use AP, P@K, and calibration metrics
+  to see the difficulty gradient; AUC alone will not show it.
+- **93% account overlap across train / test splits.** Random splits are
+  keyed on lead ID; most test accounts also appear in train. Headline
+  metrics over-state generalisation to unseen accounts. Use
+  `GroupKFold(account_id)` for a faithful estimate.
 - **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta
   is slightly negative in every tier (intro −0.0045, intermediate
   −0.0072, advanced −0.0133); v1's snapshot is dominated by linear
@@ -272,6 +303,13 @@ intended difficulty axis (intro > intermediate > advanced).
 - **Cohort-shift degradation is small.** v1 has no time-of-year drift
   baked in; the cohort-shift gate (G6.4) is informational and will
   bite in v2.
+- **Advanced-tier noise can produce non-physical values.** With
+  `noise_scale=0.55` and `missing_rate=18%`, Gaussian noise injection
+  can yield negative values in count and duration columns before MCAR
+  fill (e.g. a negative `days_since_last_touch`). These are treated
+  as real-world data-messiness artifacts; the snapshot builder clamps
+  them to zero, but some residual distortion is intentional as a
+  data-cleaning exercise.
 
 ## Composition
 
@@ -287,15 +325,8 @@ intended difficulty axis (intro > intermediate > advanced).
   the simulator. Never sampled directly.
 - **Splits.** 70/15/15 train/valid/test, deterministic given seed;
   recorded in `tasks/converted_within_90_days/task_manifest.json`.
-  **Group-leakage warning:** the splitter is keyed on `lead_id` only,
-  not on `account_id` or `contact_id`. On the as-shipped intermediate
-  bundle, **518 of 557 test accounts (≈93 %) also appear in train**;
-  the contact-level overlap is similar in magnitude. A flat baseline
-  trained on the random split rides account-level signal across the
-  split boundary. For a generalisation-faithful number, retrain with
-  `GroupKFold(account_id)` (or `contact_id`) and report both — see
-  [`break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) §5 for the
-  detection recipe.
+  Splits are keyed on `lead_id`; see the *Evaluation note* above for
+  the account-overlap caveat.
 - **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package
   version stamped in `manifest.json`.
 
diff --git a/release/kaggle/dataset-metadata.json b/release/kaggle/dataset-metadata.json
index 3aee8a7..4036e83 100644
--- a/release/kaggle/dataset-metadata.json
+++ b/release/kaggle/dataset-metadata.json
@@ -1,6 +1,6 @@
 {
   "collaborators": [],
-  "description": "# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)\n\nA relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Generated by\n[leadforge](https://github.com/leadforge-dev/leadforge), an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at `1.x`; the dataset is published under the explicit `…-v1`\ntag.\n\n## Why lead scoring matters in 2024–2026\n\nMid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising[^macro], so predicting *which* leads\nconvert within a fixed window has moved from a marketing nicety to a\nsurvival skill. This dataset teaches that skill on a relational\nsubstrate, with the realistic confusions (snapshot-window discipline,\nleakage traps, channel signal weaker than vendor blogs imply) that\nstudents will hit when they finally get hands on real CRM data.\n\n[^macro]: Macroeconomic framing summarised in\n[`docs/external_review/summaries/gemini_v2_summary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md)\n(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio\nrose materially in 2024).\n\n## What's inside\n\n```\n.\n├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier\n│   ├── manifest.json                 # provenance + file hashes\n│   ├── metrics.json                  # per-tier headline metrics (medians + spreads)\n│   ├── dataset_card.md               # auto-rendered per-bundle card\n│   ├── feature_dictionary.csv        # authoritative column spec\n│   ├── lead_scoring.csv              # flat convenience CSV (all splits)\n│   ├── tables/*.parquet              # 7 snapshot-safe relational tables\n│   └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── docs/                             # vendored DGP / leakage / break-me docs (agent-readable)\n├── metrics.json                      # top-level cross-tier metrics summary\n├── claims_register.{md,json}         # claims → backing-artifact map (agent-readable)\n├── dataset-metadata.json             # Kaggle dataset metadata\n├── dataset-cover-image.png           # Kaggle cover image\n├── README.md                         # Kaggle package README\n└── LICENSE\n```\n\n`student_public` bundles ship the snapshot-safe relational view;\n`research_instructor` companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder `metadata/`. The full layout is documented in each bundle's\n`manifest.json`.\n\n### Agent-reviewable artifacts\n\nThe published bundle is self-contained for AI review and offline\nauditing — every numeric / structural claim on this page can be\nverified without following an external link:\n\n- **`metrics.json` (root) + `<tier>/metrics.json`** — deterministic\n  JSON view of the headline LR AUC / AP / P@100 / Brier / conversion\n  rate / cohort-shift / cross-tier-ordering medians, with JSON-path\n  back-references to `validation/validation_report.json` (the\n  source of truth).\n- **`claims_register.{md,json}`** — every numerical or structural\n  claim on this page paired with the artifact and path that backs it.\n  Rendered from `claims_register_source.yaml` by\n  `scripts/build_claims_register.py`.\n- **`docs/`** — vendored copies of `generation_method.md`,\n  `channel_signal_audit.md`, `break_me_guide.md`,\n  `feature_dictionary.md`, `v1_acceptance_gates_bands.yaml`,\n  `v2_decision_log.md`, plus a hand-authored\n  `relational_table_schemas.csv` documenting every column of every\n  relational table.  These match the GitHub-blob links cited below but\n  ship inside the bundle so a reviewer never needs network access.\n- **`<tier>/manifest.json`** — SHA-256 hash for every file plus the\n  full redaction contract (`structural_redactions.columns`,\n  `omitted_tables`, `relational_snapshot_safe`, `snapshot_day`).\n- Kaggle / HuggingFace preview pages additionally inject a\n  `schema.org/Dataset` JSON-LD block in their `<head>` for agent\n  ingestion without HTML parsing.\n\n## Quick start\n\n```python\n# Flat CSV\ndf = pd.read_csv(\"intermediate/lead_scoring.csv\")\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/train.parquet\")\ntest  = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/test.parquet\")\n\n# Relational tables (feature engineering — example)\nleads   = pd.read_parquet(\"intermediate/tables/leads.parquet\")\ntouches = pd.read_parquet(\"intermediate/tables/touches.parquet\")\nmy_touch_count = (\n    touches.groupby(\"lead_id\").size().rename(\"my_touch_count\").reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=\"lead_id\", how=\"left\")\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n#                    --mode student_public --difficulty intermediate --out my_bundle\n```\n\nThe label `converted_within_90_days` resolves over a 90-day window;\nengagement features (`touch_count`, `session_count`, etc.) are\ncomputed strictly over events on days `[0, 30]`. The deliberate\nexception is `total_touches_all`, the leakage trap — flagged\n`leakage_risk=True` in `feature_dictionary.csv`. Drop it from your\nfeature set unless you're demonstrating leakage detection.\n\n## Dataset summary\n\n| | Intro | Intermediate | Advanced |\n|---|---|---|---|\n| Leads | 5,000 | 5,000 | 5,000 |\n| Accounts | 1,500 | 1,500 | 1,500 |\n| Contacts | 4,200 | 4,200 | 4,200 |\n| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |\n| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |\n| Conversion rate (acceptance band, gate G7.\\*) | 24–61% | 12–31% | 4–12% |\n| Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |\n| Signal strength | 0.90 | 0.70 | 0.50 |\n| Noise scale | 0.10 | 0.30 | 0.55 |\n| Missing rate | 2% | 8% | 18% |\n\n\\* `student_public` / `research_instructor`. Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping. The acceptance band is the recipe\ngate's tolerance window (`v1_acceptance_gates_bands.yaml` G7.\\*),\nnot the achievable range — observed five-seed spreads sit\ncomfortably inside the band.\n\n## The scenario\n\n**Veridian Technologies** is a fictional Series B startup (Austin, US)\nselling **Veridian Procure**, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). **Task:** predict whether\na lead converts (`closed_won`) within 90 days. ACV bands are\n$18k–$120k. See\n[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)\nfor the full DGP, and the deeper \"what's modelled / approximate / not\nmodelled\" breakdown that this README only summarises.\n\n## Public vs instructor: what's redacted\n\nFiltering happens **during rendering**, not during simulation. The\nredaction contract is single-sourced in\n[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.\n\n| Source-of-truth constant | Public bundle treatment |\n|---|---|\n| `BANNED_LEAD_COLUMNS = (\"converted_within_90_days\", \"conversion_timestamp\")` | Dropped from `tables/leads.parquet` |\n| `BANNED_OPP_COLUMNS = (\"close_outcome\", \"closed_at\")` | Dropped from `tables/opportunities.parquet` |\n| `BANNED_TABLES = (\"customers\", \"subscriptions\")` | Omitted from public bundles |\n| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |\n| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |\n| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |\n\nEach bundle's `manifest.json` records `relational_snapshot_safe`,\n`redacted_columns`, and `snapshot_day`, so the bundle is\nself-describing.\n\n## Calibration\n\nEvery realism / calibration / difficulty claim in this README is\nbacked by\n[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),\nregenerated by\n[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)\nwith bands declared in\n[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).\nHeadline cross-seed medians (seeds 42–46):\n\n| Tier | LR AUC | AP | P@100 | Brier |\n|---|---|---|---|---|\n| intro | 0.879 | 0.761 | 0.80 | 0.130 |\n| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |\n| advanced | 0.886 | 0.351 | 0.34 | 0.061 |\n\nAP, P@100, conversion-rate, and lift orderings hold across the\nintended difficulty axis (intro > intermediate > advanced).\n\n## Intended uses\n\n- Teaching baseline lead-scoring on a flat snapshot.\n- Teaching relational feature engineering against snapshot-safe tables.\n- Teaching leakage detection (the `total_touches_all` trap is\n  designed to be discoverable).\n- Teaching calibration, lift, P@K, value-aware ranking\n  (`expected_acv × P(convert)`), and cohort-shift evaluation.\n- Comparing model families under a controlled DGP.\n\n## Out-of-scope uses\n\n- **Production lead scoring.** The company, product, and customers are\n  fictional.\n- **Vendor benchmarking / paper baselines.** Difficulty tiers are\n  calibrated for pedagogy, not cross-paper comparability.\n- **Causal-inference research that requires recovery of the true DGP.**\n  The instructor companion exposes the hidden graph for teaching, not\n  designed counterfactuals.\n- **Demographic / fairness research.** v1 does not model protected\n  attributes.\n\n## Known limitations\n\n- **Difficulty signal on raw AUC is flat.** LR AUC is ~0.88 across\n  every tier. Difficulty is visible in AP, P@K, Brier, and value\n  capture. Treat AUC as a sanity check, not a difficulty signal.\n- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta\n  is slightly negative in every tier (intro −0.0045, intermediate\n  −0.0072, advanced −0.0133); v1's snapshot is dominated by linear\n  features. v2 will inject non-linear interactions in the simulator.\n- **Channel signal is weak.** Per\n  [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),\n  out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across\n  all tiers and the per-channel rate spread is ≤0.05. The simulator\n  does not encode channel-conditional probabilities; channel-conditional\n  encoding is post-v1 work.\n- **Cohort-shift degradation is small.** v1 has no time-of-year drift\n  baked in; the cohort-shift gate (G6.4) is informational and will\n  bite in v2.\n\n## Composition\n\n- **Entities.** Accounts, contacts, leads, touches, sessions,\n  sales_activities, opportunities (public); plus customers and\n  subscriptions (instructor only). Per-row counts per bundle live in\n  `manifest.json`.\n- **Features.** 32 public columns grouped by analytical role in\n  [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);\n  the per-bundle `feature_dictionary.csv` is the authoritative\n  machine-readable spec.\n- **Label.** `converted_within_90_days` (boolean), event-derived from\n  the simulator. Never sampled directly.\n- **Splits.** 70/15/15 train/valid/test, deterministic given seed;\n  recorded in `tasks/converted_within_90_days/task_manifest.json`.\n  **Group-leakage warning:** the splitter is keyed on `lead_id` only,\n  not on `account_id` or `contact_id`. On the as-shipped intermediate\n  bundle, **518 of 557 test accounts (≈93 %) also appear in train**;\n  the contact-level overlap is similar in magnitude. A flat baseline\n  trained on the random split rides account-level signal across the\n  split boundary. For a generalisation-faithful number, retrain with\n  `GroupKFold(account_id)` (or `contact_id`) and report both — see\n  [`break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) §5 for the\n  detection recipe.\n- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package\n  version stamped in `manifest.json`.\n\n## Maintenance, adversarial framing, license\n\nWe *want* the dataset to be broken. The\n[break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues\nnine adversarial patterns to look for (leakage, split\ncontamination, ranking inversions, calibration drift) with\nworked-example pointers back into the notebooks. Issue\ntemplates ship under `.github/ISSUE_TEMPLATE/`: a\n[breakage report](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml)\nform for findings on the bundle itself, and a\n[realism feedback](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml)\nform for distributional critiques. Accepted findings are\nlogged in\n[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md).\nFile issues at\n[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);\nPRs welcome.\n\n| Field | Value |\n|---|---|\n| Generator | leadforge `1.0.0+` |\n| Recipe | `b2b_saas_procurement_v1` |\n| Canonical seed | 42 (cross-seed sweep: 42–46) |\n| Bundle schema version | 5 |\n| Format | Parquet (canonical) + CSV (convenience) |\n| License | MIT — see [LICENSE](LICENSE) |\n\nVerify integrity with `leadforge validate <bundle_dir>`; every file\nis hashed in `manifest.json`.\n",
+  "description": "# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)\n\nA relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Generated by\n[leadforge](https://github.com/leadforge-dev/leadforge), an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at `1.x`; the dataset is published under the explicit `…-v1`\ntag.\n\n## Why lead scoring matters in 2024–2026\n\nMid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising[^macro], so predicting *which* leads\nconvert within a fixed window has moved from a marketing nicety to a\nsurvival skill. This dataset teaches that skill on a relational\nsubstrate, with the realistic confusions (snapshot-window discipline,\nleakage traps, channel signal weaker than vendor blogs imply) that\nstudents will hit when they finally get hands on real CRM data.\n\n[^macro]: Macroeconomic framing summarised in\n[`docs/external_review/summaries/gemini_v2_summary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md)\n(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio\nrose materially in 2024).\n\n## What's inside\n\n```\n.\n├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier\n│   ├── manifest.json                 # provenance + file hashes\n│   ├── metrics.json                  # per-tier headline metrics (medians + spreads)\n│   ├── dataset_card.md               # auto-rendered per-bundle card\n│   ├── feature_dictionary.csv        # authoritative column spec\n│   ├── lead_scoring.csv              # flat convenience CSV (all splits)\n│   ├── tables/*.parquet              # 7 snapshot-safe relational tables\n│   └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── docs/                             # vendored DGP / leakage / break-me docs (agent-readable)\n├── metrics.json                      # top-level cross-tier metrics summary\n├── claims_register.{md,json}         # claims → backing-artifact map (agent-readable)\n├── dataset-metadata.json             # Kaggle dataset metadata\n├── dataset-cover-image.png           # Kaggle cover image\n├── README.md                         # Kaggle package README\n└── LICENSE\n```\n\n`student_public` bundles ship the snapshot-safe relational view;\n`research_instructor` companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder `metadata/`. The full layout is documented in each bundle's\n`manifest.json`.\n\n### Agent-reviewable artifacts\n\nThe published bundle is self-contained for AI review and offline\nauditing — every numeric / structural claim on this page can be\nverified without following an external link:\n\n- **`metrics.json` (root) + `<tier>/metrics.json`** — deterministic\n  JSON view of the headline LR AUC / AP / P@100 / Brier / conversion\n  rate / cohort-shift / cross-tier-ordering medians, with JSON-path\n  back-references to `validation/validation_report.json` (the\n  source of truth).\n- **`claims_register.{md,json}`** — every numerical or structural\n  claim on this page paired with the artifact and path that backs it.\n  Rendered from `claims_register_source.yaml` by\n  `scripts/build_claims_register.py`.\n- **`docs/`** — vendored copies of `generation_method.md`,\n  `channel_signal_audit.md`, `break_me_guide.md`,\n  `feature_dictionary.md`, `v1_acceptance_gates_bands.yaml`,\n  `v2_decision_log.md`, plus a hand-authored\n  `relational_table_schemas.csv` documenting every column of every\n  relational table.  These match the GitHub-blob links cited below but\n  ship inside the bundle so a reviewer never needs network access.\n- **`<tier>/manifest.json`** — SHA-256 hash for every file plus the\n  full redaction contract (`structural_redactions.columns`,\n  `omitted_tables`, `relational_snapshot_safe`, `snapshot_day`).\n- Kaggle / HuggingFace preview pages additionally inject a\n  `schema.org/Dataset` JSON-LD block in their `<head>` for agent\n  ingestion without HTML parsing.\n\n## Quick start\n\n```python\n# Flat CSV\ndf = pd.read_csv(\"intermediate/lead_scoring.csv\")\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/train.parquet\")\ntest  = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/test.parquet\")\n\n# Relational tables (feature engineering — example)\nleads   = pd.read_parquet(\"intermediate/tables/leads.parquet\")\ntouches = pd.read_parquet(\"intermediate/tables/touches.parquet\")\nmy_touch_count = (\n    touches.groupby(\"lead_id\").size().rename(\"my_touch_count\").reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=\"lead_id\", how=\"left\")\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n#                    --mode student_public --difficulty intermediate --out my_bundle\n```\n\nThe label `converted_within_90_days` resolves over a 90-day window;\nengagement features (`touch_count`, `session_count`, etc.) are\ncomputed strictly over events on days `[0, 30]`. The deliberate\nexception is `total_touches_all`, the leakage trap — flagged\n`leakage_risk=True` in `feature_dictionary.csv`. Drop it from your\nfeature set unless you're demonstrating leakage detection.\n\n## Evaluation note — account overlap\n\n**518 of 557 test accounts (≈93 %) appear in train** on the intermediate\nbundle; the other tiers are similar. The random-split headline metrics\ntherefore ride account-level signal across the split boundary and\nover-estimate generalisation to unseen accounts. For a faithful\nout-of-sample number, retrain with `GroupKFold(account_id)` and report\nboth metrics. Notebook 02 demonstrates the detection recipe;\n[`break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) §5 gives\nthe worked example.\n\n## Dataset summary\n\n**Tiers are prevalence and noise axes, not modelling-complexity axes.**\nLR AUC is ~0.88 in every tier by design. The tiers differ in conversion\nrate, missingness, and noise — not rank discrimination. Choose a tier\nbased on the teaching exercise, not on expected AUC:\n\n| | Intro | Intermediate | Advanced |\n|---|---|---|---|\n| **Tier purpose** | High-prevalence warm-up | Default benchmark | Low-prevalence · calibration · noise exercise |\n| Leads | 5,000 | 5,000 | 5,000 |\n| Accounts | 1,500 | 1,500 | 1,500 |\n| Contacts | 4,200 | 4,200 | 4,200 |\n| Snapshot columns | 31 / 34* | 31 / 34* | 31 / 34* |\n| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |\n| Conversion rate (acceptance band, gate G7.\\*) | 24–61% | 12–31% | 4–12% |\n| Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |\n| Signal strength | 0.90 | 0.70 | 0.50 |\n| Noise scale | 0.10 | 0.30 | 0.55 |\n| Missing rate | 2% | 8% | 18% |\n\n\\* `student_public` / `research_instructor`. Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping. The acceptance band is the recipe\ngate's tolerance window (`v1_acceptance_gates_bands.yaml` G7.\\*),\nnot the achievable range — observed five-seed spreads sit\ncomfortably inside the band.\n\n## The scenario\n\n**Veridian Technologies** is a fictional Series B startup (Austin, US)\nselling **Veridian Procure**, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). **Task:** predict whether\na lead converts (`closed_won`) within 90 days. ACV bands are\n$18k–$120k. See\n[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)\nfor the full DGP, and the deeper \"what's modelled / approximate / not\nmodelled\" breakdown that this README only summarises.\n\n## Public vs instructor: what's redacted\n\nFiltering happens **during rendering**, not during simulation. The\nredaction contract is single-sourced in\n[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.\n\n| Source-of-truth constant | Public bundle treatment |\n|---|---|\n| `BANNED_LEAD_COLUMNS = (\"converted_within_90_days\", \"conversion_timestamp\")` | Dropped from `tables/leads.parquet` |\n| `BANNED_OPP_COLUMNS = (\"close_outcome\", \"closed_at\")` | Dropped from `tables/opportunities.parquet` |\n| `BANNED_TABLES = (\"customers\", \"subscriptions\")` | Omitted from public bundles |\n| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |\n| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |\n| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |\n\nEach bundle's `manifest.json` records `relational_snapshot_safe`,\n`redacted_columns`, and `snapshot_day`, so the bundle is\nself-describing.\n\n## Calibration\n\nEvery realism / calibration / difficulty claim in this README is\nbacked by\n[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),\nregenerated by\n[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)\nwith bands declared in\n[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).\nHeadline cross-seed medians (seeds 42–46):\n\n| Tier | LR AUC | AP | P@100 | Brier | Cal. max-bin err |\n|---|---|---|---|---|---|\n| intro | 0.879 | 0.761 | 0.80 | 0.130 | 0.25 |\n| intermediate | 0.886 | 0.575 | 0.59 | 0.110 | 0.25 |\n| advanced | 0.886 | 0.351 | 0.34 | 0.061 | **0.52** |\n\n**Reading this table:** LR AUC is flat across tiers by design — the\ntiers are a prevalence / noise axis, not a rank-discrimination axis.\nBrier score *improves* as prevalence falls (a prevalence effect, not\nbetter calibration); use `calibration_max_bin_error` to assess\ncalibration quality. Advanced's 0.52 max-bin error means the model's\npredicted probabilities are materially mis-scaled against actual\nconversion rates — a realistic miscalibration exercise.\n\nAP, P@100, conversion-rate, and lift orderings hold across the\nintended prevalence axis (intro > intermediate > advanced).\n\n## Intended uses\n\n- Teaching baseline lead-scoring on a flat snapshot.\n- Teaching relational feature engineering against snapshot-safe tables.\n- Teaching leakage detection (the `total_touches_all` trap is\n  designed to be discoverable).\n- Teaching calibration, lift, P@K, value-aware ranking\n  (`expected_acv × P(convert)`), and cohort-shift evaluation.\n- Comparing model families under a controlled DGP.\n\n## Out-of-scope uses\n\n- **Production lead scoring.** The company, product, and customers are\n  fictional.\n- **Vendor benchmarking / paper baselines.** Difficulty tiers are\n  calibrated for pedagogy, not cross-paper comparability.\n- **Causal-inference research that requires recovery of the true DGP.**\n  The instructor companion exposes the hidden graph for teaching, not\n  designed counterfactuals.\n- **Demographic / fairness research.** v1 does not model protected\n  attributes.\n\n## Known limitations\n\n- **Tiers are a prevalence / noise axis, not a modelling-complexity\n  axis.** LR AUC is ~0.88 in every tier; the three tiers differ in\n  conversion rate (43% / 22% / 8%), noise scale, and missingness —\n  not in rank discrimination. Use AP, P@K, and calibration metrics\n  to see the difficulty gradient; AUC alone will not show it.\n- **93% account overlap across train / test splits.** Random splits are\n  keyed on lead ID; most test accounts also appear in train. Headline\n  metrics over-state generalisation to unseen accounts. Use\n  `GroupKFold(account_id)` for a faithful estimate.\n- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta\n  is slightly negative in every tier (intro −0.0045, intermediate\n  −0.0072, advanced −0.0133); v1's snapshot is dominated by linear\n  features. v2 will inject non-linear interactions in the simulator.\n- **Channel signal is weak.** Per\n  [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),\n  out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across\n  all tiers and the per-channel rate spread is ≤0.05. The simulator\n  does not encode channel-conditional probabilities; channel-conditional\n  encoding is post-v1 work.\n- **Cohort-shift degradation is small.** v1 has no time-of-year drift\n  baked in; the cohort-shift gate (G6.4) is informational and will\n  bite in v2.\n- **Advanced-tier noise can produce non-physical values.** With\n  `noise_scale=0.55` and `missing_rate=18%`, Gaussian noise injection\n  can yield negative values in count and duration columns before MCAR\n  fill (e.g. a negative `days_since_last_touch`). These are treated\n  as real-world data-messiness artifacts; the snapshot builder clamps\n  them to zero, but some residual distortion is intentional as a\n  data-cleaning exercise.\n\n## Composition\n\n- **Entities.** Accounts, contacts, leads, touches, sessions,\n  sales_activities, opportunities (public); plus customers and\n  subscriptions (instructor only). Per-row counts per bundle live in\n  `manifest.json`.\n- **Features.** 32 public columns grouped by analytical role in\n  [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);\n  the per-bundle `feature_dictionary.csv` is the authoritative\n  machine-readable spec.\n- **Label.** `converted_within_90_days` (boolean), event-derived from\n  the simulator. Never sampled directly.\n- **Splits.** 70/15/15 train/valid/test, deterministic given seed;\n  recorded in `tasks/converted_within_90_days/task_manifest.json`.\n  Splits are keyed on `lead_id`; see the *Evaluation note* above for\n  the account-overlap caveat.\n- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package\n  version stamped in `manifest.json`.\n\n## Maintenance, adversarial framing, license\n\nWe *want* the dataset to be broken. The\n[break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues\nnine adversarial patterns to look for (leakage, split\ncontamination, ranking inversions, calibration drift) with\nworked-example pointers back into the notebooks. Issue\ntemplates ship under `.github/ISSUE_TEMPLATE/`: a\n[breakage report](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml)\nform for findings on the bundle itself, and a\n[realism feedback](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml)\nform for distributional critiques. Accepted findings are\nlogged in\n[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md).\nFile issues at\n[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);\nPRs welcome.\n\n| Field | Value |\n|---|---|\n| Generator | leadforge `1.0.0+` |\n| Recipe | `b2b_saas_procurement_v1` |\n| Canonical seed | 42 (cross-seed sweep: 42–46) |\n| Bundle schema version | 5 |\n| Format | Parquet (canonical) + CSV (convenience) |\n| License | MIT — see [LICENSE](LICENSE) |\n\nVerify integrity with `leadforge validate <bundle_dir>`; every file\nis hashed in `manifest.json`.\n",
   "expectedUpdateFrequency": "never",
   "id": "leadforge/leadforge-lead-scoring-v1",
   "image": "dataset-cover-image.png",
diff --git a/scripts/_release_common.py b/scripts/_release_common.py
index 8dcb34b..31d7c94 100644
--- a/scripts/_release_common.py
+++ b/scripts/_release_common.py
@@ -101,7 +101,6 @@ class ValidationError:
 │   ├── lead_scoring.csv              # flat convenience CSV (all splits)
 │   ├── tables/*.parquet              # 7 snapshot-safe relational tables
 │   └── tasks/converted_within_90_days/{train,valid,test}.parquet
-├── intermediate_instructor/          # research companion: full-horizon tables + metadata/
 ├── docs/                             # vendored DGP / leakage / break-me docs (agent-readable)
 ├── notebooks/                        # 01 baseline · 02 relational · 03 leakage · 04 calibration
 ├── metrics.json                      # top-level cross-tier metrics summary
diff --git a/scripts/package_hf_release.py b/scripts/package_hf_release.py
index 650d7e5..2c56208 100644
--- a/scripts/package_hf_release.py
+++ b/scripts/package_hf_release.py
@@ -87,9 +87,11 @@
 HF_SIZE_BUCKET_5K: Final[str] = "1K<n<10K"
 
 #: Public-tier configs land under ``intro``/``intermediate``/``advanced``;
-#: ``intermediate`` is the recommended entry point per G12.2 (the
-#: difficulty band sits between the two extremes).
-DEFAULT_DEFAULT_CONFIG: Final[str] = "intermediate"
+#: ``intro`` is the recommended entry point — students loading the dataset
+#: with no config argument land in the highest-prevalence tier, which is
+#: the most forgiving teaching context. ``intermediate`` is the default
+#: benchmark for graded assignments and course evaluation.
+DEFAULT_DEFAULT_CONFIG: Final[str] = "intro"
 
 #: Allowed HF dataset-card ``task_categories`` token for tabular
 #: binary classification.
diff --git a/tests/scripts/test_package_hf_release.py b/tests/scripts/test_package_hf_release.py
index f5bfc2a..80ff0d3 100644
--- a/tests/scripts/test_package_hf_release.py
+++ b/tests/scripts/test_package_hf_release.py
@@ -642,7 +642,7 @@ def test_committed_hf_readme_matches_fresh_regeneration(tmp_path: Path) -> None:
     assert len(fm["configs"]) == 3
     defaults = [c for c in fm["configs"] if c.get("default")]
     assert len(defaults) == 1
-    assert defaults[0]["config_name"] == "intermediate"
+    assert defaults[0]["config_name"] == "intro"  # intro is the default entry point
     # Body inherited the rewritten release card content.
     assert "What's inside" in body
     assert "Why lead scoring matters" in body

From 4f47d2a23921dd4ce734e1022abe8d6dbb5697fb Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Mon, 25 May 2026 15:24:21 +0300
Subject: [PATCH 2/2] =?UTF-8?q?fix(release):=20address=20PR=208.2=20review?=
 =?UTF-8?q?=20=E2=80=94=20isPrivate,=20column=20count,=20contact=20overlap?=
 =?UTF-8?q?,=20artifact=20zeros,=20metric=20header?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Issue 1 (CRITICAL): isPrivate was never actually fixed — packager
hardcoded isPrivate=True at build_metadata() call site. Changed to
isPrivate=False in scripts/package_kaggle_release.py. Committed
dataset-metadata.json regenerated; now has isPrivate: false.

Issue 2 (HIGH): Composition section said '32 public columns' but PR 8.1
dropped first_touch_channel, making the count 31. Summary table was
already correct (31 / 34*); prose was not updated. Fixed in both
release/README.md and (via packager) release/huggingface/README.md.

Issue 3 (HIGH): Added test_committed_kaggle_metadata_is_not_private()
to tests/scripts/test_package_kaggle_release.py. Reads the committed
JSON and asserts isPrivate is False with a clear failure message pointing
at the fix location. Prevents silent regression.

Issue 4 (MEDIUM): Contact-level overlap disclosure was silently deleted
from the Evaluation note. Restored: section renamed to 'Evaluation note
— account and contact overlap'; added sentence that contact-level overlap
is comparable in magnitude to account-level. Known limitations bullet
also updated to say 'account and contact overlap'.

Issue 5 (MEDIUM): Non-physical values bullet described pre-clamp
internals (negative values, recipe parameter names) rather than
user-observable state. Rewrote: users see artifact zeros (clamped
results), not negative values; removed noise_scale/missing_rate
parameter names; focused on the data-cleaning diagnostic (suspicious
zero clusters in count and duration columns).

Issue 6 (MEDIUM): 'Cal. max-bin err' column header was not mappable to
the canonical metric name calibration_max_bin_error in
validation_report.json. Changed to backtick-quoted canonical name.

Issue 7 (LOW): DEFAULT_DEFAULT_CONFIG comment said 'intermediate is the
default benchmark for graded assignments' — contradictory above a
constant that just changed from intermediate to intro. Rewrote as a
recommendation: 'pass default_config="intermediate" for graded
assignments'.

Tests: 1419 passed, 5 skipped (pre-existing notebook test on main).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 release/README.md                             | 38 ++++++++++---------
 .../huggingface_public.html                   | 38 ++++++++++---------
 release/_preview_committed/kaggle.html        | 38 ++++++++++---------
 release/huggingface/README.md                 | 38 ++++++++++---------
 release/kaggle/dataset-metadata.json          |  4 +-
 scripts/package_hf_release.py                 |  5 ++-
 scripts/package_kaggle_release.py             |  2 +-
 tests/scripts/test_package_kaggle_release.py  | 23 +++++++++++
 8 files changed, 113 insertions(+), 73 deletions(-)

diff --git a/release/README.md b/release/README.md
index 10f76c6..b058f6e 100644
--- a/release/README.md
+++ b/release/README.md
@@ -108,12 +108,14 @@ exception is `total_touches_all`, the leakage trap — flagged
 `leakage_risk=True` in `feature_dictionary.csv`. Drop it from your
 feature set unless you're demonstrating leakage detection.
 
-## Evaluation note — account overlap
+## Evaluation note — account and contact overlap
 
 **518 of 557 test accounts (≈93 %) appear in train** on the intermediate
-bundle; the other tiers are similar. The random-split headline metrics
-therefore ride account-level signal across the split boundary and
-over-estimate generalisation to unseen accounts. For a faithful
+bundle; the other tiers are similar. Contact-level overlap is comparable
+in magnitude: most test contacts also have activity in the training set.
+The random-split headline metrics therefore ride both account-level and
+contact-level signal across the split boundary and over-estimate
+generalisation to unseen accounts and contacts. For a faithful
 out-of-sample number, retrain with `GroupKFold(account_id)` and report
 both metrics. Notebook 02 demonstrates the detection recipe;
 [`break_me_guide.md`](../docs/release/break_me_guide.md) §5 gives
@@ -194,7 +196,7 @@ with bands declared in
 [`docs/release/v1_acceptance_gates_bands.yaml`](../docs/release/v1_acceptance_gates_bands.yaml).
 Headline cross-seed medians (seeds 42–46):
 
-| Tier | LR AUC | AP | P@100 | Brier | Cal. max-bin err |
+| Tier | LR AUC | AP | P@100 | Brier | `calibration_max_bin_error` |
 |---|---|---|---|---|---|
 | intro | 0.879 | 0.761 | 0.80 | 0.130 | 0.25 |
 | intermediate | 0.886 | 0.575 | 0.59 | 0.110 | 0.25 |
@@ -240,10 +242,11 @@ intended prevalence axis (intro > intermediate > advanced).
   conversion rate (43% / 22% / 8%), noise scale, and missingness —
   not in rank discrimination. Use AP, P@K, and calibration metrics
   to see the difficulty gradient; AUC alone will not show it.
-- **93% account overlap across train / test splits.** Random splits are
-  keyed on lead ID; most test accounts also appear in train. Headline
-  metrics over-state generalisation to unseen accounts. Use
-  `GroupKFold(account_id)` for a faithful estimate.
+- **93% account and contact overlap across train / test splits.** Random
+  splits are keyed on lead ID; most test accounts and contacts also
+  appear in train. Headline metrics over-state generalisation to unseen
+  accounts and contacts. Use `GroupKFold(account_id)` for a faithful
+  estimate.
 - **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta
   is slightly negative in every tier (intro −0.0045, intermediate
   −0.0072, advanced −0.0133); v1's snapshot is dominated by linear
@@ -257,13 +260,14 @@ intended prevalence axis (intro > intermediate > advanced).
 - **Cohort-shift degradation is small.** v1 has no time-of-year drift
   baked in; the cohort-shift gate (G6.4) is informational and will
   bite in v2.
-- **Advanced-tier noise can produce non-physical values.** With
-  `noise_scale=0.55` and `missing_rate=18%`, Gaussian noise injection
-  can yield negative values in count and duration columns before MCAR
-  fill (e.g. a negative `days_since_last_touch`). These are treated
-  as real-world data-messiness artifacts; the snapshot builder clamps
-  them to zero, but some residual distortion is intentional as a
-  data-cleaning exercise.
+- **Advanced-tier noise can produce artifact zeros in count and duration
+  columns.** Gaussian noise is applied before MCAR missingness; the
+  snapshot builder clamps results below zero to zero. What users observe
+  is therefore not negative values but zeros that may be noise artifacts
+  rather than true zero values — e.g. `days_since_last_touch = 0` might
+  mean "noised below zero, clamped" rather than "touched today". Treat
+  suspicious zero clusters in the Advanced tier as intentional
+  data-cleaning exercise material.
 
 ## Composition
 
@@ -271,7 +275,7 @@ intended prevalence axis (intro > intermediate > advanced).
   sales_activities, opportunities (public); plus customers and
   subscriptions (instructor only). Per-row counts per bundle live in
   `manifest.json`.
-- **Features.** 32 public columns grouped by analytical role in
+- **Features.** 31 public columns grouped by analytical role in
   [`docs/release/feature_dictionary.md`](../docs/release/feature_dictionary.md);
   the per-bundle `feature_dictionary.csv` is the authoritative
   machine-readable spec.
diff --git a/release/_preview_committed/huggingface_public.html b/release/_preview_committed/huggingface_public.html
index b7e6aef..0bd8e75 100644
--- a/release/_preview_committed/huggingface_public.html
+++ b/release/_preview_committed/huggingface_public.html
@@ -262,11 +262,13 @@ <h2>Quick start</h2>
 exception is <code>total_touches_all</code>, the leakage trap — flagged
 <code>leakage_risk=True</code> in <code>feature_dictionary.csv</code>. Drop it from your
 feature set unless you're demonstrating leakage detection.</p>
-<h2>Evaluation note — account overlap</h2>
+<h2>Evaluation note — account and contact overlap</h2>
 <p><strong>518 of 557 test accounts (≈93 %) appear in train</strong> on the intermediate
-bundle; the other tiers are similar. The random-split headline metrics
-therefore ride account-level signal across the split boundary and
-over-estimate generalisation to unseen accounts. For a faithful
+bundle; the other tiers are similar. Contact-level overlap is comparable
+in magnitude: most test contacts also have activity in the training set.
+The random-split headline metrics therefore ride both account-level and
+contact-level signal across the split boundary and over-estimate
+generalisation to unseen accounts and contacts. For a faithful
 out-of-sample number, retrain with <code>GroupKFold(account_id)</code> and report
 both metrics. Notebook 02 demonstrates the detection recipe;
 <a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md"><code>break_me_guide.md</code></a> §5 gives
@@ -433,7 +435,7 @@ <h2>Calibration</h2>
 <th>AP</th>
 <th>P@100</th>
 <th>Brier</th>
-<th>Cal. max-bin err</th>
+<th><code>calibration_max_bin_error</code></th>
 </tr>
 </thead>
 <tbody>
@@ -501,10 +503,11 @@ <h2>Known limitations</h2>
 conversion rate (43% / 22% / 8%), noise scale, and missingness —
 not in rank discrimination. Use AP, P@K, and calibration metrics
 to see the difficulty gradient; AUC alone will not show it.</li>
-<li><strong>93% account overlap across train / test splits.</strong> Random splits are
-keyed on lead ID; most test accounts also appear in train. Headline
-metrics over-state generalisation to unseen accounts. Use
-<code>GroupKFold(account_id)</code> for a faithful estimate.</li>
+<li><strong>93% account and contact overlap across train / test splits.</strong> Random
+splits are keyed on lead ID; most test accounts and contacts also
+appear in train. Headline metrics over-state generalisation to unseen
+accounts and contacts. Use <code>GroupKFold(account_id)</code> for a faithful
+estimate.</li>
 <li><strong>GBM does not consistently beat LR (gate G7.4.4).</strong> GBM−LR AUC delta
 is slightly negative in every tier (intro −0.0045, intermediate
 −0.0072, advanced −0.0133); v1's snapshot is dominated by linear
@@ -518,13 +521,14 @@ <h2>Known limitations</h2>
 <li><strong>Cohort-shift degradation is small.</strong> v1 has no time-of-year drift
 baked in; the cohort-shift gate (G6.4) is informational and will
 bite in v2.</li>
-<li><strong>Advanced-tier noise can produce non-physical values.</strong> With
-<code>noise_scale=0.55</code> and <code>missing_rate=18%</code>, Gaussian noise injection
-can yield negative values in count and duration columns before MCAR
-fill (e.g. a negative <code>days_since_last_touch</code>). These are treated
-as real-world data-messiness artifacts; the snapshot builder clamps
-them to zero, but some residual distortion is intentional as a
-data-cleaning exercise.</li>
+<li><strong>Advanced-tier noise can produce artifact zeros in count and duration
+columns.</strong> Gaussian noise is applied before MCAR missingness; the
+snapshot builder clamps results below zero to zero. What users observe
+is therefore not negative values but zeros that may be noise artifacts
+rather than true zero values — e.g. <code>days_since_last_touch = 0</code> might
+mean &quot;noised below zero, clamped&quot; rather than &quot;touched today&quot;. Treat
+suspicious zero clusters in the Advanced tier as intentional
+data-cleaning exercise material.</li>
 </ul>
 <h2>Composition</h2>
 <ul>
@@ -532,7 +536,7 @@ <h2>Composition</h2>
 sales_activities, opportunities (public); plus customers and
 subscriptions (instructor only). Per-row counts per bundle live in
 <code>manifest.json</code>.</li>
-<li><strong>Features.</strong> 32 public columns grouped by analytical role in
+<li><strong>Features.</strong> 31 public columns grouped by analytical role in
 <a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md"><code>docs/release/feature_dictionary.md</code></a>;
 the per-bundle <code>feature_dictionary.csv</code> is the authoritative
 machine-readable spec.</li>
diff --git a/release/_preview_committed/kaggle.html b/release/_preview_committed/kaggle.html
index a941c49..9779e33 100644
--- a/release/_preview_committed/kaggle.html
+++ b/release/_preview_committed/kaggle.html
@@ -246,11 +246,13 @@ <h2>Quick start</h2>
 exception is <code>total_touches_all</code>, the leakage trap — flagged
 <code>leakage_risk=True</code> in <code>feature_dictionary.csv</code>. Drop it from your
 feature set unless you're demonstrating leakage detection.</p>
-<h2>Evaluation note — account overlap</h2>
+<h2>Evaluation note — account and contact overlap</h2>
 <p><strong>518 of 557 test accounts (≈93 %) appear in train</strong> on the intermediate
-bundle; the other tiers are similar. The random-split headline metrics
-therefore ride account-level signal across the split boundary and
-over-estimate generalisation to unseen accounts. For a faithful
+bundle; the other tiers are similar. Contact-level overlap is comparable
+in magnitude: most test contacts also have activity in the training set.
+The random-split headline metrics therefore ride both account-level and
+contact-level signal across the split boundary and over-estimate
+generalisation to unseen accounts and contacts. For a faithful
 out-of-sample number, retrain with <code>GroupKFold(account_id)</code> and report
 both metrics. Notebook 02 demonstrates the detection recipe;
 <a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md"><code>break_me_guide.md</code></a> §5 gives
@@ -417,7 +419,7 @@ <h2>Calibration</h2>
 <th>AP</th>
 <th>P@100</th>
 <th>Brier</th>
-<th>Cal. max-bin err</th>
+<th><code>calibration_max_bin_error</code></th>
 </tr>
 </thead>
 <tbody>
@@ -485,10 +487,11 @@ <h2>Known limitations</h2>
 conversion rate (43% / 22% / 8%), noise scale, and missingness —
 not in rank discrimination. Use AP, P@K, and calibration metrics
 to see the difficulty gradient; AUC alone will not show it.</li>
-<li><strong>93% account overlap across train / test splits.</strong> Random splits are
-keyed on lead ID; most test accounts also appear in train. Headline
-metrics over-state generalisation to unseen accounts. Use
-<code>GroupKFold(account_id)</code> for a faithful estimate.</li>
+<li><strong>93% account and contact overlap across train / test splits.</strong> Random
+splits are keyed on lead ID; most test accounts and contacts also
+appear in train. Headline metrics over-state generalisation to unseen
+accounts and contacts. Use <code>GroupKFold(account_id)</code> for a faithful
+estimate.</li>
 <li><strong>GBM does not consistently beat LR (gate G7.4.4).</strong> GBM−LR AUC delta
 is slightly negative in every tier (intro −0.0045, intermediate
 −0.0072, advanced −0.0133); v1's snapshot is dominated by linear
@@ -502,13 +505,14 @@ <h2>Known limitations</h2>
 <li><strong>Cohort-shift degradation is small.</strong> v1 has no time-of-year drift
 baked in; the cohort-shift gate (G6.4) is informational and will
 bite in v2.</li>
-<li><strong>Advanced-tier noise can produce non-physical values.</strong> With
-<code>noise_scale=0.55</code> and <code>missing_rate=18%</code>, Gaussian noise injection
-can yield negative values in count and duration columns before MCAR
-fill (e.g. a negative <code>days_since_last_touch</code>). These are treated
-as real-world data-messiness artifacts; the snapshot builder clamps
-them to zero, but some residual distortion is intentional as a
-data-cleaning exercise.</li>
+<li><strong>Advanced-tier noise can produce artifact zeros in count and duration
+columns.</strong> Gaussian noise is applied before MCAR missingness; the
+snapshot builder clamps results below zero to zero. What users observe
+is therefore not negative values but zeros that may be noise artifacts
+rather than true zero values — e.g. <code>days_since_last_touch = 0</code> might
+mean &quot;noised below zero, clamped&quot; rather than &quot;touched today&quot;. Treat
+suspicious zero clusters in the Advanced tier as intentional
+data-cleaning exercise material.</li>
 </ul>
 <h2>Composition</h2>
 <ul>
@@ -516,7 +520,7 @@ <h2>Composition</h2>
 sales_activities, opportunities (public); plus customers and
 subscriptions (instructor only). Per-row counts per bundle live in
 <code>manifest.json</code>.</li>
-<li><strong>Features.</strong> 32 public columns grouped by analytical role in
+<li><strong>Features.</strong> 31 public columns grouped by analytical role in
 <a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md"><code>docs/release/feature_dictionary.md</code></a>;
 the per-bundle <code>feature_dictionary.csv</code> is the authoritative
 machine-readable spec.</li>
diff --git a/release/huggingface/README.md b/release/huggingface/README.md
index 3be25fa..2cfce2c 100644
--- a/release/huggingface/README.md
+++ b/release/huggingface/README.md
@@ -154,12 +154,14 @@ exception is `total_touches_all`, the leakage trap — flagged
 `leakage_risk=True` in `feature_dictionary.csv`. Drop it from your
 feature set unless you're demonstrating leakage detection.
 
-## Evaluation note — account overlap
+## Evaluation note — account and contact overlap
 
 **518 of 557 test accounts (≈93 %) appear in train** on the intermediate
-bundle; the other tiers are similar. The random-split headline metrics
-therefore ride account-level signal across the split boundary and
-over-estimate generalisation to unseen accounts. For a faithful
+bundle; the other tiers are similar. Contact-level overlap is comparable
+in magnitude: most test contacts also have activity in the training set.
+The random-split headline metrics therefore ride both account-level and
+contact-level signal across the split boundary and over-estimate
+generalisation to unseen accounts and contacts. For a faithful
 out-of-sample number, retrain with `GroupKFold(account_id)` and report
 both metrics. Notebook 02 demonstrates the detection recipe;
 [`break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) §5 gives
@@ -240,7 +242,7 @@ with bands declared in
 [`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).
 Headline cross-seed medians (seeds 42–46):
 
-| Tier | LR AUC | AP | P@100 | Brier | Cal. max-bin err |
+| Tier | LR AUC | AP | P@100 | Brier | `calibration_max_bin_error` |
 |---|---|---|---|---|---|
 | intro | 0.879 | 0.761 | 0.80 | 0.130 | 0.25 |
 | intermediate | 0.886 | 0.575 | 0.59 | 0.110 | 0.25 |
@@ -286,10 +288,11 @@ intended prevalence axis (intro > intermediate > advanced).
   conversion rate (43% / 22% / 8%), noise scale, and missingness —
   not in rank discrimination. Use AP, P@K, and calibration metrics
   to see the difficulty gradient; AUC alone will not show it.
-- **93% account overlap across train / test splits.** Random splits are
-  keyed on lead ID; most test accounts also appear in train. Headline
-  metrics over-state generalisation to unseen accounts. Use
-  `GroupKFold(account_id)` for a faithful estimate.
+- **93% account and contact overlap across train / test splits.** Random
+  splits are keyed on lead ID; most test accounts and contacts also
+  appear in train. Headline metrics over-state generalisation to unseen
+  accounts and contacts. Use `GroupKFold(account_id)` for a faithful
+  estimate.
 - **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta
   is slightly negative in every tier (intro −0.0045, intermediate
   −0.0072, advanced −0.0133); v1's snapshot is dominated by linear
@@ -303,13 +306,14 @@ intended prevalence axis (intro > intermediate > advanced).
 - **Cohort-shift degradation is small.** v1 has no time-of-year drift
   baked in; the cohort-shift gate (G6.4) is informational and will
   bite in v2.
-- **Advanced-tier noise can produce non-physical values.** With
-  `noise_scale=0.55` and `missing_rate=18%`, Gaussian noise injection
-  can yield negative values in count and duration columns before MCAR
-  fill (e.g. a negative `days_since_last_touch`). These are treated
-  as real-world data-messiness artifacts; the snapshot builder clamps
-  them to zero, but some residual distortion is intentional as a
-  data-cleaning exercise.
+- **Advanced-tier noise can produce artifact zeros in count and duration
+  columns.** Gaussian noise is applied before MCAR missingness; the
+  snapshot builder clamps results below zero to zero. What users observe
+  is therefore not negative values but zeros that may be noise artifacts
+  rather than true zero values — e.g. `days_since_last_touch = 0` might
+  mean "noised below zero, clamped" rather than "touched today". Treat
+  suspicious zero clusters in the Advanced tier as intentional
+  data-cleaning exercise material.
 
 ## Composition
 
@@ -317,7 +321,7 @@ intended prevalence axis (intro > intermediate > advanced).
   sales_activities, opportunities (public); plus customers and
   subscriptions (instructor only). Per-row counts per bundle live in
   `manifest.json`.
-- **Features.** 32 public columns grouped by analytical role in
+- **Features.** 31 public columns grouped by analytical role in
   [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);
   the per-bundle `feature_dictionary.csv` is the authoritative
   machine-readable spec.
diff --git a/release/kaggle/dataset-metadata.json b/release/kaggle/dataset-metadata.json
index 4036e83..051a9c3 100644
--- a/release/kaggle/dataset-metadata.json
+++ b/release/kaggle/dataset-metadata.json
@@ -1,10 +1,10 @@
 {
   "collaborators": [],
-  "description": "# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)\n\nA relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Generated by\n[leadforge](https://github.com/leadforge-dev/leadforge), an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at `1.x`; the dataset is published under the explicit `…-v1`\ntag.\n\n## Why lead scoring matters in 2024–2026\n\nMid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising[^macro], so predicting *which* leads\nconvert within a fixed window has moved from a marketing nicety to a\nsurvival skill. This dataset teaches that skill on a relational\nsubstrate, with the realistic confusions (snapshot-window discipline,\nleakage traps, channel signal weaker than vendor blogs imply) that\nstudents will hit when they finally get hands on real CRM data.\n\n[^macro]: Macroeconomic framing summarised in\n[`docs/external_review/summaries/gemini_v2_summary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md)\n(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio\nrose materially in 2024).\n\n## What's inside\n\n```\n.\n├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier\n│   ├── manifest.json                 # provenance + file hashes\n│   ├── metrics.json                  # per-tier headline metrics (medians + spreads)\n│   ├── dataset_card.md               # auto-rendered per-bundle card\n│   ├── feature_dictionary.csv        # authoritative column spec\n│   ├── lead_scoring.csv              # flat convenience CSV (all splits)\n│   ├── tables/*.parquet              # 7 snapshot-safe relational tables\n│   └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── docs/                             # vendored DGP / leakage / break-me docs (agent-readable)\n├── metrics.json                      # top-level cross-tier metrics summary\n├── claims_register.{md,json}         # claims → backing-artifact map (agent-readable)\n├── dataset-metadata.json             # Kaggle dataset metadata\n├── dataset-cover-image.png           # Kaggle cover image\n├── README.md                         # Kaggle package README\n└── LICENSE\n```\n\n`student_public` bundles ship the snapshot-safe relational view;\n`research_instructor` companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder `metadata/`. The full layout is documented in each bundle's\n`manifest.json`.\n\n### Agent-reviewable artifacts\n\nThe published bundle is self-contained for AI review and offline\nauditing — every numeric / structural claim on this page can be\nverified without following an external link:\n\n- **`metrics.json` (root) + `<tier>/metrics.json`** — deterministic\n  JSON view of the headline LR AUC / AP / P@100 / Brier / conversion\n  rate / cohort-shift / cross-tier-ordering medians, with JSON-path\n  back-references to `validation/validation_report.json` (the\n  source of truth).\n- **`claims_register.{md,json}`** — every numerical or structural\n  claim on this page paired with the artifact and path that backs it.\n  Rendered from `claims_register_source.yaml` by\n  `scripts/build_claims_register.py`.\n- **`docs/`** — vendored copies of `generation_method.md`,\n  `channel_signal_audit.md`, `break_me_guide.md`,\n  `feature_dictionary.md`, `v1_acceptance_gates_bands.yaml`,\n  `v2_decision_log.md`, plus a hand-authored\n  `relational_table_schemas.csv` documenting every column of every\n  relational table.  These match the GitHub-blob links cited below but\n  ship inside the bundle so a reviewer never needs network access.\n- **`<tier>/manifest.json`** — SHA-256 hash for every file plus the\n  full redaction contract (`structural_redactions.columns`,\n  `omitted_tables`, `relational_snapshot_safe`, `snapshot_day`).\n- Kaggle / HuggingFace preview pages additionally inject a\n  `schema.org/Dataset` JSON-LD block in their `<head>` for agent\n  ingestion without HTML parsing.\n\n## Quick start\n\n```python\n# Flat CSV\ndf = pd.read_csv(\"intermediate/lead_scoring.csv\")\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/train.parquet\")\ntest  = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/test.parquet\")\n\n# Relational tables (feature engineering — example)\nleads   = pd.read_parquet(\"intermediate/tables/leads.parquet\")\ntouches = pd.read_parquet(\"intermediate/tables/touches.parquet\")\nmy_touch_count = (\n    touches.groupby(\"lead_id\").size().rename(\"my_touch_count\").reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=\"lead_id\", how=\"left\")\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n#                    --mode student_public --difficulty intermediate --out my_bundle\n```\n\nThe label `converted_within_90_days` resolves over a 90-day window;\nengagement features (`touch_count`, `session_count`, etc.) are\ncomputed strictly over events on days `[0, 30]`. The deliberate\nexception is `total_touches_all`, the leakage trap — flagged\n`leakage_risk=True` in `feature_dictionary.csv`. Drop it from your\nfeature set unless you're demonstrating leakage detection.\n\n## Evaluation note — account overlap\n\n**518 of 557 test accounts (≈93 %) appear in train** on the intermediate\nbundle; the other tiers are similar. The random-split headline metrics\ntherefore ride account-level signal across the split boundary and\nover-estimate generalisation to unseen accounts. For a faithful\nout-of-sample number, retrain with `GroupKFold(account_id)` and report\nboth metrics. Notebook 02 demonstrates the detection recipe;\n[`break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) §5 gives\nthe worked example.\n\n## Dataset summary\n\n**Tiers are prevalence and noise axes, not modelling-complexity axes.**\nLR AUC is ~0.88 in every tier by design. The tiers differ in conversion\nrate, missingness, and noise — not rank discrimination. Choose a tier\nbased on the teaching exercise, not on expected AUC:\n\n| | Intro | Intermediate | Advanced |\n|---|---|---|---|\n| **Tier purpose** | High-prevalence warm-up | Default benchmark | Low-prevalence · calibration · noise exercise |\n| Leads | 5,000 | 5,000 | 5,000 |\n| Accounts | 1,500 | 1,500 | 1,500 |\n| Contacts | 4,200 | 4,200 | 4,200 |\n| Snapshot columns | 31 / 34* | 31 / 34* | 31 / 34* |\n| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |\n| Conversion rate (acceptance band, gate G7.\\*) | 24–61% | 12–31% | 4–12% |\n| Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |\n| Signal strength | 0.90 | 0.70 | 0.50 |\n| Noise scale | 0.10 | 0.30 | 0.55 |\n| Missing rate | 2% | 8% | 18% |\n\n\\* `student_public` / `research_instructor`. Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping. The acceptance band is the recipe\ngate's tolerance window (`v1_acceptance_gates_bands.yaml` G7.\\*),\nnot the achievable range — observed five-seed spreads sit\ncomfortably inside the band.\n\n## The scenario\n\n**Veridian Technologies** is a fictional Series B startup (Austin, US)\nselling **Veridian Procure**, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). **Task:** predict whether\na lead converts (`closed_won`) within 90 days. ACV bands are\n$18k–$120k. See\n[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)\nfor the full DGP, and the deeper \"what's modelled / approximate / not\nmodelled\" breakdown that this README only summarises.\n\n## Public vs instructor: what's redacted\n\nFiltering happens **during rendering**, not during simulation. The\nredaction contract is single-sourced in\n[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.\n\n| Source-of-truth constant | Public bundle treatment |\n|---|---|\n| `BANNED_LEAD_COLUMNS = (\"converted_within_90_days\", \"conversion_timestamp\")` | Dropped from `tables/leads.parquet` |\n| `BANNED_OPP_COLUMNS = (\"close_outcome\", \"closed_at\")` | Dropped from `tables/opportunities.parquet` |\n| `BANNED_TABLES = (\"customers\", \"subscriptions\")` | Omitted from public bundles |\n| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |\n| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |\n| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |\n\nEach bundle's `manifest.json` records `relational_snapshot_safe`,\n`redacted_columns`, and `snapshot_day`, so the bundle is\nself-describing.\n\n## Calibration\n\nEvery realism / calibration / difficulty claim in this README is\nbacked by\n[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),\nregenerated by\n[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)\nwith bands declared in\n[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).\nHeadline cross-seed medians (seeds 42–46):\n\n| Tier | LR AUC | AP | P@100 | Brier | Cal. max-bin err |\n|---|---|---|---|---|---|\n| intro | 0.879 | 0.761 | 0.80 | 0.130 | 0.25 |\n| intermediate | 0.886 | 0.575 | 0.59 | 0.110 | 0.25 |\n| advanced | 0.886 | 0.351 | 0.34 | 0.061 | **0.52** |\n\n**Reading this table:** LR AUC is flat across tiers by design — the\ntiers are a prevalence / noise axis, not a rank-discrimination axis.\nBrier score *improves* as prevalence falls (a prevalence effect, not\nbetter calibration); use `calibration_max_bin_error` to assess\ncalibration quality. Advanced's 0.52 max-bin error means the model's\npredicted probabilities are materially mis-scaled against actual\nconversion rates — a realistic miscalibration exercise.\n\nAP, P@100, conversion-rate, and lift orderings hold across the\nintended prevalence axis (intro > intermediate > advanced).\n\n## Intended uses\n\n- Teaching baseline lead-scoring on a flat snapshot.\n- Teaching relational feature engineering against snapshot-safe tables.\n- Teaching leakage detection (the `total_touches_all` trap is\n  designed to be discoverable).\n- Teaching calibration, lift, P@K, value-aware ranking\n  (`expected_acv × P(convert)`), and cohort-shift evaluation.\n- Comparing model families under a controlled DGP.\n\n## Out-of-scope uses\n\n- **Production lead scoring.** The company, product, and customers are\n  fictional.\n- **Vendor benchmarking / paper baselines.** Difficulty tiers are\n  calibrated for pedagogy, not cross-paper comparability.\n- **Causal-inference research that requires recovery of the true DGP.**\n  The instructor companion exposes the hidden graph for teaching, not\n  designed counterfactuals.\n- **Demographic / fairness research.** v1 does not model protected\n  attributes.\n\n## Known limitations\n\n- **Tiers are a prevalence / noise axis, not a modelling-complexity\n  axis.** LR AUC is ~0.88 in every tier; the three tiers differ in\n  conversion rate (43% / 22% / 8%), noise scale, and missingness —\n  not in rank discrimination. Use AP, P@K, and calibration metrics\n  to see the difficulty gradient; AUC alone will not show it.\n- **93% account overlap across train / test splits.** Random splits are\n  keyed on lead ID; most test accounts also appear in train. Headline\n  metrics over-state generalisation to unseen accounts. Use\n  `GroupKFold(account_id)` for a faithful estimate.\n- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta\n  is slightly negative in every tier (intro −0.0045, intermediate\n  −0.0072, advanced −0.0133); v1's snapshot is dominated by linear\n  features. v2 will inject non-linear interactions in the simulator.\n- **Channel signal is weak.** Per\n  [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),\n  out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across\n  all tiers and the per-channel rate spread is ≤0.05. The simulator\n  does not encode channel-conditional probabilities; channel-conditional\n  encoding is post-v1 work.\n- **Cohort-shift degradation is small.** v1 has no time-of-year drift\n  baked in; the cohort-shift gate (G6.4) is informational and will\n  bite in v2.\n- **Advanced-tier noise can produce non-physical values.** With\n  `noise_scale=0.55` and `missing_rate=18%`, Gaussian noise injection\n  can yield negative values in count and duration columns before MCAR\n  fill (e.g. a negative `days_since_last_touch`). These are treated\n  as real-world data-messiness artifacts; the snapshot builder clamps\n  them to zero, but some residual distortion is intentional as a\n  data-cleaning exercise.\n\n## Composition\n\n- **Entities.** Accounts, contacts, leads, touches, sessions,\n  sales_activities, opportunities (public); plus customers and\n  subscriptions (instructor only). Per-row counts per bundle live in\n  `manifest.json`.\n- **Features.** 32 public columns grouped by analytical role in\n  [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);\n  the per-bundle `feature_dictionary.csv` is the authoritative\n  machine-readable spec.\n- **Label.** `converted_within_90_days` (boolean), event-derived from\n  the simulator. Never sampled directly.\n- **Splits.** 70/15/15 train/valid/test, deterministic given seed;\n  recorded in `tasks/converted_within_90_days/task_manifest.json`.\n  Splits are keyed on `lead_id`; see the *Evaluation note* above for\n  the account-overlap caveat.\n- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package\n  version stamped in `manifest.json`.\n\n## Maintenance, adversarial framing, license\n\nWe *want* the dataset to be broken. The\n[break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues\nnine adversarial patterns to look for (leakage, split\ncontamination, ranking inversions, calibration drift) with\nworked-example pointers back into the notebooks. Issue\ntemplates ship under `.github/ISSUE_TEMPLATE/`: a\n[breakage report](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml)\nform for findings on the bundle itself, and a\n[realism feedback](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml)\nform for distributional critiques. Accepted findings are\nlogged in\n[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md).\nFile issues at\n[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);\nPRs welcome.\n\n| Field | Value |\n|---|---|\n| Generator | leadforge `1.0.0+` |\n| Recipe | `b2b_saas_procurement_v1` |\n| Canonical seed | 42 (cross-seed sweep: 42–46) |\n| Bundle schema version | 5 |\n| Format | Parquet (canonical) + CSV (convenience) |\n| License | MIT — see [LICENSE](LICENSE) |\n\nVerify integrity with `leadforge validate <bundle_dir>`; every file\nis hashed in `manifest.json`.\n",
+  "description": "# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)\n\nA relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Generated by\n[leadforge](https://github.com/leadforge-dev/leadforge), an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at `1.x`; the dataset is published under the explicit `…-v1`\ntag.\n\n## Why lead scoring matters in 2024–2026\n\nMid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising[^macro], so predicting *which* leads\nconvert within a fixed window has moved from a marketing nicety to a\nsurvival skill. This dataset teaches that skill on a relational\nsubstrate, with the realistic confusions (snapshot-window discipline,\nleakage traps, channel signal weaker than vendor blogs imply) that\nstudents will hit when they finally get hands on real CRM data.\n\n[^macro]: Macroeconomic framing summarised in\n[`docs/external_review/summaries/gemini_v2_summary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md)\n(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio\nrose materially in 2024).\n\n## What's inside\n\n```\n.\n├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier\n│   ├── manifest.json                 # provenance + file hashes\n│   ├── metrics.json                  # per-tier headline metrics (medians + spreads)\n│   ├── dataset_card.md               # auto-rendered per-bundle card\n│   ├── feature_dictionary.csv        # authoritative column spec\n│   ├── lead_scoring.csv              # flat convenience CSV (all splits)\n│   ├── tables/*.parquet              # 7 snapshot-safe relational tables\n│   └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── docs/                             # vendored DGP / leakage / break-me docs (agent-readable)\n├── metrics.json                      # top-level cross-tier metrics summary\n├── claims_register.{md,json}         # claims → backing-artifact map (agent-readable)\n├── dataset-metadata.json             # Kaggle dataset metadata\n├── dataset-cover-image.png           # Kaggle cover image\n├── README.md                         # Kaggle package README\n└── LICENSE\n```\n\n`student_public` bundles ship the snapshot-safe relational view;\n`research_instructor` companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder `metadata/`. The full layout is documented in each bundle's\n`manifest.json`.\n\n### Agent-reviewable artifacts\n\nThe published bundle is self-contained for AI review and offline\nauditing — every numeric / structural claim on this page can be\nverified without following an external link:\n\n- **`metrics.json` (root) + `<tier>/metrics.json`** — deterministic\n  JSON view of the headline LR AUC / AP / P@100 / Brier / conversion\n  rate / cohort-shift / cross-tier-ordering medians, with JSON-path\n  back-references to `validation/validation_report.json` (the\n  source of truth).\n- **`claims_register.{md,json}`** — every numerical or structural\n  claim on this page paired with the artifact and path that backs it.\n  Rendered from `claims_register_source.yaml` by\n  `scripts/build_claims_register.py`.\n- **`docs/`** — vendored copies of `generation_method.md`,\n  `channel_signal_audit.md`, `break_me_guide.md`,\n  `feature_dictionary.md`, `v1_acceptance_gates_bands.yaml`,\n  `v2_decision_log.md`, plus a hand-authored\n  `relational_table_schemas.csv` documenting every column of every\n  relational table.  These match the GitHub-blob links cited below but\n  ship inside the bundle so a reviewer never needs network access.\n- **`<tier>/manifest.json`** — SHA-256 hash for every file plus the\n  full redaction contract (`structural_redactions.columns`,\n  `omitted_tables`, `relational_snapshot_safe`, `snapshot_day`).\n- Kaggle / HuggingFace preview pages additionally inject a\n  `schema.org/Dataset` JSON-LD block in their `<head>` for agent\n  ingestion without HTML parsing.\n\n## Quick start\n\n```python\n# Flat CSV\ndf = pd.read_csv(\"intermediate/lead_scoring.csv\")\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/train.parquet\")\ntest  = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/test.parquet\")\n\n# Relational tables (feature engineering — example)\nleads   = pd.read_parquet(\"intermediate/tables/leads.parquet\")\ntouches = pd.read_parquet(\"intermediate/tables/touches.parquet\")\nmy_touch_count = (\n    touches.groupby(\"lead_id\").size().rename(\"my_touch_count\").reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=\"lead_id\", how=\"left\")\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n#                    --mode student_public --difficulty intermediate --out my_bundle\n```\n\nThe label `converted_within_90_days` resolves over a 90-day window;\nengagement features (`touch_count`, `session_count`, etc.) are\ncomputed strictly over events on days `[0, 30]`. The deliberate\nexception is `total_touches_all`, the leakage trap — flagged\n`leakage_risk=True` in `feature_dictionary.csv`. Drop it from your\nfeature set unless you're demonstrating leakage detection.\n\n## Evaluation note — account and contact overlap\n\n**518 of 557 test accounts (≈93 %) appear in train** on the intermediate\nbundle; the other tiers are similar. Contact-level overlap is comparable\nin magnitude: most test contacts also have activity in the training set.\nThe random-split headline metrics therefore ride both account-level and\ncontact-level signal across the split boundary and over-estimate\ngeneralisation to unseen accounts and contacts. For a faithful\nout-of-sample number, retrain with `GroupKFold(account_id)` and report\nboth metrics. Notebook 02 demonstrates the detection recipe;\n[`break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) §5 gives\nthe worked example.\n\n## Dataset summary\n\n**Tiers are prevalence and noise axes, not modelling-complexity axes.**\nLR AUC is ~0.88 in every tier by design. The tiers differ in conversion\nrate, missingness, and noise — not rank discrimination. Choose a tier\nbased on the teaching exercise, not on expected AUC:\n\n| | Intro | Intermediate | Advanced |\n|---|---|---|---|\n| **Tier purpose** | High-prevalence warm-up | Default benchmark | Low-prevalence · calibration · noise exercise |\n| Leads | 5,000 | 5,000 | 5,000 |\n| Accounts | 1,500 | 1,500 | 1,500 |\n| Contacts | 4,200 | 4,200 | 4,200 |\n| Snapshot columns | 31 / 34* | 31 / 34* | 31 / 34* |\n| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |\n| Conversion rate (acceptance band, gate G7.\\*) | 24–61% | 12–31% | 4–12% |\n| Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |\n| Signal strength | 0.90 | 0.70 | 0.50 |\n| Noise scale | 0.10 | 0.30 | 0.55 |\n| Missing rate | 2% | 8% | 18% |\n\n\\* `student_public` / `research_instructor`. Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping. The acceptance band is the recipe\ngate's tolerance window (`v1_acceptance_gates_bands.yaml` G7.\\*),\nnot the achievable range — observed five-seed spreads sit\ncomfortably inside the band.\n\n## The scenario\n\n**Veridian Technologies** is a fictional Series B startup (Austin, US)\nselling **Veridian Procure**, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). **Task:** predict whether\na lead converts (`closed_won`) within 90 days. ACV bands are\n$18k–$120k. See\n[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)\nfor the full DGP, and the deeper \"what's modelled / approximate / not\nmodelled\" breakdown that this README only summarises.\n\n## Public vs instructor: what's redacted\n\nFiltering happens **during rendering**, not during simulation. The\nredaction contract is single-sourced in\n[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.\n\n| Source-of-truth constant | Public bundle treatment |\n|---|---|\n| `BANNED_LEAD_COLUMNS = (\"converted_within_90_days\", \"conversion_timestamp\")` | Dropped from `tables/leads.parquet` |\n| `BANNED_OPP_COLUMNS = (\"close_outcome\", \"closed_at\")` | Dropped from `tables/opportunities.parquet` |\n| `BANNED_TABLES = (\"customers\", \"subscriptions\")` | Omitted from public bundles |\n| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |\n| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |\n| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |\n\nEach bundle's `manifest.json` records `relational_snapshot_safe`,\n`redacted_columns`, and `snapshot_day`, so the bundle is\nself-describing.\n\n## Calibration\n\nEvery realism / calibration / difficulty claim in this README is\nbacked by\n[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),\nregenerated by\n[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)\nwith bands declared in\n[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).\nHeadline cross-seed medians (seeds 42–46):\n\n| Tier | LR AUC | AP | P@100 | Brier | `calibration_max_bin_error` |\n|---|---|---|---|---|---|\n| intro | 0.879 | 0.761 | 0.80 | 0.130 | 0.25 |\n| intermediate | 0.886 | 0.575 | 0.59 | 0.110 | 0.25 |\n| advanced | 0.886 | 0.351 | 0.34 | 0.061 | **0.52** |\n\n**Reading this table:** LR AUC is flat across tiers by design — the\ntiers are a prevalence / noise axis, not a rank-discrimination axis.\nBrier score *improves* as prevalence falls (a prevalence effect, not\nbetter calibration); use `calibration_max_bin_error` to assess\ncalibration quality. Advanced's 0.52 max-bin error means the model's\npredicted probabilities are materially mis-scaled against actual\nconversion rates — a realistic miscalibration exercise.\n\nAP, P@100, conversion-rate, and lift orderings hold across the\nintended prevalence axis (intro > intermediate > advanced).\n\n## Intended uses\n\n- Teaching baseline lead-scoring on a flat snapshot.\n- Teaching relational feature engineering against snapshot-safe tables.\n- Teaching leakage detection (the `total_touches_all` trap is\n  designed to be discoverable).\n- Teaching calibration, lift, P@K, value-aware ranking\n  (`expected_acv × P(convert)`), and cohort-shift evaluation.\n- Comparing model families under a controlled DGP.\n\n## Out-of-scope uses\n\n- **Production lead scoring.** The company, product, and customers are\n  fictional.\n- **Vendor benchmarking / paper baselines.** Difficulty tiers are\n  calibrated for pedagogy, not cross-paper comparability.\n- **Causal-inference research that requires recovery of the true DGP.**\n  The instructor companion exposes the hidden graph for teaching, not\n  designed counterfactuals.\n- **Demographic / fairness research.** v1 does not model protected\n  attributes.\n\n## Known limitations\n\n- **Tiers are a prevalence / noise axis, not a modelling-complexity\n  axis.** LR AUC is ~0.88 in every tier; the three tiers differ in\n  conversion rate (43% / 22% / 8%), noise scale, and missingness —\n  not in rank discrimination. Use AP, P@K, and calibration metrics\n  to see the difficulty gradient; AUC alone will not show it.\n- **93% account and contact overlap across train / test splits.** Random\n  splits are keyed on lead ID; most test accounts and contacts also\n  appear in train. Headline metrics over-state generalisation to unseen\n  accounts and contacts. Use `GroupKFold(account_id)` for a faithful\n  estimate.\n- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta\n  is slightly negative in every tier (intro −0.0045, intermediate\n  −0.0072, advanced −0.0133); v1's snapshot is dominated by linear\n  features. v2 will inject non-linear interactions in the simulator.\n- **Channel signal is weak.** Per\n  [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),\n  out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across\n  all tiers and the per-channel rate spread is ≤0.05. The simulator\n  does not encode channel-conditional probabilities; channel-conditional\n  encoding is post-v1 work.\n- **Cohort-shift degradation is small.** v1 has no time-of-year drift\n  baked in; the cohort-shift gate (G6.4) is informational and will\n  bite in v2.\n- **Advanced-tier noise can produce artifact zeros in count and duration\n  columns.** Gaussian noise is applied before MCAR missingness; the\n  snapshot builder clamps results below zero to zero. What users observe\n  is therefore not negative values but zeros that may be noise artifacts\n  rather than true zero values — e.g. `days_since_last_touch = 0` might\n  mean \"noised below zero, clamped\" rather than \"touched today\". Treat\n  suspicious zero clusters in the Advanced tier as intentional\n  data-cleaning exercise material.\n\n## Composition\n\n- **Entities.** Accounts, contacts, leads, touches, sessions,\n  sales_activities, opportunities (public); plus customers and\n  subscriptions (instructor only). Per-row counts per bundle live in\n  `manifest.json`.\n- **Features.** 31 public columns grouped by analytical role in\n  [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);\n  the per-bundle `feature_dictionary.csv` is the authoritative\n  machine-readable spec.\n- **Label.** `converted_within_90_days` (boolean), event-derived from\n  the simulator. Never sampled directly.\n- **Splits.** 70/15/15 train/valid/test, deterministic given seed;\n  recorded in `tasks/converted_within_90_days/task_manifest.json`.\n  Splits are keyed on `lead_id`; see the *Evaluation note* above for\n  the account-overlap caveat.\n- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package\n  version stamped in `manifest.json`.\n\n## Maintenance, adversarial framing, license\n\nWe *want* the dataset to be broken. The\n[break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues\nnine adversarial patterns to look for (leakage, split\ncontamination, ranking inversions, calibration drift) with\nworked-example pointers back into the notebooks. Issue\ntemplates ship under `.github/ISSUE_TEMPLATE/`: a\n[breakage report](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml)\nform for findings on the bundle itself, and a\n[realism feedback](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml)\nform for distributional critiques. Accepted findings are\nlogged in\n[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md).\nFile issues at\n[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);\nPRs welcome.\n\n| Field | Value |\n|---|---|\n| Generator | leadforge `1.0.0+` |\n| Recipe | `b2b_saas_procurement_v1` |\n| Canonical seed | 42 (cross-seed sweep: 42–46) |\n| Bundle schema version | 5 |\n| Format | Parquet (canonical) + CSV (convenience) |\n| License | MIT — see [LICENSE](LICENSE) |\n\nVerify integrity with `leadforge validate <bundle_dir>`; every file\nis hashed in `manifest.json`.\n",
   "expectedUpdateFrequency": "never",
   "id": "leadforge/leadforge-lead-scoring-v1",
   "image": "dataset-cover-image.png",
-  "isPrivate": true,
+  "isPrivate": false,
   "keywords": [
     "b2b",
     "classification",
diff --git a/scripts/package_hf_release.py b/scripts/package_hf_release.py
index 2c56208..3361492 100644
--- a/scripts/package_hf_release.py
+++ b/scripts/package_hf_release.py
@@ -89,8 +89,9 @@
 #: Public-tier configs land under ``intro``/``intermediate``/``advanced``;
 #: ``intro`` is the recommended entry point — students loading the dataset
 #: with no config argument land in the highest-prevalence tier, which is
-#: the most forgiving teaching context. ``intermediate`` is the default
-#: benchmark for graded assignments and course evaluation.
+#: the most forgiving teaching context. Pass ``default_config="intermediate"``
+#: for graded assignments; ``default_config="advanced"`` for calibration
+#: and noise-handling exercises.
 DEFAULT_DEFAULT_CONFIG: Final[str] = "intro"
 
 #: Allowed HF dataset-card ``task_categories`` token for tabular
diff --git a/scripts/package_kaggle_release.py b/scripts/package_kaggle_release.py
index 1fc8f8a..792c59f 100644
--- a/scripts/package_kaggle_release.py
+++ b/scripts/package_kaggle_release.py
@@ -821,7 +821,7 @@ def build_metadata(
         id=f"{owner}/{dataset_slug}",
         subtitle=subtitle,
         description=description,
-        isPrivate=True,
+        isPrivate=False,
         licenses=(LicenseSpec(name=license_name),),
         keywords=tuple(keywords),
         collaborators=(),
diff --git a/tests/scripts/test_package_kaggle_release.py b/tests/scripts/test_package_kaggle_release.py
index ec45f0b..9771765 100644
--- a/tests/scripts/test_package_kaggle_release.py
+++ b/tests/scripts/test_package_kaggle_release.py
@@ -632,3 +632,26 @@ def test_committed_kaggle_metadata_matches_fresh_regeneration(tmp_path: Path) ->
     # Per-tier metrics.json is also enumerated.
     for tier in packager.DEFAULT_TIERS:
         assert f"{tier}/metrics.json" in paths
+
+
+@pytest.mark.skipif(
+    not _COMMITTED_METADATA.exists(),
+    reason="committed dataset-metadata.json missing",
+)
+def test_committed_kaggle_metadata_is_not_private() -> None:
+    """Guard the isPrivate publish blocker from silently regressing.
+
+    Publishing with ``isPrivate: true`` hides the dataset from public
+    on Kaggle without raising an error.  The committed metadata must
+    have ``isPrivate: false``; the packager must also produce
+    ``isPrivate: false`` (covered by
+    ``test_committed_kaggle_metadata_matches_fresh_regeneration``
+    combined with this assertion on the committed file).
+    """
+    meta = json.loads(_COMMITTED_METADATA.read_text(encoding="utf-8"))
+    assert meta.get("isPrivate") is False, (
+        "dataset-metadata.json has isPrivate: true — this is a publish "
+        "blocker: the dataset will be hidden from public on Kaggle. "
+        "Fix: set isPrivate=False in build_metadata() in "
+        "scripts/package_kaggle_release.py, then re-run --dry-run."
+    )