From d3192fa24ec4cf8ef2607d1a7e28017401397913 Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Thu, 7 May 2026 23:57:46 +0300
Subject: [PATCH 1/4] PR 6.3: break-me guide + issue templates + v2 decision
 log
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 6 closer.  Five new artefacts plus three follow-up syncs:

- docs/release/break_me_guide.md — adversarial playbook.  Meta-recipe
  (read the dictionary → ablate, don't just probe → check the time
  window → treat the train/test split as untrusted) plus 9 patterns
  grouped by category.  Each pattern carries a "how to detect on any
  dataset" recipe and a worked-example pointer back into the v1 bundle
  (notebook §, validation_report JSON path, or feature_dictionary.csv
  field).  Delivers the three explicit promises notebook 04 §10 made
  (target-encoding leakage, train-test contamination via account_id,
  cohort-by-segment) plus six others (naming smells, standalone-AUC vs
  tree-ablation gap, time-window violations, value-aware ranking
  inversions, threshold-vs-rank ties, calibration drift across
  segments).

- docs/release/v2_decision_log.md — empty stub with the schema (7
  columns: received_at / source / topic / severity / verdict /
  next_step / link) and verdict vocabulary documented in the preamble.

- .github/ISSUE_TEMPLATE/dataset_breakage_report.yml — GitHub Issue
  Forms.  Fields: tier, seed, bundle hash, suggested triage label,
  severity, summary, repro, expected-vs-actual, environment, two
  confirmation checkboxes.

- .github/ISSUE_TEMPLATE/realism_feedback.yml — GitHub Issue Forms.
  Fields: aspect, tier(s)-affected, domain experience, claim, data
  observation, suggested fix, severity, two confirmations.

- release/README.md — "Maintenance, adversarial framing, license"
  section now links to the break-me guide, both issue templates, and
  the v2 decision log.  _release_common.py's existing relative-link
  rewriter handles the Kaggle/HF rendering automatically.  The
  regenerated release/kaggle/dataset-metadata.json and
  release/huggingface/README.md are bundled in this commit and pass
  the audit-artifact-sync tests.

Notebook 03 §7 and notebook 04 §10 forward-pointers upgraded from
plain `docs/release/break_me_guide.md` text to Markdown links pointing
at the GitHub blob URL.  A relative path would break on Kaggle/HF,
where notebooks ship without the docs/ tree; the blob URL works in
both contexts.

Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy
clean; leakage probes 0/3 on every public tier; hash determinism PASS
67/67; validate_release_candidate --no-rebuild exits 0;
BUNDLE_SCHEMA_VERSION unchanged at 5 (this PR is documentation-only).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .agent-plan.md                                |   3 +-
 .../dataset_breakage_report.yml               | 140 +++++++
 .github/ISSUE_TEMPLATE/realism_feedback.yml   | 116 ++++++
 docs/release/break_me_guide.md                | 341 ++++++++++++++++++
 docs/release/v2_decision_log.md               |  38 ++
 release/README.md                             |  18 +-
 release/huggingface/README.md                 |  18 +-
 release/kaggle/dataset-metadata.json          |   2 +-
 .../03_leakage_and_time_windows.ipynb         |   2 +-
 .../04_lift_calibration_value_ranking.ipynb   |   2 +-
 scripts/build_release_notebook_03.py          |  10 +-
 scripts/build_release_notebook_04.py          |  10 +-
 12 files changed, 676 insertions(+), 24 deletions(-)
 create mode 100644 .github/ISSUE_TEMPLATE/dataset_breakage_report.yml
 create mode 100644 .github/ISSUE_TEMPLATE/realism_feedback.yml
 create mode 100644 docs/release/break_me_guide.md
 create mode 100644 docs/release/v2_decision_log.md

diff --git a/.agent-plan.md b/.agent-plan.md
index 99821f3..33891c2 100644
--- a/.agent-plan.md
+++ b/.agent-plan.md
@@ -60,8 +60,7 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
 ### Phase 6 — Notebook sequence + adversarial framing
 - [x] PR 6.1: `release/notebooks/01_baseline_lead_scoring.ipynb` refreshed and `release/notebooks/02_relational_feature_engineering.ipynb` added.  Notebook 01 trains LR + HistGBM on the public `intermediate` bundle using the **same feature set as the validation report** (drops only IDs and the label, mirrors `release_quality._partition_columns`), so the G13.2 reproduction gate compares apples to apples.  This means notebook 01 **keeps** `total_touches_all` (the documented leakage trap) — narrative cell calls it out explicitly and forward-points to notebook 03 (PR 6.2) which dissects what dropping the trap does to performance.  Notebook 02 by contrast **drops** the trap from the flat baseline so the relational lift attribution stays clean (its goal is teaching feature engineering, not reproducing the report).  Targets are loaded at runtime from `release/notebooks/_release_targets.json` (audit-synced against `release/validation/validation_report.json` by `tests/release/notebooks/test_release_targets_match_report.py`); per-metric tolerances replace the original flat ±0.05 (AUC/Brier ±0.02, AP / top-decile ±0.05).  Notebook 02 loads the seven snapshot-safe public tables, asserts every event-table `timestamp <= lead_created_at + snapshot_day` inline (with real min-headroom-under-cutoff readings, not a hardcoded literal), demonstrates four legal joins (touch-channel breakdown, account-level density fit on **train leads only**, sales-activity recency, train-only industry target encoding), trains LR + GBM on flat-baseline-only and flat+relational features, prints a 4-row metric panel + delta panel, and pins the four model AUCs and the headline `GBM(eng) − GBM(flat)` lift via `assert_within_tolerance` (sign-aware `assert lift > 0` on top of the absolute tolerance).  Honest takeaway cell frames the +0.0147 AUC lift as suggestive, not conclusive (the cross-seed `gbm_auc` spread on this bundle is ~0.027); seed-sweep harness lands in PR 6.2's notebook 04.  
Both notebooks ship inside the public release bundle alongside the parquet tables (Kaggle/HF consumers download them together) so they import a sibling `release/notebooks/_notebook_utils.py` rather than rely on the `leadforge` package — `precision_at_k` and `top_decile_rate` mirror `release_quality._precision_at_k` / `_top_decile_rate` (locked in by mirror tests), and `assert_within_tolerance` is hardened against silent passes on non-finite metrics or incomplete per-metric tolerance maps.  G13.1 acceptance gate wired: new `[notebooks]` extra (`nbclient`, `nbformat`, `scikit-learn`, `matplotlib`) and a dedicated `notebooks` CI job that regenerates the intermediate bundle via `python scripts/build_public_release.py release --tier intermediate` (only tier the notebooks need) then nbclient-executes both notebooks end-to-end (`tests/release/notebooks/test_execute_notebooks.py`, parametrised, gated on bundles-present).  G13.3 path discipline enforced inline: notebook 01 hard-codes `BUNDLE = Path("../intermediate")` and asserts `manifest.exposure_mode == "student_public"`; notebook 02 explicitly excludes `customers`/`subscriptions` per `BANNED_TABLES`.  Builders (`scripts/build_release_notebook_{01,02}.py`, sharing `scripts/_release_notebook_common.py`) emit deterministic byte-for-byte notebook JSON via explicit `cell_NNN` IDs (audit-artifact-sync pattern from PR 4.1 / 5.1 / 5.2, locked in by `tests/scripts/test_release_notebook_builders.py` which builds twice into `tmp_path` via the new `--out PATH` flag and diffs against the committed file without ever touching the working tree) and shell out to `ruff format` on the emitted file so builder output and pre-commit hook agree.  Net: 1250/1250 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
 - [x] PR 6.2: `release/notebooks/03_leakage_and_time_windows.ipynb` and `release/notebooks/04_lift_calibration_value_ranking.ipynb` added.  Notebook 03 turns the documented `total_touches_all` trap into a teaching moment: reads the trap label off `feature_dictionary.csv`, proves the trap by construction via a same-table comparison of `total_touches_all` (full-horizon) vs `touch_count` (snapshot-safe) — the post-snapshot delta sums to ~3.2 touches/lead and 82 % of leads have a positive delta — and then runs a standalone-AUC probe on the trap (~0.53 AUC, looks innocuous) followed by a side-by-side full-panel ± trap ablation that shows HistGBM extracts ~+0.032 AUC from the same column LR can only squeeze ~+0.009 from.  The reframed pedagogy (vs the prompt's original "trap dominates a thin firmographic set" framing) is empirically driven: firmographic-only is at chance AUC even with the trap, but the GBM-vs-LR asymmetry on the strong panel is a real and useful finding — *standalone AUC probes undersell tree-friendly leakage*.  Sign-aware tolerance gate pins each AUC ±0.02 and asserts `gbm_lift > 0.015` so a future regeneration that erases the trap or accidentally amplifies it breaks CI.  
Notebook 04 covers the four extra ranking lenses AUC alone misses: calibration / reliability diagram (max bin error ≈ 0.13), lift + cumulative gains (top-decile lift 2.75×), value-aware ranking via `expected_acv × P(convert)` (top-50 ACV-capture jumps from 0.16 to 0.40), threshold selection for fixed top-K capacity, cohort-shift evaluation (HistGBM on the first 85 % chronologically → score the last 15 %, mirrors `release_quality.measure_cohort_shift_from_bundle` down to `COHORT_TRAIN_FRAC=0.85` and `model_random_state=0`, **reproduces the report's `cohort_shift.intermediate` block exactly**: 0.8754 / 0.8908 / −0.0155), and a 200-iter bootstrap of the test-set AUC/AP as the within-bundle confidence band that public-bundle consumers (Kaggle / HF) can run without `leadforge` installed (the prompt's "seed-sweep harness" with bootstrap honestly acknowledged as the proxy for true cross-seed sweep, since rebuilding bundles isn't an option for downstream users).  Cohort-shift values are pinned via a new `cohort_shift.intermediate` block in `release/notebooks/_release_targets.json` (audit-synced against `validation_report.cohort_shift.intermediate` by a new `test_cohort_shift_targets_match_validation_report` extension to the existing audit-sync test).  Headline LR/GBM panel **drops** `total_touches_all` (matches notebook 02's posture, gives honest production numbers); cohort-shift section deliberately **keeps** the trap to reproduce the report's published cohort-shift numbers exactly — divergent posture explained inline.  Both new builders (`scripts/build_release_notebook_{03,04}.py`) inherit the deterministic-cell-ID + `--out` byte-stability pattern from PR 6.1 and are added to `_BUILDERS` / `_NOTEBOOKS` in `tests/scripts/test_release_notebook_builders.py` and `tests/release/notebooks/test_execute_notebooks.py`.  
Both notebooks execute end-to-end in <10s each (well under G13.1's 3-min budget), assert `manifest.exposure_mode == "student_public"` (G13.3), and load only from `release/intermediate/`.  Forward-pointer to `docs/release/break_me_guide.md` left as plain backtick-wrapped text — file lands in PR 6.3, no dead Markdown link.  Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
-- [ ] `.github/ISSUE_TEMPLATE/{dataset_breakage_report,realism_feedback}.yml`
-- [ ] `docs/release/{break_me_guide,v2_decision_log}.md`
+- [x] PR 6.3: adversarial framing landed.  `docs/release/break_me_guide.md` (new) — meta-recipe playbook organised as a 4-step recipe (read the dictionary → ablate, don't just probe → check the time window → treat the train/test split as untrusted) + 9 patterns grouped by category (leakage / split discipline / metric and ranking traps / robustness and realism).  Each pattern carries a "how to detect on any dataset" recipe and a "worked example" pointer back into the v1 bundle (notebook §, validation_report JSON path, or feature_dictionary.csv field), so the guide extends the notebooks rather than duplicating them.  Three explicit promises notebook 04 §10 made are delivered: target-encoding leakage on test (pattern 4, anchored on NB02 §4.4), train-test contamination via `account_id` overlap (pattern 5, with the honest "v1 only checks `lead_id`, not `account_id`" caveat), cohort-by-segment evaluation (pattern 6, extends NB04 §7's tier-wide cohort-shift to per-segment using the actual segment columns: `industry`, `region`, `employee_band`, `estimated_revenue_band`).  Other 6 patterns: naming smells, standalone-AUC vs tree-ablation gap (NB03 finding generalised), time-window violations on engineered features (with the `customers`-table example), value-aware ranking surprises (P × ACV vs P-only), threshold-vs-rank ties at the operating point (NB04 §6 finding), calibration drift across cohorts and segments.  Triage-label table at the top (`critical-leakage` / `realism` / `difficulty` / `documentation` / `platform` / `notebook` / `pedagogy` / `v2-idea` / `out-of-scope-v1`) gives reporters a vocabulary; the same labels are auto-applied (`needs-triage`) by the issue templates.  
`docs/release/v2_decision_log.md` (new, empty stub) — schema documented in the file's preamble (7 columns: `received_at` / `source` / `topic` / `severity` / `verdict` / `next_step` / `link`; verdict vocabulary `accepted-for-v2` / `deferred` / `wont-fix` / `needs-investigation` with explicit semantics for each).  `.github/ISSUE_TEMPLATE/dataset_breakage_report.yml` (new) and `.github/ISSUE_TEMPLATE/realism_feedback.yml` (new) — GitHub Issue Forms YAML, both carry the `dataset: leadforge-lead-scoring-v1` + `needs-triage` labels.  Breakage report: tier dropdown (intro / intermediate / advanced / instructor / multiple), seed input (default 42), bundle hash field (validation: required), suggested triage label dropdown, severity dropdown (high/medium/low), summary, minimal repro, expected-vs-actual citing JSON paths, environment, two confirmation checkboxes (read break-me guide; reporting on as-shipped bundle).  Realism feedback: aspect dropdown (industry mix / persona / funnel timing / channel / pricing / account-to-lead density / region / other), tier(s)-affected dropdown, domain-experience one-liner (required — helps weight findings), claim, data observation (with concrete pandas-snippet placeholder example), suggested fix (optional), severity, two confirmations (read README "Known limitations"; checked post_v1_roadmap + v2_decision_log).  Notebook 03 §7 and notebook 04 §10 forward-pointers upgraded from plain `docs/release/break_me_guide.md` text to Markdown links pointing at the GitHub blob URL (`https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md`) — relative path would break on Kaggle/HF where notebooks ship without the `docs/` tree, the blob URL works in both contexts.  
`release/README.md` "Maintenance, adversarial framing, license" section rewritten: dead "(PR 6.3)" forward-pointers replaced with real Markdown links to the break-me guide, both issue templates, and the v2 decision log; `_release_common.py`'s existing `](../foo)` → GitHub-blob-URL rewriter handles the Kaggle/HF rendering automatically (verified by the regenerated `release/kaggle/dataset-metadata.json` and `release/huggingface/README.md` sync tests).  Hostile-reviewer self-review caught two factual hallucinations in the first revision before they shipped: claimed "15 industries" for `industry` (actually 4: logistics / healthcare_non_clinical / manufacturing / professional_services) and used loose segment-column names ("employee tier", "ARR band") instead of the actual columns (`employee_band`, `estimated_revenue_band`); both fixed.  Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR is documentation-only).  Phase 6 closed — Phase 7 (LLM critique + publish) is next.
 
 ### Phase 7 — LLM critique + publish
 - [ ] `leadforge/validation/llm_critique.py` (single-provider, env-var creds, skips cleanly)
diff --git a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml
new file mode 100644
index 0000000..0923386
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml
@@ -0,0 +1,140 @@
+name: Dataset breakage report
+description: I broke leadforge-lead-scoring-v1 — leakage, train/test contamination, ranking surprise, or a notebook that won't execute. See docs/release/break_me_guide.md for the playbook this template is built around.
+title: "[breakage] "
+labels: ["dataset: leadforge-lead-scoring-v1", "needs-triage"]
+body:
+  - type: markdown
+    attributes:
+      value: |
+        Thank you for breaking the dataset on purpose. This template is for findings that affect *what's in the bundle*: leakage, split contamination, metric inversions, notebook failures. Distributional / realism critiques (e.g. "industry mix doesn't look like real procurement") belong in the [realism feedback template](https://github.com/leadforge-dev/leadforge/issues/new?template=realism_feedback.yml) instead.
+
+        The [break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues the patterns this template is shaped around.
+
+  - type: dropdown
+    id: tier
+    attributes:
+      label: Tier
+      description: Which bundle tier did you break? Pick `instructor` if the finding is on the `intermediate_instructor` companion.
+      options:
+        - intro
+        - intermediate
+        - advanced
+        - instructor
+        - multiple
+    validations:
+      required: true
+
+  - type: input
+    id: seed
+    attributes:
+      label: Seed
+      description: Generation seed of the bundle you used. Default canonical seed is 42; the published cross-seed sweep covers 42–46.
+      placeholder: "42"
+      value: "42"
+    validations:
+      required: true
+
+  - type: input
+    id: bundle_hash
+    attributes:
+      label: Bundle hash
+      description: Paste a hash that identifies the exact bundle. Either `manifest.json` → `bundle_hash` (if your build records one), or the output of `sha256sum tasks/converted_within_90_days/test.parquet` run in your local checkout. This keeps the report unambiguous if we regenerate the bundle.
+      placeholder: "sha256:abc123… (or manifest.json bundle_hash)"
+    validations:
+      required: true
+
+  - type: dropdown
+    id: triage_label
+    attributes:
+      label: Suggested triage label
+      description: Best guess from the [triage table](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md#triage-labels) at the top of the break-me guide. The maintainer applies the final label.
+      options:
+        - critical-leakage
+        - difficulty
+        - documentation
+        - notebook
+        - pedagogy
+        - platform
+        - v2-idea
+        - out-of-scope-v1
+        - I don't know
+    validations:
+      required: true
+
+  - type: dropdown
+    id: severity
+    attributes:
+      label: Severity (your assessment)
+      description: |
+        - **high**: blocks v1 release or downstream usage (label reconstructed via undocumented path; notebook fails on a clean checkout).
+        - **medium**: meaningful for downstream users but workaround exists (segment-conditional miscalibration; ties at threshold inflate slate).
+        - **low**: cosmetic or pedagogical (documentation drift, missing footnote).
+      options:
+        - high
+        - medium
+        - low
+    validations:
+      required: true
+
+  - type: textarea
+    id: summary
+    attributes:
+      label: Summary
+      description: One paragraph. What did you find?
+      placeholder: |
+        On the intermediate tier I trained a HistGBM with `account_avg_touches` joined in via account_id and got AUC 0.97 vs the reported 0.886. The split has account_id collisions across train/test, so the engineered feature is leaking the test labels through the account.
+    validations:
+      required: true
+
+  - type: textarea
+    id: repro
+    attributes:
+      label: Minimal reproduction
+      description: |
+        The smallest code or command sequence that reproduces the finding. Prefer runnable Python / shell over screenshots. If it's a notebook reproduction, name the notebook and section.
+      placeholder: |
+        ```python
+        import pandas as pd  # run from release/notebooks/ against intermediate/
+        train = pd.read_parquet("../intermediate/tasks/converted_within_90_days/train.parquet")
+        test = pd.read_parquet("../intermediate/tasks/converted_within_90_days/test.parquet")
+        overlap = set(train["account_id"]) & set(test["account_id"])
+        print(f"shared accounts: {len(overlap)}")  # > 0 ⇒ this template applies
+        ```
+      render: markdown
+    validations:
+      required: true
+
+  - type: textarea
+    id: expected_actual
+    attributes:
+      label: Expected vs actual
+      description: What did you expect, and what did you observe? Cite a `validation_report.json` JSON path or a notebook tolerance gate if relevant.
+      placeholder: |
+        Expected: `tiers.intermediate.medians.gbm_auc` ≈ 0.876 (validation_report.json).
+        Actual: 0.97 with the engineered feature, dropping to 0.88 under GroupKFold(account_id).
+    validations:
+      required: true
+
+  - type: textarea
+    id: environment
+    attributes:
+      label: Environment
+      description: Anything reproducibility-relevant — Python version, package versions, OS. `pip freeze | grep -E "leadforge|scikit|pandas"` is enough for most reports.
+      placeholder: |
+        leadforge==1.0.0
+        scikit-learn==1.5.0
+        pandas==2.2.2
+        Python 3.11.9, macOS 14.5
+      render: text
+    validations:
+      required: false
+
+  - type: checkboxes
+    id: confirmations
+    attributes:
+      label: Confirmations
+      options:
+        - label: I read the [break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) and this finding isn't already an explicit v1 simplification documented there or in the dataset card.
+          required: true
+        - label: I'm reporting a finding on the as-shipped public bundle, not on a privately modified copy. (If you modified the bundle, open a `realism` issue or a discussion instead.)
+          required: true
diff --git a/.github/ISSUE_TEMPLATE/realism_feedback.yml b/.github/ISSUE_TEMPLATE/realism_feedback.yml
new file mode 100644
index 0000000..07091f2
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/realism_feedback.yml
@@ -0,0 +1,116 @@
+name: Realism feedback
+description: A modelled distribution in leadforge-lead-scoring-v1 doesn't match what a domain expert would expect — industry mix, persona behaviour, funnel timing, channel attribution, pricing, etc. For *bundle-level* findings (leakage, contamination, ranking inversions), use the breakage report instead.
+title: "[realism] "
+labels: ["dataset: leadforge-lead-scoring-v1", "realism", "needs-triage"]
+body:
+  - type: markdown
+    attributes:
+      value: |
+        Realism feedback is one of the highest-value report types we can receive — the simulator can be calibrated against any concrete observation about the real B2B procurement world. Examples we welcome: "industry distribution overweights manufacturing", "VP Finance personas convert too fast", "expected_acv is uncorrelated with industry but should track it strongly".
+
+        We're particularly interested in findings backed by domain experience or public data. The dataset card already documents some intentional simplifications; please skim [`release/README.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/README.md) and [`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md) before filing — if the issue is already listed under "Known limitations" or "Out-of-scope uses", a thumbs-up on an existing issue is more useful than a new one.
+
+  - type: dropdown
+    id: aspect
+    attributes:
+      label: Which aspect of the dataset?
+      options:
+        - industry mix / firmographics
+        - persona behaviour / personographics
+        - funnel timing (snapshot day, conversion horizon, stage progression)
+        - channel attribution (lead_source, first_touch_channel, conversion-by-channel)
+        - pricing / ACV distribution
+        - account-to-lead density (multi-lead accounts, lead/account ratio)
+        - regional distribution (US/UK split, regional conversion differences)
+        - other (please describe in the claim)
+    validations:
+      required: true
+
+  - type: dropdown
+    id: tier
+    attributes:
+      label: Which tier(s) does this affect?
+      description: Realism findings often hold across all tiers because tiers differ in *signal strength*, not in the underlying simulator. Pick `all` if you didn't specifically check one tier.
+      options:
+        - intro
+        - intermediate
+        - advanced
+        - instructor companion
+        - all (didn't check tier-specifically)
+    validations:
+      required: true
+
+  - type: textarea
+    id: domain_experience
+    attributes:
+      label: Domain experience (one-line)
+      description: A short note on why your perspective is informed — "AE at a procurement SaaS for 6 years", "academic study of B2B funnel CR", "ex-RevOps at a $50M ARR mid-market vendor". This isn't gatekeeping; it helps the maintainer weight the finding when several land in parallel.
+      placeholder: "5y experience as RevOps lead at a procurement SaaS in the same ARR band as Veridian Procure."
+    validations:
+      required: true
+
+  - type: textarea
+    id: claim
+    attributes:
+      label: Claim
+      description: What does the dataset get wrong, and what would you expect instead? One paragraph.
+      placeholder: |
+        The intermediate tier shows a 22% conversion rate on the manufacturing industry slice, identical (within noise) to the rate on healthcare_non_clinical and logistics. In real procurement / AP automation, manufacturing typically converts ~1.5x healthcare because manufacturing already has discrete-item AP volume that benefits more from automation. The dataset's conversion rate should differ across industries by 1.3-2x, not be flat.
+    validations:
+      required: true
+
+  - type: textarea
+    id: data_observation
+    attributes:
+      label: Data observation supporting the claim
+      description: |
+        Evidence from the bundle, public benchmarks, or your own data. A pandas snippet that reads the bundle and prints the relevant numbers is ideal.
+      placeholder: |
+        ```python
+        leads = pd.read_parquet("intermediate/tables/leads.parquet")
+        accts = pd.read_parquet("intermediate/tables/accounts.parquet")
+        tasks = pd.read_parquet("intermediate/tasks/converted_within_90_days/test.parquet")
+        joined = tasks.merge(leads[["lead_id", "account_id"]], on="lead_id").merge(
+            accts[["account_id", "industry"]], on="account_id"
+        )
+        print(joined.groupby("industry")["converted_within_90_days"].mean())
+        # All industries within 0.02 of 0.22 — flat across industry.
+        ```
+      render: markdown
+    validations:
+      required: true
+
+  - type: textarea
+    id: suggested_fix
+    attributes:
+      label: Suggested fix (optional)
+      description: How would you change the simulator to match? Even a rough direction ("make `industry` modulate `ConversionHazard` weights") helps. Leave blank if you'd rather flag the problem than prescribe the fix.
+      placeholder: |
+        Add a per-industry scalar to `ConversionHazard` in `leadforge/mechanisms/hazards.py` keyed off `accounts.industry`, so the manufacturing path sees ~1.5x base hazard relative to the modal industry. Acceptance band would live in `v1_acceptance_gates_bands.yaml`.
+    validations:
+      required: false
+
+  - type: dropdown
+    id: severity
+    attributes:
+      label: Severity (your assessment)
+      description: |
+        - **high**: the dataset's pedagogy is materially misleading — students learn a wrong intuition about the real domain.
+        - **medium**: the gap is real but the lesson still transfers (e.g. flat per-industry rates teach baseline discipline even if production has more variance).
+        - **low**: cosmetic / footnote.
+      options:
+        - high
+        - medium
+        - low
+    validations:
+      required: true
+
+  - type: checkboxes
+    id: confirmations
+    attributes:
+      label: Confirmations
+      options:
+        - label: I checked [`release/README.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/README.md) "Known limitations" / "Out-of-scope uses" and this finding isn't already explicitly documented.
+          required: true
+        - label: I checked [`docs/release/post_v1_roadmap.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/post_v1_roadmap.md) and the existing [v2 decision log](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md), and this isn't already an accepted v2 work item.
+          required: true
diff --git a/docs/release/break_me_guide.md b/docs/release/break_me_guide.md
new file mode 100644
index 0000000..3f8e844
--- /dev/null
+++ b/docs/release/break_me_guide.md
@@ -0,0 +1,341 @@
+# Break Me — adversarial playbook for `leadforge-lead-scoring-v1`
+
+We *want* this dataset to be broken on purpose. The notebooks
+ship the headline walkthroughs (notebook 03 dissects the
+documented `total_touches_all` trap; notebook 04 covers
+calibration, value-aware ranking, and cohort shift). This guide
+is the **meta-recipe**: the patterns to look for on any
+synthetic teaching dataset, with worked-example pointers back
+into the v1 bundle so each pattern is grounded in a number
+you can reproduce.
+
+If you find one of these on `leadforge-lead-scoring-v1`,
+file an issue using one of the templates in
+[`.github/ISSUE_TEMPLATE/`](https://github.com/leadforge-dev/leadforge/tree/main/.github/ISSUE_TEMPLATE).
+Accepted findings are logged in
+[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md).
+
+## Triage labels
+
+When you file an issue, suggest one of these labels in the
+title or body. The maintainer applies the final label.
+
+| Label | When |
+|---|---|
+| `critical-leakage` | The dataset reconstructs the label via a path that wasn't documented. Highest priority — blocks v1 if reproducible on the as-shipped bundle. |
+| `realism` | A modelled distribution disagrees with what a domain expert expects (industry mix, persona behaviour, funnel timing, channel attribution, pricing). Belongs in the realism issue template. |
+| `difficulty` | A tier sits outside its declared band on a metric documented in `release/validation/validation_report.md`. Likely a band recalibration in v2. |
+| `documentation` | A claim in the dataset card or notebooks doesn't match the artefact. Cheap to fix; please file. |
+| `platform` | Kaggle / HF artefact issue (broken link, malformed YAML, schema mismatch). Phase 5 territory. |
+| `notebook` | A notebook fails to execute, or its tolerance gate fires on a fresh checkout. |
+| `pedagogy` | The teaching framing is misleading even though the artefact is technically correct. |
+| `v2-idea` | A capability worth adding (cohort drift, channel-conditional probabilities, non-linear motifs). |
+| `out-of-scope-v1` | True observation, but explicitly deferred — the dataset card already documents it as a v1 simplification. |
+
+## The meta-recipe
+
+Notebook 03 §7 introduces a three-step recipe (read the feature
+dictionary → ablate, don't just probe → check the time window).
+This guide adds a fourth step that the notebook doesn't
+cover, then organises the pattern catalogue around the
+four steps.
+
+1. **Read the feature dictionary first.** Every public bundle
+   ships `feature_dictionary.csv` with a `leakage_risk` column.
+   Treat that as the primary leakage audit before any modelling.
+2. **Ablate, don't just probe.** A standalone-AUC probe on a
+   single feature can rate a column as ~0.5 AUC while a tree
+   model extracts non-trivial lift from the same column once
+   it can combine it with the rest of the panel. Notebook 03
+   §4–§5 demonstrate the gap on `total_touches_all`
+   (standalone 0.531 → GBM lift +0.032 vs LR lift +0.009).
+3. **Check the time window.** If you have any event table
+   with timestamps, cross-check every aggregate feature against
+   `lead_created_at + snapshot_day`. The validation report's
+   `post_snapshot_aggregates` baseline (`$.tiers.intermediate.per_seed[*].baselines.post_snapshot_aggregates`)
+   bench-tests this same idea at scale.
+4. **Treat the train/test split as untrusted.** The split file
+   says one thing; what the model sees during fitting is what
+   matters. Patterns 5 and 6 below cover the most common ways
+   the two diverge.
+
+The pattern catalogue below maps each pattern to the recipe
+step it operationalises.
+
+---
+
+## Leakage patterns
+
+### 1. Naming smells the dictionary should already flag
+
+A column whose name mentions `total`, `all`, `lifetime`,
+`final`, `outcome`, or any other term implying a window the
+snapshot can't have observed is suspicious by default on a
+snapshot-anchored task. `leadforge-lead-scoring-v1` ships exactly one
+such column — `total_touches_all` — and the
+`feature_dictionary.csv` row for it sets `leakage_risk=True`
+and explains *why* in the description.
+
+**How to detect on any dataset.** Grep the column list for
+`*_total`, `*_all`, `*_lifetime`, `*_final`, `*_outcome`,
+`current_*`, `is_*` (especially `is_won`, `is_closed`).
+Cross-check each hit against the dataset's stated prediction
+horizon and snapshot anchor. If the column name implies a
+window the snapshot can't have observed, the dictionary should
+either flag it or rename it; if neither, that's a `documentation`
+issue at minimum and probably `critical-leakage`.
+
+**Worked example.** Notebook 03 §2 shows the dictionary read
+in three lines of pandas; the column it surfaces is
+`total_touches_all`.
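+
+A minimal sketch of that name scan, using only the standard
+library. The glob list mirrors the patterns named above; the
+column list is an illustrative stand-in, and
+`flag_naming_smells` is a hypothetical helper, not part of
+leadforge.
+
+```python
+from fnmatch import fnmatch
+
+# Globs taken from the detection recipe above.
+SMELL_GLOBS = ("*_total", "*_all", "*_lifetime", "*_final",
+               "*_outcome", "current_*", "is_*")
+
+
+def flag_naming_smells(columns, globs=SMELL_GLOBS):
+    """Return {column: [matching globs]} for suspicious names."""
+    hits = {}
+    for col in columns:
+        matched = [g for g in globs if fnmatch(col, g)]
+        if matched:
+            hits[col] = matched
+    return hits
+
+
+# Illustrative column list; swap in your dataset's columns.
+columns = ["lead_id", "touch_count", "total_touches_all",
+           "is_won", "current_stage", "industry"]
+smells = flag_naming_smells(columns)
+for col, matched in smells.items():
+    print(f"{col}: matches {matched}")
+```
+
+A hit is only a prompt to check the column against the
+snapshot anchor; it is not proof of leakage on its own.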
+
+### 2. The standalone-AUC undersell (tree-friendly leakage)
+
+A feature can score ~0.5 AUC as a single-column ranker and
+still hand a tree model material lift once interactions with
+other columns are available. The validation report's
+`post_snapshot_aggregates` baseline (a fitted LR on the trap
+column alone) gives ~0.55 AUC, so the trap looks innocuous on
+a standalone audit. Notebook 03 §5 then runs a full-panel
+ablation in which HistGBM extracts +0.032 AUC, while LR with
+the same preprocessing extracts only +0.009 because it can't
+represent the relevant interaction.
+
+**How to detect on any dataset.** Don't audit leakage with
+single-feature AUC. For every column you flagged in pattern 1,
+fit two tree models on the same train/test split — one with
+the column, one without — and read the AUC delta. A delta
+larger than your sampling noise is a flag, regardless of the
+standalone number.
+
+**Worked example.** Notebook 03 §4 (standalone) and §5
+(ablation), with the side-by-side bar chart in §5.1. The
+sign-aware tolerance gate in §6 (`MIN_GBM_LIFT = 0.015`)
+formalises the asymmetry as a CI assertion.
+
+### 3. Time-window violations on engineered features
+
+The non-negotiable rule: no feature on a snapshot-anchored
+task may use events later than `lead_created_at + snapshot_day`.
+The public bundle's event tables (`touches`, `sessions`,
+`sales_activities`, `opportunities`) are pre-filtered to
+satisfy this rule (notebook 02 §3 verifies the contract on
+the bundle as shipped, including a *minimum headroom under
+cutoff* readout). The hazard you can still create yourself is
+to engineer a feature that joins back to a non-event table
+without filtering — for instance, joining `customers` (which
+exists only for *converted* leads) into a feature panel.
+
+**How to detect on any dataset.** For every per-lead
+aggregate you build, write the query as `SELECT … WHERE
+event.timestamp <= lead.created_at + INTERVAL '<snapshot_day> days'`
+explicitly, even when the underlying table is already filtered.
+Then run the same SQL against both the instructor companion
+(full-horizon tables) and the public bundle. If the two
+results differ, the feature relies on rows that exist only
+in the unfiltered view.
+
+**Worked example.** Notebook 02 §3 implements the per-table
+inline assertion. The validation report's
+`$.tiers.<tier>.per_seed[*].baselines.post_snapshot_aggregates`
+HistGBM AUC documents what a model can recover when the rule
+is intentionally violated.
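+
+The same cutoff can be sketched in plain Python. This is an
+illustration under stated assumptions: `SNAPSHOT_DAY = 30` is
+an assumed value (read the real one from the bundle's
+`manifest.json`), and `touches_within_snapshot` is a
+hypothetical helper with made-up timestamps.
+
+```python
+from datetime import datetime, timedelta
+
+SNAPSHOT_DAY = 30  # assumed; take the real value from manifest.json
+
+
+def touches_within_snapshot(lead_created_at, touch_times,
+                            snapshot_day=SNAPSHOT_DAY):
+    """Count touches at or before lead_created_at + snapshot_day."""
+    cutoff = lead_created_at + timedelta(days=snapshot_day)
+    return sum(1 for t in touch_times if t <= cutoff)
+
+
+# Made-up lead with touches on days 2, 15, 30, 31 and 60.
+created = datetime(2025, 1, 1)
+touch_times = [created + timedelta(days=d) for d in (2, 15, 30, 31, 60)]
+
+safe_count = touches_within_snapshot(created, touch_times)
+naive_count = len(touch_times)  # what an unfiltered aggregate returns
+print(safe_count, naive_count)
+```
+
+If `safe_count` and `naive_count` differ on a table you
+believed was pre-filtered, the aggregate is reading
+post-snapshot rows.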
+
+### 4. Target-encoding leakage on test
+
+Mean-target encoding of a categorical feature is a textbook
+hazard: fit the encoding on the *full* train+test population
+and you've leaked test labels into the feature. Notebook 02
+§4.4 demonstrates the train-only-fit posture on `industry`
+(four industries — logistics, healthcare_non_clinical,
+manufacturing, professional_services — encoded by their
+training-split conversion rate, with a global-mean fallback
+for industries not seen in train). The leakage version is a
+one-line change, using
+`pd.concat([train, test]).groupby('industry')['target'].mean()`
+instead, and we deliberately *don't* show that in the
+notebook because the lesson is the discipline, not the trap.
+
+**How to detect on any dataset.** When mean-target encoding
+shows up in a notebook or pipeline, check three things in
+order: (a) the encoding's `.fit()` call sees only training
+labels; (b) the same encoding is applied to test via merge
+or join, never re-fitted; (c) categories present in test but
+not train fall back to a deterministic value (global mean is
+fine; computing a fallback from test is not). If the encoding
+is fit on test labels even partially — including via a
+"smoothed" encoder that uses pooled train+test counts — you
+have target leakage.
+
+**Worked example.** Notebook 02 §4.4 (train-only fit) and
+§4.5 (the merge that applies the encoding to test). The
+fallback-to-train-mean handling is in `attach_engineered`.
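+
+Checks (a) to (c) can be sketched without any library. This
+is not the notebook's `attach_engineered` implementation; the
+two helper functions and the toy rows are hypothetical and
+only illustrate the fit-on-train, look-up-on-test shape.
+
+```python
+from collections import defaultdict
+
+
+def fit_target_encoding(train_rows, key, target):
+    """Per-category mean of the target, fit on training rows only."""
+    sums, counts = defaultdict(float), defaultdict(int)
+    for row in train_rows:
+        sums[row[key]] += row[target]
+        counts[row[key]] += 1
+    encoding = {cat: sums[cat] / counts[cat] for cat in counts}
+    global_mean = sum(sums.values()) / sum(counts.values())
+    return encoding, global_mean
+
+
+def apply_target_encoding(rows, key, encoding, fallback):
+    """Look up the fitted encoding; never re-fit on these rows."""
+    return [encoding.get(row[key], fallback) for row in rows]
+
+
+train = [
+    {"industry": "logistics", "target": 1},
+    {"industry": "logistics", "target": 0},
+    {"industry": "manufacturing", "target": 0},
+    {"industry": "manufacturing", "target": 0},
+]
+test = [
+    {"industry": "logistics"},
+    {"industry": "healthcare_non_clinical"},  # unseen in train
+]
+
+encoding, global_mean = fit_target_encoding(train, "industry", "target")
+encoded_test = apply_target_encoding(test, "industry", encoding, global_mean)
+print(encoded_test)
+```
+
+The test rows carry no `target` key at all, so any code path
+that tries to read test labels fails loudly instead of
+leaking silently.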
+
+---
+
+## Split discipline
+
+### 5. Train-test contamination
+
+The bundle ships a deterministic 70/15/15 split on `lead_id`
+(see `tasks/<task>/task_manifest.json`). That guarantees
+`lead_id` uniqueness across splits — but `account_id` is
+*not* split on. Two leads in the same account can land in
+train and test, and the model can ride strong account-level
+signal across the split boundary in ways that don't generalise
+to a fresh account.
+
+**How to detect on any dataset.** Compute the intersection
+of `account_id` (or whatever the per-entity grouping key is)
+between train and test. If it's non-empty *and* you've
+engineered any account-level features, retrain with
+account-level grouped splitting (e.g. `GroupKFold` on
+`account_id`) and re-read the AUC delta. The delta is the
+amount of "free" lift the random split was buying you. The
+right framing isn't "remove the leak"; it's *report both
+numbers so the reader knows which is which.*
+
+**Worked example.** Notebook 02 §4.2 builds an account-level
+density feature using *only* train leads' touches — a
+defensive posture against this hazard. The
+`tasks/converted_within_90_days/task_manifest.json` records
+the split policy and is the right artefact to cite when filing
+an issue under this label. A bundle-level `account_id`
+overlap audit isn't included in v1 — the validation report's
+split-leakage probe (`probe_split_id_overlap`) checks
+`lead_id` only.
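+
+Since v1 ships no such audit, here is a minimal sketch of one.
+`group_overlap` is a hypothetical helper and the rows are toy
+data; only the `lead_id` and `account_id` column names come
+from the bundle.
+
+```python
+def group_overlap(train_rows, test_rows, key="account_id"):
+    """Return shared group ids and the share of test rows they cover."""
+    train_groups = {row[key] for row in train_rows}
+    shared = train_groups & {row[key] for row in test_rows}
+    covered = sum(1 for row in test_rows if row[key] in shared)
+    return shared, covered / len(test_rows)
+
+
+# Toy split: lead_id is disjoint, account_id is not.
+train = [
+    {"lead_id": 1, "account_id": "A1"},
+    {"lead_id": 2, "account_id": "A1"},
+    {"lead_id": 3, "account_id": "A2"},
+    {"lead_id": 4, "account_id": "A3"},
+]
+test = [
+    {"lead_id": 5, "account_id": "A1"},
+    {"lead_id": 6, "account_id": "A4"},
+    {"lead_id": 7, "account_id": "A3"},
+    {"lead_id": 8, "account_id": "A5"},
+]
+
+shared, share = group_overlap(train, test)
+print(sorted(shared), share)
+```
+
+A non-zero share is the trigger for the grouped-split
+retraining described above; the overlap number itself belongs
+in the issue report next to both AUCs.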
+
+### 6. Cohort-by-segment evaluation
+
+Notebook 04 §7 demonstrates **tier-wide** cohort shift (sort
+leads chronologically, train on the first 85 %, score the
+last 15 %) and finds that the intermediate tier's cohort-split
+AUC sits *higher* than random-split AUC by ~0.0155 (the v1
+simulator has no time drift baked in over the 90-day horizon).
+The richer stress test is **per-segment** cohort shift:
+chronological resplit *within* each industry, region, or
+revenue tier, and read the same delta per segment. Segment-
+conditional drift can hide inside a stable tier-wide number
+— industry A drifting up by 0.04 cancels industry B drifting
+down by 0.04 in the average.
+
+**How to detect on any dataset.** For each segment column
+(`industry`, `region`, `employee_band`,
+`estimated_revenue_band`), repeat the cohort-split protocol
+from notebook 04 §7 conditioned on that segment. Report the
+per-segment AUC degradation and the spread across segments.
+A spread larger than your tier-wide cross-seed band
+(`$.tiers.<tier>.spreads.lr_auc`) is a realism flag — the
+simulator is producing a homogeneous world that real
+production cohorts wouldn't be.
+
+**Worked example.** Notebook 04 §7 (tier-wide, validator-
+mirrored). The validation report's `cohort_shift.<tier>.auc_degradation`
+field gives the v1 baseline you're trying to refine. v1
+intentionally runs only the tier-wide check; the per-segment
+audit is a `v2-idea` candidate.
+
+---
+
+## Metric and ranking traps
+
+### 7. Value-aware ranking surprises
+
+P(convert) ranking and `P(convert) × expected_acv` ranking
+are both reasonable depending on the operational question.
+Notebook 04 §5 shows the gap on this bundle — at top-50, ACV
+capture jumps from 0.16 (P-only) to 0.40 (P × ACV). The trap
+is reaching for one metric when the operational question
+demands the other and not noticing the inversion. AUC ranks
+*everything* by P(convert); a salesperson with capacity for
+50 leads cares about revenue-weighted top-50 capture.
+
+**How to detect on any dataset.** Compute both `precision_at_k`
+and `expected_acv_capture_at_k` for the same top-K. If their
+ranking of model variants disagrees, that's a finding — at
+minimum a `pedagogy` issue, possibly `realism` if the gap is
+so large it suggests the simulator's ACV column has unrealistic
+correlation with P(convert).
+
+**Worked example.** Notebook 04 §5 produces both curves
+side-by-side; the validation report's
+`$.tiers.<tier>.per_seed[*].expected_acv_capture_at_k`
+gives the canonical numbers across seeds.
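+
+A toy comparison of the two rankings, as a sketch only. The
+metric definitions here are assumptions (ACV capture is taken
+as the share of total converted ACV that lands in the top K)
+and may differ from the validator's
+`expected_acv_capture_at_k`; the five leads are made up.
+
+```python
+def top_k(rows, score_key, k):
+    """The k highest-scoring rows under the given score."""
+    return sorted(rows, key=lambda r: r[score_key], reverse=True)[:k]
+
+
+def precision_at_k(rows, score_key, k):
+    return sum(r["converted"] for r in top_k(rows, score_key, k)) / k
+
+
+def acv_capture_at_k(rows, score_key, k):
+    """Share of total converted ACV captured in the top k (assumed definition)."""
+    total = sum(r["acv"] for r in rows if r["converted"])
+    captured = sum(r["acv"] for r in top_k(rows, score_key, k)
+                   if r["converted"])
+    return captured / total
+
+
+leads = [
+    {"p": 0.9, "acv": 20_000, "converted": 1},
+    {"p": 0.8, "acv": 18_000, "converted": 1},
+    {"p": 0.5, "acv": 120_000, "converted": 1},
+    {"p": 0.4, "acv": 100_000, "converted": 0},
+    {"p": 0.2, "acv": 30_000, "converted": 0},
+]
+for row in leads:
+    row["value"] = row["p"] * row["acv"]  # P(convert) x ACV score
+
+k = 2
+for key in ("p", "value"):
+    print(key, precision_at_k(leads, key, k), acv_capture_at_k(leads, key, k))
+```
+
+On this toy slate the P-only ranking wins on precision while
+the value-weighted ranking wins on ACV capture, which is the
+disagreement the recipe asks you to report.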
+
+### 8. Threshold-vs-rank semantics
+
+A `P(convert) >= threshold` operating point and a `top-K by
+rank` operating point are not the same thing when probabilities
+have ties. Notebook 04 §6 picks a threshold that "should"
+admit 50 leads and reads back `actually_above` to surface
+when ties at the operating point inflate the slate beyond
+capacity. On a fresh seed this can quietly admit 70+ leads
+into a 50-lead capacity plan if several ties sit at the
+chosen probability.
+
+**How to detect on any dataset.** When you set a probability
+threshold for a fixed-capacity decision, always log the
+*realised* count above threshold, not just the threshold value.
+If realised > capacity by more than a few percent, ties are
+inflating the slate and you need either a finer probability
+grid (less likely to help on a calibrated model) or a
+secondary rank score to break ties.
+
+**Worked example.** Notebook 04 §6 prints
+`capacity / threshold / actually_above / precision / recall`
+and walks through the threshold sweep for context. The
+calibration-bin output in §3 is the related receipt — a model
+with poor bin-error is more likely to have ties at common
+probabilities.
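+
+The tie effect is easy to reproduce on toy scores. This is a
+sketch, not the notebook's code: both helpers are
+hypothetical, and `actually_above` simply borrows the label
+the notebook prints.
+
+```python
+def threshold_for_capacity(scores, capacity):
+    """Score of the capacity-th ranked lead (1-based rank)."""
+    ranked = sorted(scores, reverse=True)
+    return ranked[capacity - 1]
+
+
+def realised_count(scores, threshold):
+    """How many leads a `score >= threshold` rule really admits."""
+    return sum(1 for s in scores if s >= threshold)
+
+
+# Toy scores with four leads tied at 0.7.
+scores = [0.9, 0.8, 0.7, 0.7, 0.7, 0.7, 0.4, 0.2]
+capacity = 3
+
+threshold = threshold_for_capacity(scores, capacity)
+actually_above = realised_count(scores, threshold)
+print(capacity, threshold, actually_above)
+```
+
+Logging `capacity` next to `actually_above` makes the
+overshoot visible without inspecting the score distribution.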
+
+---
+
+## Robustness and realism
+
+### 9. Calibration drift across cohorts and segments
+
+The validation report tracks `calibration_max_bin_error`
+per tier (`$.tiers.<tier>.medians.calibration_max_bin_error`):
+intro ~0.25, intermediate ~0.25, advanced ~0.52. That's a
+single number per tier on a single split; it can mask
+segment-conditional miscalibration where the model is
+well-calibrated overall but consistently over-predicts on
+small-revenue accounts and under-predicts on large ones, or
+drifts late-in-cohort vs early. Notebook 04 §3 shows the
+tier-level reliability diagram on the public bundle; the
+analogous per-segment diagram is the next stress test.
+
+**How to detect on any dataset.** Reproduce notebook 04 §3's
+binning protocol *within* each segment column you care about
+(`industry`, `region`, `employee_band`,
+`estimated_revenue_band`). Report `max_bin_error` per segment
+and the spread across segments. A segment whose max-bin-error
+is materially worse than the tier-level number is a `realism`
+finding — the world isn't producing the correlation structure
+between segment and outcome that real production data would.
+
+**Worked example.** Notebook 04 §3 covers the tier-level
+case end-to-end. The cohort-shift block in §7 is the
+chronological analogue (calibration over time, in
+expectation, via AUC degradation as a coarse summary). v1
+doesn't ship a per-segment calibration audit; it's a
+`v2-idea`.
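+
+A minimal sketch of a per-segment audit. The binning here
+(equal-width bins, largest absolute gap between mean
+prediction and observed rate) is an assumption and may not
+match notebook 04 §3's protocol; the helpers and the toy rows
+are hypothetical.
+
+```python
+from collections import defaultdict
+
+
+def max_bin_error(pairs, n_bins=5):
+    """Largest |mean predicted - observed rate| over non-empty bins."""
+    bins = defaultdict(list)
+    for p, y in pairs:
+        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
+    errors = []
+    for members in bins.values():
+        mean_p = sum(p for p, _ in members) / len(members)
+        rate = sum(y for _, y in members) / len(members)
+        errors.append(abs(mean_p - rate))
+    return max(errors)
+
+
+def max_bin_error_by_segment(rows, segment_key, n_bins=5):
+    grouped = defaultdict(list)
+    for row in rows:
+        grouped[row[segment_key]].append((row["p"], row["y"]))
+    return {seg: max_bin_error(pairs, n_bins)
+            for seg, pairs in grouped.items()}
+
+
+# Toy rows: every lead is scored 0.5; outcomes differ by segment.
+rows = (
+    [{"band": "small", "p": 0.5, "y": y} for y in (0, 0, 0, 1)]
+    + [{"band": "large", "p": 0.5, "y": y} for y in (1, 1, 1, 0)]
+)
+
+overall = max_bin_error([(r["p"], r["y"]) for r in rows])
+by_segment = max_bin_error_by_segment(rows, "band")
+print(overall, by_segment)
+```
+
+The pooled error is zero while each segment is off by 0.25:
+the over-prediction on one segment and the under-prediction
+on the other cancel in the aggregate.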
+
+---
+
+## What to do when you find one
+
+1. Reproduce the finding from a clean checkout against the
+   as-shipped bundle. Note the seed, tier, and `manifest.json`
+   `bundle_hash` (or a freshly computed file hash if your
+   build doesn't expose one).
+2. Pick the issue template that fits — leakage / contamination
+   / metric findings go in [`dataset_breakage_report.yml`](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml);
+   distributional / realism critiques go in
+   [`realism_feedback.yml`](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml).
+3. Suggest a triage label from the table at the top of this
+   guide. The maintainer applies the final label.
+4. Watch [`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md)
+   for the disposition. Accepted findings get an entry with
+   a verdict (`accepted-for-v2`, `deferred`, `wont-fix`,
+   `needs-investigation`) and a pointer to the resulting v2
+   work item.
diff --git a/docs/release/v2_decision_log.md b/docs/release/v2_decision_log.md
new file mode 100644
index 0000000..6590775
--- /dev/null
+++ b/docs/release/v2_decision_log.md
@@ -0,0 +1,38 @@
+# v2 Decision Log — `leadforge-lead-scoring-v2`
+
+This log tracks every external finding against
+`leadforge-lead-scoring-v1` and the disposition the maintainer
+took on each one. It exists so a contributor in 2027 can see
+*why* a v2 design call was made (or why a v1 quirk was kept).
+
+The log starts empty. The first real entry will be added when
+the first issue lands; the schema below is what that entry
+will fill in.
+
+## Schema
+
+Each row is one disposition. Add new rows at the bottom; never
+edit historical entries.
+
+| Field | Required | Format | Notes |
+|---|---|---|---|
+| `received_at` | yes | `YYYY-MM-DD` | Date the finding was received (issue opened / reviewer comment / direct message). Use the wall-clock date in the maintainer's timezone. |
+| `source` | yes | one of `issue:#NNN`, `pr:#NNN`, `email`, `direct` | Where the finding came in. `issue` and `pr` link via the GitHub number. |
+| `topic` | yes | one short phrase | What the finding is about — e.g. "expected_acv realism", "industry conversion rates", "cohort-by-segment drift". |
+| `severity` | yes | `low` / `medium` / `high` | Reporter's claim, sanity-checked by the maintainer. `high` is the equivalent of the breakage-report `high` severity tier. |
+| `verdict` | yes | one of `accepted-for-v2`, `deferred`, `wont-fix`, `needs-investigation` | See vocabulary below. |
+| `next_step` | yes | one sentence | What concretely happens next (or has happened). Free-form but specific — "tracked in v2 milestone as #NNN", "documented as v1 simplification in dataset card", etc. |
+| `link` | optional | URL or path | Pointer to the resulting commit, doc change, or v2 work item. Empty for `wont-fix` and `needs-investigation`. |
+
+### Verdict vocabulary
+
+| Verdict | When |
+|---|---|
+| `accepted-for-v2` | The finding is real and the fix lands in v2. There should be a linked v2 milestone work item. |
+| `deferred` | The finding is real but the fix is post-v2 (or unsized). Counts as a backlog entry, not a v2 commitment. |
+| `wont-fix` | The finding is correct but the design call is intentional. The dataset card or roadmap should already document it; if not, the entry should result in a doc update. |
+| `needs-investigation` | The finding is plausible but not yet reproduced or scoped. Stays in this state for at most one cycle; the maintainer must promote it to one of the other three verdicts before declaring v2 ready. |
+
+## Log
+
+(no entries yet — first entry lands when the first external finding is received)
diff --git a/release/README.md b/release/README.md
index cb1329c..d27b608 100644
--- a/release/README.md
+++ b/release/README.md
@@ -211,11 +211,19 @@ intended difficulty axis (intro > intermediate > advanced).
 
 ## Maintenance, adversarial framing, license
 
-We *want* the dataset to be broken. Issue templates ship under
-`.github/ISSUE_TEMPLATE/` (Phase 6); the break-me guide lands as
-`docs/release/break_me_guide.md` (PR 6.3). Once Phase 6 ships,
-`docs/release/v2_decision_log.md` will track every accepted finding
-and the design call that came from it. File issues at
+We *want* the dataset to be broken. The
+[break-me guide](../docs/release/break_me_guide.md) catalogues
+nine adversarial patterns to look for (leakage, split
+contamination, ranking inversions, calibration drift) with
+worked-example pointers back into the notebooks. Issue
+templates ship under `.github/ISSUE_TEMPLATE/`: a
+[breakage report](../.github/ISSUE_TEMPLATE/dataset_breakage_report.yml)
+form for findings on the bundle itself, and a
+[realism feedback](../.github/ISSUE_TEMPLATE/realism_feedback.yml)
+form for distributional critiques. Accepted findings are
+logged in
+[`docs/release/v2_decision_log.md`](../docs/release/v2_decision_log.md).
+File issues at
 [leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);
 PRs welcome.
 
diff --git a/release/huggingface/README.md b/release/huggingface/README.md
index 885e834..ca0ecd1 100644
--- a/release/huggingface/README.md
+++ b/release/huggingface/README.md
@@ -256,11 +256,19 @@ intended difficulty axis (intro > intermediate > advanced).
 
 ## Maintenance, adversarial framing, license
 
-We *want* the dataset to be broken. Issue templates ship under
-`.github/ISSUE_TEMPLATE/` (Phase 6); the break-me guide lands as
-`docs/release/break_me_guide.md` (PR 6.3). Once Phase 6 ships,
-`docs/release/v2_decision_log.md` will track every accepted finding
-and the design call that came from it. File issues at
+We *want* the dataset to be broken. The
+[break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues
+nine adversarial patterns to look for (leakage, split
+contamination, ranking inversions, calibration drift) with
+worked-example pointers back into the notebooks. Issue
+templates ship under `.github/ISSUE_TEMPLATE/`: a
+[breakage report](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml)
+form for findings on the bundle itself, and a
+[realism feedback](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml)
+form for distributional critiques. Accepted findings are
+logged in
+[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md).
+File issues at
 [leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);
 PRs welcome.
 
diff --git a/release/kaggle/dataset-metadata.json b/release/kaggle/dataset-metadata.json
index 2f4b9b2..6f1dab4 100644
--- a/release/kaggle/dataset-metadata.json
+++ b/release/kaggle/dataset-metadata.json
@@ -1,6 +1,6 @@
 {
   "collaborators": [],
-  "description": "# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)\n\nA relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Generated by\n[leadforge](https://github.com/leadforge-dev/leadforge), an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at `1.x`; the dataset is published under the explicit `…-v1`\ntag.\n\n## Why lead scoring matters in 2024–2026\n\nMid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising[^macro], so predicting *which* leads\nconvert within a fixed window has moved from a marketing nicety to a\nsurvival skill. This dataset teaches that skill on a relational\nsubstrate, with the realistic confusions (snapshot-window discipline,\nleakage traps, channel signal weaker than vendor blogs imply) that\nstudents will hit when they finally get hands on real CRM data.\n\n[^macro]: Macroeconomic framing summarised in\n[`docs/external_review/summaries/gemini_v2_summary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md)\n(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio\nrose materially in 2024).\n\n## What's inside\n\n```\n.\n├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier\n│   ├── manifest.json                 # provenance + file hashes\n│   ├── dataset_card.md               # auto-rendered per-bundle card\n│   ├── feature_dictionary.csv        # authoritative column spec\n│   ├── lead_scoring.csv              # flat convenience CSV (all splits)\n│   ├── tables/*.parquet              # 7 snapshot-safe relational tables\n│   └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── dataset-metadata.json             # Kaggle dataset metadata\n├── dataset-cover-image.png           # Kaggle cover image\n├── README.md 
                        # Kaggle package README\n└── LICENSE\n```\n\n`student_public` bundles ship the snapshot-safe relational view;\n`research_instructor` companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder `metadata/`. The full layout is documented in each bundle's\n`manifest.json`.\n\n## Quick start\n\n```python\n# Flat CSV\ndf = pd.read_csv(\"intermediate/lead_scoring.csv\")\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/train.parquet\")\ntest  = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/test.parquet\")\n\n# Relational tables (feature engineering — example)\nleads   = pd.read_parquet(\"intermediate/tables/leads.parquet\")\ntouches = pd.read_parquet(\"intermediate/tables/touches.parquet\")\nmy_touch_count = (\n    touches.groupby(\"lead_id\").size().rename(\"my_touch_count\").reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=\"lead_id\", how=\"left\")\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n#                    --mode student_public --difficulty intermediate --out my_bundle\n```\n\nThe label `converted_within_90_days` resolves over a 90-day window;\nengagement features (`touch_count`, `session_count`, etc.) are\ncomputed strictly over events on days `[0, 30]`. The deliberate\nexception is `total_touches_all`, the leakage trap — flagged\n`leakage_risk=True` in `feature_dictionary.csv`. 
Drop it from your\nfeature set unless you're demonstrating leakage detection.\n\n## Dataset summary\n\n| | Intro | Intermediate | Advanced |\n|---|---|---|---|\n| Leads | 5,000 | 5,000 | 5,000 |\n| Accounts | 1,500 | 1,500 | 1,500 |\n| Contacts | 4,200 | 4,200 | 4,200 |\n| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |\n| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |\n| Conversion rate (recipe band) | 24–61% | 12–31% | 4–12% |\n| Conversion rate (median, seeds 42–46) | 42.67% | 21.60% | 8.40% |\n| Signal strength | 0.90 | 0.70 | 0.50 |\n| Noise scale | 0.10 | 0.30 | 0.55 |\n| Missing rate | 2% | 8% | 18% |\n\n\\* `student_public` / `research_instructor`. Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping.\n\n## The scenario\n\n**Veridian Technologies** is a fictional Series B startup (Austin, US)\nselling **Veridian Procure**, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). **Task:** predict whether\na lead converts (`closed_won`) within 90 days. ACV bands are\n$18k–$120k. See\n[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)\nfor the full DGP, and the deeper \"what's modelled / approximate / not\nmodelled\" breakdown that this README only summarises.\n\n## Public vs instructor: what's redacted\n\nFiltering happens **during rendering**, not during simulation. 
The\nredaction contract is single-sourced in\n[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.\n\n| Source-of-truth constant | Public bundle treatment |\n|---|---|\n| `BANNED_LEAD_COLUMNS = (\"converted_within_90_days\", \"conversion_timestamp\")` | Dropped from `tables/leads.parquet` |\n| `BANNED_OPP_COLUMNS = (\"close_outcome\", \"closed_at\")` | Dropped from `tables/opportunities.parquet` |\n| `BANNED_TABLES = (\"customers\", \"subscriptions\")` | Omitted from public bundles |\n| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |\n| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |\n| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |\n\nEach bundle's `manifest.json` records `relational_snapshot_safe`,\n`redacted_columns`, and `snapshot_day`, so the bundle is\nself-describing.\n\n## Calibration\n\nEvery realism / calibration / difficulty claim in this README is\nbacked by\n[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),\nregenerated by\n[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)\nwith bands declared in\n[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).\nHeadline cross-seed medians (seeds 42–46):\n\n| Tier | LR AUC | AP | P@100 | Brier |\n|---|---|---|---|---|\n| intro | 0.879 | 0.761 | 0.80 | 0.130 |\n| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |\n| advanced | 0.886 | 0.351 | 0.34 | 0.061 |\n\nAP, P@100, conversion-rate, 
and lift orderings hold across the\nintended difficulty axis (intro > intermediate > advanced).\n\n## Intended uses\n\n- Teaching baseline lead-scoring on a flat snapshot.\n- Teaching relational feature engineering against snapshot-safe tables.\n- Teaching leakage detection (the `total_touches_all` trap is\n  designed to be discoverable).\n- Teaching calibration, lift, P@K, value-aware ranking\n  (`expected_acv × P(convert)`), and cohort-shift evaluation.\n- Comparing model families under a controlled DGP.\n\n## Out-of-scope uses\n\n- **Production lead scoring.** The company, product, and customers are\n  fictional.\n- **Vendor benchmarking / paper baselines.** Difficulty tiers are\n  calibrated for pedagogy, not cross-paper comparability.\n- **Causal-inference research that requires recovery of the true DGP.**\n  The instructor companion exposes the hidden graph for teaching, not\n  designed counterfactuals.\n- **Demographic / fairness research.** v1 does not model protected\n  attributes.\n\n## Known limitations\n\n- **Difficulty signal on raw AUC is flat.** LR AUC is ~0.88 across\n  every tier. Difficulty is visible in AP, P@K, Brier, and value\n  capture. Treat AUC as a sanity check, not a difficulty signal.\n- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta\n  is slightly negative in every tier (intro −0.0045, intermediate\n  −0.0072, advanced −0.0133); v1's snapshot is dominated by linear\n  features. v2 will inject non-linear interactions in the simulator.\n- **Channel signal is weak.** Per\n  [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),\n  out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across\n  all tiers and the per-channel rate spread is ≤0.05. 
The simulator\n  does not encode channel-conditional probabilities; channel-conditional\n  encoding is post-v1 work.\n- **Cohort-shift degradation is small.** v1 has no time-of-year drift\n  baked in; the cohort-shift gate (G6.4) is informational and will\n  bite in v2.\n\n## Composition\n\n- **Entities.** Accounts, contacts, leads, touches, sessions,\n  sales_activities, opportunities (public); plus customers and\n  subscriptions (instructor only). Per-row counts per bundle live in\n  `manifest.json`.\n- **Features.** 32 public columns grouped by analytical role in\n  [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);\n  the per-bundle `feature_dictionary.csv` is the authoritative\n  machine-readable spec.\n- **Label.** `converted_within_90_days` (boolean), event-derived from\n  the simulator. Never sampled directly.\n- **Splits.** 70/15/15 train/valid/test, deterministic given seed;\n  recorded in `tasks/converted_within_90_days/task_manifest.json`.\n- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package\n  version stamped in `manifest.json`.\n\n## Maintenance, adversarial framing, license\n\nWe *want* the dataset to be broken. Issue templates ship under\n`.github/ISSUE_TEMPLATE/` (Phase 6); the break-me guide lands as\n`docs/release/break_me_guide.md` (PR 6.3). Once Phase 6 ships,\n`docs/release/v2_decision_log.md` will track every accepted finding\nand the design call that came from it. File issues at\n[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);\nPRs welcome.\n\n| Field | Value |\n|---|---|\n| Generator | leadforge `1.0.0+` |\n| Recipe | `b2b_saas_procurement_v1` |\n| Canonical seed | 42 (cross-seed sweep: 42–46) |\n| Bundle schema version | 5 |\n| Format | Parquet (canonical) + CSV (convenience) |\n| License | MIT — see [LICENSE](LICENSE) |\n\nVerify integrity with `leadforge validate <bundle_dir>`; every file\nis hashed in `manifest.json`.\n",
+  "description": "# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)\n\nA relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Generated by\n[leadforge](https://github.com/leadforge-dev/leadforge), an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at `1.x`; the dataset is published under the explicit `…-v1`\ntag.\n\n## Why lead scoring matters in 2024–2026\n\nMid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising[^macro], so predicting *which* leads\nconvert within a fixed window has moved from a marketing nicety to a\nsurvival skill. This dataset teaches that skill on a relational\nsubstrate, with the realistic confusions (snapshot-window discipline,\nleakage traps, channel signal weaker than vendor blogs imply) that\nstudents will hit when they finally get hands on real CRM data.\n\n[^macro]: Macroeconomic framing summarised in\n[`docs/external_review/summaries/gemini_v2_summary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md)\n(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio\nrose materially in 2024).\n\n## What's inside\n\n```\n.\n├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier\n│   ├── manifest.json                 # provenance + file hashes\n│   ├── dataset_card.md               # auto-rendered per-bundle card\n│   ├── feature_dictionary.csv        # authoritative column spec\n│   ├── lead_scoring.csv              # flat convenience CSV (all splits)\n│   ├── tables/*.parquet              # 7 snapshot-safe relational tables\n│   └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── dataset-metadata.json             # Kaggle dataset metadata\n├── dataset-cover-image.png           # Kaggle cover image\n├── README.md 
                        # Kaggle package README\n└── LICENSE\n```\n\n`student_public` bundles ship the snapshot-safe relational view;\n`research_instructor` companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder `metadata/`. The full layout is documented in each bundle's\n`manifest.json`.\n\n## Quick start\n\n```python\n# Flat CSV\ndf = pd.read_csv(\"intermediate/lead_scoring.csv\")\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/train.parquet\")\ntest  = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/test.parquet\")\n\n# Relational tables (feature engineering — example)\nleads   = pd.read_parquet(\"intermediate/tables/leads.parquet\")\ntouches = pd.read_parquet(\"intermediate/tables/touches.parquet\")\nmy_touch_count = (\n    touches.groupby(\"lead_id\").size().rename(\"my_touch_count\").reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=\"lead_id\", how=\"left\")\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n#                    --mode student_public --difficulty intermediate --out my_bundle\n```\n\nThe label `converted_within_90_days` resolves over a 90-day window;\nengagement features (`touch_count`, `session_count`, etc.) are\ncomputed strictly over events on days `[0, 30]`. The deliberate\nexception is `total_touches_all`, the leakage trap — flagged\n`leakage_risk=True` in `feature_dictionary.csv`. 
Drop it from your\nfeature set unless you're demonstrating leakage detection.\n\n## Dataset summary\n\n| | Intro | Intermediate | Advanced |\n|---|---|---|---|\n| Leads | 5,000 | 5,000 | 5,000 |\n| Accounts | 1,500 | 1,500 | 1,500 |\n| Contacts | 4,200 | 4,200 | 4,200 |\n| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |\n| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |\n| Conversion rate (recipe band) | 24–61% | 12–31% | 4–12% |\n| Conversion rate (median, seeds 42–46) | 42.67% | 21.60% | 8.40% |\n| Signal strength | 0.90 | 0.70 | 0.50 |\n| Noise scale | 0.10 | 0.30 | 0.55 |\n| Missing rate | 2% | 8% | 18% |\n\n\\* `student_public` / `research_instructor`. Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping.\n\n## The scenario\n\n**Veridian Technologies** is a fictional Series B startup (Austin, US)\nselling **Veridian Procure**, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). **Task:** predict whether\na lead converts (`closed_won`) within 90 days. ACV bands are\n$18k–$120k. See\n[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)\nfor the full DGP, and the deeper \"what's modelled / approximate / not\nmodelled\" breakdown that this README only summarises.\n\n## Public vs instructor: what's redacted\n\nFiltering happens **during rendering**, not during simulation. 
The\nredaction contract is single-sourced in\n[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.\n\n| Source-of-truth constant | Public bundle treatment |\n|---|---|\n| `BANNED_LEAD_COLUMNS = (\"converted_within_90_days\", \"conversion_timestamp\")` | Dropped from `tables/leads.parquet` |\n| `BANNED_OPP_COLUMNS = (\"close_outcome\", \"closed_at\")` | Dropped from `tables/opportunities.parquet` |\n| `BANNED_TABLES = (\"customers\", \"subscriptions\")` | Omitted from public bundles |\n| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |\n| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |\n| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |\n\nEach bundle's `manifest.json` records `relational_snapshot_safe`,\n`redacted_columns`, and `snapshot_day`, so the bundle is\nself-describing.\n\n## Calibration\n\nEvery realism / calibration / difficulty claim in this README is\nbacked by\n[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),\nregenerated by\n[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)\nwith bands declared in\n[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).\nHeadline cross-seed medians (seeds 42–46):\n\n| Tier | LR AUC | AP | P@100 | Brier |\n|---|---|---|---|---|\n| intro | 0.879 | 0.761 | 0.80 | 0.130 |\n| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |\n| advanced | 0.886 | 0.351 | 0.34 | 0.061 |\n\nAP, P@100, conversion-rate, 
and lift orderings hold across the\nintended difficulty axis (intro > intermediate > advanced).\n\n## Intended uses\n\n- Teaching baseline lead-scoring on a flat snapshot.\n- Teaching relational feature engineering against snapshot-safe tables.\n- Teaching leakage detection (the `total_touches_all` trap is\n  designed to be discoverable).\n- Teaching calibration, lift, P@K, value-aware ranking\n  (`expected_acv × P(convert)`), and cohort-shift evaluation.\n- Comparing model families under a controlled DGP.\n\n## Out-of-scope uses\n\n- **Production lead scoring.** The company, product, and customers are\n  fictional.\n- **Vendor benchmarking / paper baselines.** Difficulty tiers are\n  calibrated for pedagogy, not cross-paper comparability.\n- **Causal-inference research that requires recovery of the true DGP.**\n  The instructor companion exposes the hidden graph for teaching, not\n  designed counterfactuals.\n- **Demographic / fairness research.** v1 does not model protected\n  attributes.\n\n## Known limitations\n\n- **Difficulty signal on raw AUC is flat.** LR AUC is ~0.88 across\n  every tier. Difficulty is visible in AP, P@K, Brier, and value\n  capture. Treat AUC as a sanity check, not a difficulty signal.\n- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta\n  is slightly negative in every tier (intro −0.0045, intermediate\n  −0.0072, advanced −0.0133); v1's snapshot is dominated by linear\n  features. v2 will inject non-linear interactions in the simulator.\n- **Channel signal is weak.** Per\n  [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),\n  out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across\n  all tiers and the per-channel rate spread is ≤0.05. 
The simulator\n  does not encode channel-conditional probabilities; channel-conditional\n  encoding is post-v1 work.\n- **Cohort-shift degradation is small.** v1 has no time-of-year drift\n  baked in; the cohort-shift gate (G6.4) is informational and will\n  bite in v2.\n\n## Composition\n\n- **Entities.** Accounts, contacts, leads, touches, sessions,\n  sales_activities, opportunities (public); plus customers and\n  subscriptions (instructor only). Per-row counts per bundle live in\n  `manifest.json`.\n- **Features.** 32 public columns grouped by analytical role in\n  [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);\n  the per-bundle `feature_dictionary.csv` is the authoritative\n  machine-readable spec.\n- **Label.** `converted_within_90_days` (boolean), event-derived from\n  the simulator. Never sampled directly.\n- **Splits.** 70/15/15 train/valid/test, deterministic given seed;\n  recorded in `tasks/converted_within_90_days/task_manifest.json`.\n- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package\n  version stamped in `manifest.json`.\n\n## Maintenance, adversarial framing, license\n\nWe *want* the dataset to be broken. The\n[break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues\nnine adversarial patterns to look for (leakage, split\ncontamination, ranking inversions, calibration drift) with\nworked-example pointers back into the notebooks. Issue\ntemplates ship under `.github/ISSUE_TEMPLATE/`: a\n[breakage report](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml)\nform for findings on the bundle itself, and a\n[realism feedback](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml)\nform for distributional critiques. 
Accepted findings are\nlogged in\n[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md).\nFile issues at\n[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);\nPRs welcome.\n\n| Field | Value |\n|---|---|\n| Generator | leadforge `1.0.0+` |\n| Recipe | `b2b_saas_procurement_v1` |\n| Canonical seed | 42 (cross-seed sweep: 42–46) |\n| Bundle schema version | 5 |\n| Format | Parquet (canonical) + CSV (convenience) |\n| License | MIT — see [LICENSE](LICENSE) |\n\nVerify integrity with `leadforge validate <bundle_dir>`; every file\nis hashed in `manifest.json`.\n",
   "expectedUpdateFrequency": "never",
   "id": "leadforge/leadforge-lead-scoring-v1",
   "image": "dataset-cover-image.png",
diff --git a/release/notebooks/03_leakage_and_time_windows.ipynb b/release/notebooks/03_leakage_and_time_windows.ipynb
index 2130369..3f81625 100644
--- a/release/notebooks/03_leakage_and_time_windows.ipynb
+++ b/release/notebooks/03_leakage_and_time_windows.ipynb
@@ -398,7 +398,7 @@
    "cell_type": "markdown",
    "id": "cell_018",
    "metadata": {},
-   "source": "## 7. A detection recipe you can run on any dataset\n\nThe trap was easy to spot here because the dataset\n*advertises* it. On a third-party dataset you don't get\nthat courtesy. The same recipe still works:\n\n1. **Read any feature dictionary you have.** Any column\n   whose description references a window longer than the\n   prediction horizon is suspicious. Even when no\n   dictionary ships, an obvious naming smell (`*_total`,\n   `*_all`, `*_lifetime`) on a 30-day-snapshot dataset is a\n   flag.\n2. **Probe the standalone AUC** *and* **the contribution to\n   a tree model.** A standalone probe alone undersells\n   tree-friendly leakage (sections 4 and 5 demonstrate why\n   on this dataset). Train a model with the column, train\n   another without, and compare. The ablation captures\n   interactions the standalone probe can't.\n3. **Inspect the time window.** Cross-check the suspect\n   column against any time-stamped event tables. If the\n   column's value can only be explained by events past the\n   snapshot anchor, you've found a trap. Section 3 makes\n   this concrete here — the same technique generalises\n   anywhere there's an event table to corroborate.\n\nA walkthrough of additional detection patterns\n(column-name heuristics, isolation-via-residuals,\ntarget-encoding leakage on test) lives in\n`docs/release/break_me_guide.md` (coming in PR 6.3) — pair\nit with this notebook for a more complete playbook.\n\n## Next\n\n- **Notebook 04** — value-aware ranking\n  (`expected_acv` × P(convert)), calibration plots,\n  threshold selection for top-K capacity, and a\n  cohort-shift / bootstrap robustness harness."
+   "source": "## 7. A detection recipe you can run on any dataset\n\nThe trap was easy to spot here because the dataset\n*advertises* it. On a third-party dataset you don't get\nthat courtesy. The same recipe still works:\n\n1. **Read any feature dictionary you have.** Any column\n   whose description references a window longer than the\n   prediction horizon is suspicious. Even when no\n   dictionary ships, an obvious naming smell (`*_total`,\n   `*_all`, `*_lifetime`) on a 30-day-snapshot dataset is a\n   flag.\n2. **Probe the standalone AUC** *and* **the contribution to\n   a tree model.** A standalone probe alone undersells\n   tree-friendly leakage (sections 4 and 5 demonstrate why\n   on this dataset). Train a model with the column, train\n   another without, and compare. The ablation captures\n   interactions the standalone probe can't.\n3. **Inspect the time window.** Cross-check the suspect\n   column against any time-stamped event tables. If the\n   column's value can only be explained by events past the\n   snapshot anchor, you've found a trap. Section 3 makes\n   this concrete here — the same technique generalises\n   anywhere there's an event table to corroborate.\n\nA walkthrough of additional detection patterns\n(column-name heuristics, target-encoding leakage on\ntest, train-test contamination via account_id,\ncohort-by-segment evaluation) lives in\n[`docs/release/break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) —\npair it with this notebook for a more complete\nplaybook.\n\n## Next\n\n- **Notebook 04** — value-aware ranking\n  (`expected_acv` × P(convert)), calibration plots,\n  threshold selection for top-K capacity, and a\n  cohort-shift / bootstrap robustness harness."
   }
  ],
  "metadata": {
diff --git a/release/notebooks/04_lift_calibration_value_ranking.ipynb b/release/notebooks/04_lift_calibration_value_ranking.ipynb
index 25933be..43c1646 100644
--- a/release/notebooks/04_lift_calibration_value_ranking.ipynb
+++ b/release/notebooks/04_lift_calibration_value_ranking.ipynb
@@ -666,7 +666,7 @@
    "cell_type": "markdown",
    "id": "cell_019",
    "metadata": {},
-   "source": "## 10. Summary\n\n* The LR baseline is well-calibrated (max bin error ≈ 0.13\n  on the trap-dropped headline panel, vs ~0.19 on the\n  with-trap panel the validation report tracks) and lifts\n  the top decile to ~2.75× the base rate.\n* Value-aware ranking (P × ACV) captures more revenue per\n  top-K slot than P-only ranking — the gap depends on K\n  but is positive across all sizes we tested.\n* Cohort shift is **negative** on the intermediate tier\n  (the late cohort is *easier*, not harder); the report\n  documents this, and the notebook reproduces it. The\n  intro and advanced tiers show small positive\n  degradations.\n* Bootstrap on the existing test split gives a within-\n  bundle confidence band that's tighter than the cross-seed\n  spread the validation report computes — useful for \"how\n  confident is this single AUC\" questions, not for \"how\n  much does the bundle move across seeds.\"\n\n## Where to go next\n\n1. Try cohort-shifted training in production: refit weekly\n   on the trailing 60-day window, score the next 7 days.\n2. If you have real ACV data, swap the `expected_acv`\n   heuristic for it and recompute section 5 — the revenue\n   capture story should sharpen.\n3. The break-me playbook in `docs/release/break_me_guide.md`\n   (coming in PR 6.3) catalogues additional stress tests\n   (target-encoding leakage, train-test contamination,\n   cohort-by-segment) and how to detect each from a\n   single bundle."
+   "source": "## 10. Summary\n\n* The LR baseline is well-calibrated (max bin error ≈ 0.13\n  on the trap-dropped headline panel, vs ~0.19 on the\n  with-trap panel the validation report tracks) and lifts\n  the top decile to ~2.75× the base rate.\n* Value-aware ranking (P × ACV) captures more revenue per\n  top-K slot than P-only ranking — the gap depends on K\n  but is positive across all sizes we tested.\n* Cohort shift is **negative** on the intermediate tier\n  (the late cohort is *easier*, not harder); the report\n  documents this, and the notebook reproduces it. The\n  intro and advanced tiers show small positive\n  degradations.\n* Bootstrap on the existing test split gives a within-\n  bundle confidence band that's tighter than the cross-seed\n  spread the validation report computes — useful for \"how\n  confident is this single AUC\" questions, not for \"how\n  much does the bundle move across seeds.\"\n\n## Where to go next\n\n1. Try cohort-shifted training in production: refit weekly\n   on the trailing 60-day window, score the next 7 days.\n2. If you have real ACV data, swap the `expected_acv`\n   heuristic for it and recompute section 5 — the revenue\n   capture story should sharpen.\n3. The break-me playbook in\n   [`docs/release/break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md)\n   catalogues additional stress tests (target-encoding\n   leakage, train-test contamination, cohort-by-segment)\n   and how to detect each from a single bundle."
   }
  ],
  "metadata": {
diff --git a/scripts/build_release_notebook_03.py b/scripts/build_release_notebook_03.py
index e811af6..2593876 100644
--- a/scripts/build_release_notebook_03.py
+++ b/scripts/build_release_notebook_03.py
@@ -542,10 +542,12 @@ def fit_score(cols: list[str], *, model: str) -> np.ndarray:
                anywhere there's an event table to corroborate.
 
             A walkthrough of additional detection patterns
-            (column-name heuristics, isolation-via-residuals,
-            target-encoding leakage on test) lives in
-            `docs/release/break_me_guide.md` (coming in PR 6.3) — pair
-            it with this notebook for a more complete playbook.
+            (column-name heuristics, target-encoding leakage on
+            test, train-test contamination via account_id,
+            cohort-by-segment evaluation) lives in
+            [`docs/release/break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) —
+            pair it with this notebook for a more complete
+            playbook.
 
             ## Next
 
diff --git a/scripts/build_release_notebook_04.py b/scripts/build_release_notebook_04.py
index f57bbf8..42cfd4c 100644
--- a/scripts/build_release_notebook_04.py
+++ b/scripts/build_release_notebook_04.py
@@ -843,11 +843,11 @@ def _summary(arr: np.ndarray, name: str) -> None:
             2. If you have real ACV data, swap the `expected_acv`
                heuristic for it and recompute section 5 — the revenue
                capture story should sharpen.
-            3. The break-me playbook in `docs/release/break_me_guide.md`
-               (coming in PR 6.3) catalogues additional stress tests
-               (target-encoding leakage, train-test contamination,
-               cohort-by-segment) and how to detect each from a
-               single bundle.
+            3. The break-me playbook in
+               [`docs/release/break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md)
+               catalogues additional stress tests (target-encoding
+               leakage, train-test contamination, cohort-by-segment)
+               and how to detect each from a single bundle.
             """
         ),
     ]

From c17347873c92954f5ad2640ed44e942413360cfc Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Fri, 8 May 2026 00:11:04 +0300
Subject: [PATCH 2/4] PR 6.3 self-review: factual fixes + label creation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hostile-reviewer pass on the PR 6.3 diff caught seven issues. Fixes:

1. **post_snapshot_aggregates baseline misidentified.** The break-me
   guide patterns 2/3 called the validator's baseline "a fitted LR";
   the actual implementation in
   leadforge/validation/release_quality.py is HistGBM. Fixed, with a
   link to the source so the model identity is verifiable.

2. **manifest.json `bundle_hash` is fictitious.** The real manifest
   has per-file sha256s nested under `tasks.<task>.test_sha256` and
   `tables.<name>.sha256`, no top-level `bundle_hash`. Fixed both
   the break-me guide ("What to do when you find one" §1) and the
   breakage issue template's bundle-hash field; placeholder now
   carries a real hash from the as-shipped intermediate bundle.

3. **Fabricated "70+ leads" in pattern 8.** The slate-inflation
   number was made up: on the as-shipped intermediate bundle the
   `actually_above` readout matches capacity. Reframed the readout
   as a defensive instrument, with a more honest LR-vs-GBM
   tie-rate observation.

4. **Misleading "instructor" tier dropdown.** Only
   `intermediate_instructor` ships in v1; the unqualified label
   suggested multiple instructor companions. Renamed to
   `intermediate_instructor`.

5. **Underspecified "~0.55 AUC".** Pinned to "intermediate, median
   across seeds 42–46" with the cross-tier range (0.52–0.61) so
   readers know the variance.

6. **Speculative pattern-9 calibration claim.** "Consistently
   over-predicts on small-revenue accounts" was hypothetical, not
   observed. Reframed as "in principle… whether v1 actually
   exhibits such drift is an open question".

7. **Issue templates referenced labels that didn't exist.** Created
   `dataset: leadforge-lead-scoring-v1`, `needs-triage`, `realism`
   so the templates' `labels:` fields actually fire on issue
   submission. (Triage labels in the dropdowns — critical-leakage
   etc. — stay as free-text choices the maintainer applies
   post-triage.)
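
For reviewers who want to check the layout described in fix 2, a
minimal sketch follows. It assumes only what the fix states (a
`tasks.<task>.test_sha256` entry in `manifest.json`, stored as a bare
hex digest like the template placeholder). The helper names
`file_sha256` and `check_test_split` are illustrative and not part of
leadforge.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hex sha256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_test_split(
    bundle_dir: Path, task: str = "converted_within_90_days"
) -> bool:
    """Compare the recorded test-split hash with a freshly computed one.

    Assumption: the hash sits under tasks.<task>.test_sha256 and the
    split lives at tasks/<task>/test.parquet inside the bundle.
    """
    manifest = json.loads((bundle_dir / "manifest.json").read_text())
    recorded = manifest["tasks"][task]["test_sha256"]
    actual = file_sha256(bundle_dir / "tasks" / task / "test.parquet")
    return recorded == actual
```

Calling `check_test_split(Path("intermediate"))` from the bundle root
should return `True` on an unmodified bundle if the assumptions above
hold.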

Net: 1260/1260 tests still pass; ruff + mypy clean; YAML still
validates; no new bundle regeneration. BUNDLE_SCHEMA_VERSION
unchanged at 5.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .../dataset_breakage_report.yml               | 14 ++++--
 docs/release/break_me_guide.md                | 47 +++++++++++--------
 2 files changed, 38 insertions(+), 23 deletions(-)

diff --git a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml
index 0923386..611d0d0 100644
--- a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml
+++ b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml
@@ -14,12 +14,12 @@ body:
     id: tier
     attributes:
       label: Tier
-      description: Which bundle tier did you break? Pick `instructor` if the finding is on the `intermediate_instructor` companion.
+      description: Which bundle tier did you break? `intermediate_instructor` is the only instructor companion shipped in v1.
       options:
         - intro
         - intermediate
         - advanced
-        - instructor
+        - intermediate_instructor
         - multiple
     validations:
       required: true
@@ -38,8 +38,14 @@ body:
     id: bundle_hash
     attributes:
       label: Bundle hash
-      description: Paste a hash that identifies the exact bundle. Either `manifest.json` → `bundle_hash` (if your build records one), or a `sha256sum tasks/converted_within_90_days/test.parquet` from your local checkout. This makes the report unambiguous if we regenerate.
-      placeholder: "sha256:abc123… (or manifest.json bundle_hash)"
+      description: |
+        Paste a hash that identifies the exact bundle. Two equivalent forms:
+
+        - From `manifest.json` → `tasks.converted_within_90_days.test_sha256` (the test-split sha256 the bundle records; pinned at generation time).
+        - From a local `sha256sum tasks/converted_within_90_days/test.parquet`.
+
+        This makes the report unambiguous if we regenerate.
+      placeholder: "d428c07decc2b1fdf8b5f56a1a63c65799897f1e22b61afd9f5d517f74593f09"
     validations:
       required: true
 
diff --git a/docs/release/break_me_guide.md b/docs/release/break_me_guide.md
index 3f8e844..1b326ea 100644
--- a/docs/release/break_me_guide.md
+++ b/docs/release/break_me_guide.md
@@ -94,12 +94,16 @@ in three lines of pandas; the column it surfaces is
 A feature can score ~0.5 AUC as a single-column ranker and
 still hand a tree model material lift once interactions with
 other columns are available. The validation report's
-`post_snapshot_aggregates` baseline (a fitted LR on the trap
-column alone) gives ~0.55 AUC — the trap "looks" innocuous on
-a standalone audit. Notebook 03 §5 then runs a full panel
-ablation and HistGBM extracts +0.032 AUC; LR with the same
-preprocessing only extracts +0.009 because it can't represent
-the relevant interaction.
+`post_snapshot_aggregates` baseline (HistGBM on the trap
+column alone, see
+[`leadforge/validation/release_quality.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/release_quality.py))
+gives ~0.55 AUC on intermediate (median across seeds 42–46;
+0.52–0.61 across all tier × seed pairs) — the trap "looks"
+innocuous even when scored by a tree model on its own.
+Notebook 03 §5 then runs a full panel ablation and HistGBM
+extracts +0.032 AUC; LR with the same preprocessing only
+extracts +0.009 because it can't represent the relevant
+interaction.
 
 **How to detect on any dataset.** Don't audit leakage with
 single-feature AUC. For every column you flagged in pattern 1,
@@ -266,11 +270,13 @@ gives the canonical numbers across seeds.
 A `precision >= threshold` operating point and a `top-K by
 rank` operating point are not the same thing when probabilities
 have ties. Notebook 04 §6 picks a threshold that "should"
-admit 50 leads and reads back `actually_above` to surface
-when ties at the operating point inflate the slate beyond
-capacity. On a fresh seed this can quietly admit 70+ leads
-into a 50-lead capacity plan if several ties sit at the
-chosen probability.
+admit 50 leads and reads back `actually_above` as a defensive
+instrument — on the as-shipped intermediate bundle the realised
+count happens to match capacity, but the readout exists so a
+seed where ties cluster at the operating probability fails
+loud rather than silently inflating the slate. On a calibrated
+LR with continuous scores, ties are rare; on a coarse-grained
+GBM probability output they're routine.
 
 **How to detect on any dataset.** When you set a probability
 threshold for a fixed-capacity decision, always log the
@@ -296,11 +302,10 @@ probabilities.
 The validation report tracks `calibration_max_bin_error`
 per tier (`$.tiers.<tier>.medians.calibration_max_bin_error`)
 — intermediate ~0.25, intro ~0.25, advanced ~0.52. That's a
-single number per tier on a single split; it can mask
-segment-conditional miscalibration where the model is
-well-calibrated overall but consistently over-predicts on
-small-revenue accounts and under-predicts on large ones, or
-drifts late-in-cohort vs early. Notebook 04 §3 shows the
+single number per tier on a single split; in principle it can
+mask segment-conditional miscalibration. Whether v1 actually
+exhibits such drift is an open question — the per-segment
+audit is the way to find out. Notebook 04 §3 shows the
 tier-level reliability diagram on the public bundle; the
 analogous per-segment diagram is the next stress test.
 
@@ -325,9 +330,13 @@ doesn't ship a per-segment calibration audit; it's a
 ## What to do when you find one
 
 1. Reproduce the finding from a clean checkout against the
-   as-shipped bundle. Note the seed, tier, and `manifest.json`
-   `bundle_hash` (or a freshly computed file hash if your
-   build doesn't expose one).
+   as-shipped bundle. Note the seed, tier, and the test-split
+   sha256 from `manifest.json` — under
+   `tasks.converted_within_90_days.test_sha256`. That single
+   hash uniquely identifies the bundle the finding was
+   reproduced on; the manifest also carries per-table hashes
+   under `tables.<name>.sha256` if a table-specific hash is
+   the right anchor for the finding.
 2. Pick the issue template that fits — leakage / contamination
    / metric findings go in [`dataset_breakage_report.yml`](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml);
    distributional / realism critiques go in

From 943b2f79a6d05a6c812d7065b12059724b40bb8e Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Fri, 8 May 2026 00:27:48 +0300
Subject: [PATCH 3/4] PR 6.3 self-review round 2 + planning: mock-page preview
 PR
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hostile-reviewer pass on round 1 caught 8 more issues; one of them
(SOURCE_TREE_BLOCK staleness) I had already deferred once, which I'm
calling a cop-out and fixing now. Also folds in a planning update for
the v1 release roadmap: a new PR, inserted before the publish PR, lets
the maintainer render local mock-page previews of the Kaggle and
HuggingFace dataset pages from the *exact* upload artefacts before any
platform upload. It acts as a staging gate, so styling, link, embed,
and YAML-rendering issues are caught before they hit the live page,
where rollback is expensive.

### Round-2 fixes (8)

1. **Pattern 6 metric mismatch.** Cohort shift uses HistGBM (release_quality.py:463/514), so the cross-seed band to compare against is `gbm_auc.spread`, not `lr_auc.spread`. Fixed.

2. **Pattern 5 numerically anchored.** "Two leads in the same account can land in train and test" was vague; the empirical reality is **518 of 557 test accounts (93 %) appear in train** — and the same numbers hold across all three tiers because the splitter is `lead_id`-keyed. Added a copy-pasteable pandas snippet so reporters can verify in 4 lines.

3. **Pattern 4 self-contradiction.** Original text said "we deliberately don't show that" while showing the leakage one-liner inline. Reworded so the contrast is clear: notebook 02 doesn't show the leakage variant; the guide does, so reviewers recognise it in code.

4. **Cross-platform hash command.** Issue template said `sha256sum tasks/.../test.parquet` — that's GNU coreutils, not present on macOS. Added `shasum -a 256` (macOS) and a portable Python one-liner; led with "easiest source: copy from manifest.json".

5. **Relative links within docs/.** Round 1 used absolute GitHub URLs for every in-repo link in the break-me guide, breaking local-clone reading (clicks went to GitHub web instead of opening the local file). Switched to relative paths within the docs tree; kept absolute URLs only where they're load-bearing (notebook forward-pointers ship to Kaggle/HF without `docs/`; issue-template Markdown bodies render via GitHub Issues; README's `](../foo)` gets rewritten by `_release_common.py` for Kaggle/HF).

6. **`expected_acv_capture_at_k` JSON path lands on a dict.** `$.tiers.<tier>.per_seed[*].expected_acv_capture_at_k` resolves to `{"50": …, "100": …}` keyed by string K. Pinned the path to `…expected_acv_capture_at_k.50` so a reader following the citation hits a scalar.

7. **Pattern 8 GBM-tie speculation.** "On a coarse-grained GBM probability output ties are routine" was over-stated — HistGBM's `predict_proba` is continuous via leaf-score sums. Dropped the LR-vs-GBM theorising; kept the empirical observation that on the as-shipped intermediate bundle the realised count matches capacity.

8. **SOURCE_TREE_BLOCK staleness.** Both `release/README.md` and `_release_common.py`'s `SOURCE_TREE_BLOCK` constant listed `notebooks/01_baseline_lead_scoring.ipynb` as the only notebook; four ship now. Updated to a one-liner that names all four. Audit-sync test passes.
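
The account-overlap count in fix 2 boils down to a set intersection. The sketch below is not the snippet added to the guide; it is a stdlib-only illustration of the same idea, and the function name `account_overlap` is hypothetical.

```python
from collections.abc import Hashable, Iterable


def account_overlap(
    train_ids: Iterable[Hashable], test_ids: Iterable[Hashable]
) -> tuple[int, int, float]:
    """Return (shared, test_total, share) over distinct account ids.

    shared     = distinct test accounts that also appear in train
    test_total = distinct accounts in test
    share      = shared / test_total (0.0 when test is empty)
    """
    train_set = set(train_ids)
    test_set = set(test_ids)
    shared = len(test_set & train_set)
    share = shared / len(test_set) if test_set else 0.0
    return shared, len(test_set), share
```

With the parquet splits loaded into pandas, a call such as `account_overlap(train["account_id"].dropna(), test["account_id"].dropna())` would apply, assuming the task splits carry an `account_id` column. A share close to 1.0 is what a `lead_id`-keyed splitter would be expected to produce.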

### Planning update — local mock-page preview PR

`docs/release/v1_release_roadmap.md` and `.agent-plan.md` updated:

- Phase 7 grows from 2 PRs to 3.
- New **PR 7.2** sits between the LLM critique PR (now 7.1) and the publish PR (now 7.3): `scripts/preview_kaggle_page.py` and `scripts/preview_hf_page.py` render offline HTML mocks from the *exact* upload artefacts (`release/kaggle/dataset-metadata.json` + inlined README + cover image; `release/huggingface/README.md` with frontmatter + body), serve over `localhost`, accept `--variant=public|instructor` (HF) and `--port` / `--open-browser` flags. Tests cover required-field presence, link resolution, schema-column listing, configs-block round-trip.
- The publish PR's runbook (`v1_release_notes.md`) cites the preview commands as required pre-flight.
- Phase summary table + total PR count (14 → 15) updated for consistency.

Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy
clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67;
`validate_release_candidate --no-rebuild` exits 0;
BUNDLE_SCHEMA_VERSION unchanged at 5.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .agent-plan.md                                | 10 +--
 .../dataset_breakage_report.yml               | 11 ++-
 docs/release/break_me_guide.md                | 87 +++++++++++--------
 docs/release/v1_release_roadmap.md            | 38 +++++---
 release/README.md                             |  2 +-
 scripts/_release_common.py                    |  2 +-
 6 files changed, 91 insertions(+), 59 deletions(-)

diff --git a/.agent-plan.md b/.agent-plan.md
index 33891c2..aceeaab 100644
--- a/.agent-plan.md
+++ b/.agent-plan.md
@@ -62,12 +62,10 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
 - [x] PR 6.2: `release/notebooks/03_leakage_and_time_windows.ipynb` and `release/notebooks/04_lift_calibration_value_ranking.ipynb` added.  Notebook 03 turns the documented `total_touches_all` trap into a teaching moment: reads the trap label off `feature_dictionary.csv`, proves the trap by construction via a same-table comparison of `total_touches_all` (full-horizon) vs `touch_count` (snapshot-safe) — the post-snapshot delta sums to ~3.2 touches/lead and 82 % of leads have a positive delta — and then runs a standalone-AUC probe on the trap (~0.53 AUC, looks innocuous) followed by a side-by-side full-panel ± trap ablation that shows HistGBM extracts ~+0.032 AUC from the same column LR can only squeeze ~+0.009 from.  The reframed pedagogy (vs the prompt's original "trap dominates a thin firmographic set" framing) is empirically driven: firmographic-only is at chance AUC even with the trap, but the GBM-vs-LR asymmetry on the strong panel is a real and useful finding — *standalone AUC probes undersell tree-friendly leakage*.  Sign-aware tolerance gate pins each AUC ±0.02 and asserts `gbm_lift > 0.015` so a future regeneration that erases the trap or accidentally amplifies it breaks CI.  
Notebook 04 covers the four extra ranking lenses AUC alone misses: calibration / reliability diagram (max bin error ≈ 0.13), lift + cumulative gains (top-decile lift 2.75×), value-aware ranking via `expected_acv × P(convert)` (top-50 ACV-capture jumps from 0.16 to 0.40), threshold selection for fixed top-K capacity, cohort-shift evaluation (HistGBM on the first 85 % chronologically → score the last 15 %, mirrors `release_quality.measure_cohort_shift_from_bundle` down to `COHORT_TRAIN_FRAC=0.85` and `model_random_state=0`, **reproduces the report's `cohort_shift.intermediate` block exactly**: 0.8754 / 0.8908 / −0.0155), and a 200-iter bootstrap of the test-set AUC/AP as the within-bundle confidence band that public-bundle consumers (Kaggle / HF) can run without `leadforge` installed (the prompt's "seed-sweep harness" with bootstrap honestly acknowledged as the proxy for true cross-seed sweep, since rebuilding bundles isn't an option for downstream users).  Cohort-shift values are pinned via a new `cohort_shift.intermediate` block in `release/notebooks/_release_targets.json` (audit-synced against `validation_report.cohort_shift.intermediate` by a new `test_cohort_shift_targets_match_validation_report` extension to the existing audit-sync test).  Headline LR/GBM panel **drops** `total_touches_all` (matches notebook 02's posture, gives honest production numbers); cohort-shift section deliberately **keeps** the trap to reproduce the report's published cohort-shift numbers exactly — divergent posture explained inline.  Both new builders (`scripts/build_release_notebook_{03,04}.py`) inherit the deterministic-cell-ID + `--out` byte-stability pattern from PR 6.1 and are added to `_BUILDERS` / `_NOTEBOOKS` in `tests/scripts/test_release_notebook_builders.py` and `tests/release/notebooks/test_execute_notebooks.py`.  
Both notebooks execute end-to-end in <10s each (well under G13.1's 3-min budget), assert `manifest.exposure_mode == "student_public"` (G13.3), and load only from `release/intermediate/`.  Forward-pointer to `docs/release/break_me_guide.md` left as plain backtick-wrapped text — file lands in PR 6.3, no dead Markdown link.  Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.
 - [x] PR 6.3: adversarial framing landed.  `docs/release/break_me_guide.md` (new) — meta-recipe playbook organised as a 4-step recipe (read the dictionary → ablate, don't just probe → check the time window → treat the train/test split as untrusted) + 9 patterns grouped by category (leakage / split discipline / metric and ranking traps / robustness and realism).  Each pattern carries a "how to detect on any dataset" recipe and a "worked example" pointer back into the v1 bundle (notebook §, validation_report JSON path, or feature_dictionary.csv field), so the guide extends the notebooks rather than duplicating them.  Three explicit promises notebook 04 §10 made are delivered: target-encoding leakage on test (pattern 4, anchored on NB02 §4.4), train-test contamination via `account_id` overlap (pattern 5, with the honest "v1 only checks `lead_id`, not `account_id`" caveat), cohort-by-segment evaluation (pattern 6, extends NB04 §7's tier-wide cohort-shift to per-segment using the actual segment columns: `industry`, `region`, `employee_band`, `estimated_revenue_band`).  Other 6 patterns: naming smells, standalone-AUC vs tree-ablation gap (NB03 finding generalised), time-window violations on engineered features (with the `customers`-table example), value-aware ranking surprises (P × ACV vs P-only), threshold-vs-rank ties at the operating point (NB04 §6 finding), calibration drift across cohorts and segments.  Triage-label table at the top (`critical-leakage` / `realism` / `difficulty` / `documentation` / `platform` / `notebook` / `pedagogy` / `v2-idea` / `out-of-scope-v1`) gives reporters a vocabulary; the same labels are auto-applied (`needs-triage`) by the issue templates.  
`docs/release/v2_decision_log.md` (new, empty stub) — schema documented in the file's preamble (7 columns: `received_at` / `source` / `topic` / `severity` / `verdict` / `next_step` / `link`; verdict vocabulary `accepted-for-v2` / `deferred` / `wont-fix` / `needs-investigation` with explicit semantics for each).  `.github/ISSUE_TEMPLATE/dataset_breakage_report.yml` (new) and `.github/ISSUE_TEMPLATE/realism_feedback.yml` (new) — GitHub Issue Forms YAML, both carry the `dataset: leadforge-lead-scoring-v1` + `needs-triage` labels.  Breakage report: tier dropdown (intro / intermediate / advanced / instructor / multiple), seed input (default 42), bundle hash field (validation: required), suggested triage label dropdown, severity dropdown (high/medium/low), summary, minimal repro, expected-vs-actual citing JSON paths, environment, two confirmation checkboxes (read break-me guide; reporting on as-shipped bundle).  Realism feedback: aspect dropdown (industry mix / persona / funnel timing / channel / pricing / account-to-lead density / region / other), tier(s)-affected dropdown, domain-experience one-liner (required — helps weight findings), claim, data observation (with concrete pandas-snippet placeholder example), suggested fix (optional), severity, two confirmations (read README "Known limitations"; checked post_v1_roadmap + v2_decision_log).  Notebook 03 §7 and notebook 04 §10 forward-pointers upgraded from plain `docs/release/break_me_guide.md` text to Markdown links pointing at the GitHub blob URL (`https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md`) — relative path would break on Kaggle/HF where notebooks ship without the `docs/` tree, the blob URL works in both contexts.  
`release/README.md` "Maintenance, adversarial framing, license" section rewritten: dead "(PR 6.3)" forward-pointers replaced with real Markdown links to the break-me guide, both issue templates, and the v2 decision log; `_release_common.py`'s existing `](../foo)` → GitHub-blob-URL rewriter handles the Kaggle/HF rendering automatically (verified by the regenerated `release/kaggle/dataset-metadata.json` and `release/huggingface/README.md` sync tests).  Hostile-reviewer self-review caught two factual hallucinations in the first revision before they shipped: claimed "15 industries" for `industry` (actually 4: logistics / healthcare_non_clinical / manufacturing / professional_services) and used loose segment-column names ("employee tier", "ARR band") instead of the actual columns (`employee_band`, `estimated_revenue_band`); both fixed.  Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5 (this PR is documentation-only).  Phase 6 closed — Phase 7 (LLM critique + publish) is next.
 
-### Phase 7 — LLM critique + publish
-- [ ] `leadforge/validation/llm_critique.py` (single-provider, env-var creds, skips cleanly)
-- [ ] `docs/release/llm_critique_prompt.md` + `scripts/run_llm_critique.py`
-- [ ] Adjudicate any high-severity findings (resolve in code or document in `v2_decision_log.md`)
-- [ ] `scripts/{publish_kaggle,publish_hf}.py` (dry-run → private/draft → public)
-- [ ] Tag `leadforge-lead-scoring-v1`; `docs/release/v1_release_notes.md`
+### Phase 7 — LLM critique + publish (3 PRs)
+- [ ] **PR 7.1** — `leadforge/validation/llm_critique.py` (single-provider, env-var creds, skips cleanly) + `docs/release/llm_critique_prompt.md` + `scripts/run_llm_critique.py`. Adjudicate any high-severity findings (resolve in code or document in `v2_decision_log.md`).
+- [ ] **PR 7.2** — local Kaggle + HuggingFace mock-page preview tooling (must land before PR 7.3): `scripts/preview_kaggle_page.py` and `scripts/preview_hf_page.py` render offline HTML mocks of the public Kaggle and HF dataset pages from the *exact* upload artefacts (metadata JSON, README, cover image), serve over `localhost`, and let the maintainer click through both pages in a browser before any platform upload — catches styling / link / YAML-rendering issues before they hit cached previews on the live page. Tests cover required-field presence, link resolution, schema column listing, configs-block round-trip.
+- [ ] **PR 7.3** — `scripts/{publish_kaggle,publish_hf}.py` (dry-run → local mock-page review → private/draft → public). Tag `leadforge-lead-scoring-v1`; `docs/release/v1_release_notes.md` (cites PR 7.2's preview commands as required pre-flight).
 
 ---
 
diff --git a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml
index 611d0d0..1b7a2e1 100644
--- a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml
+++ b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml
@@ -39,10 +39,15 @@ body:
     attributes:
       label: Bundle hash
       description: |
-        Paste a hash that identifies the exact bundle. Two equivalent forms:
+        Paste a hash that identifies the exact bundle. Easiest source:
+        copy `tasks.converted_within_90_days.test_sha256` straight out
+        of `manifest.json` (it's pinned at bundle generation time, so
+        no local hashing is needed). If you've modified the file
+        locally, recompute via:
 
-        - From `manifest.json` → `tasks.converted_within_90_days.test_sha256` (the test-split sha256 the bundle records; pinned at generation time).
-        - From a local `sha256sum tasks/converted_within_90_days/test.parquet`.
+        - **macOS:** `shasum -a 256 tasks/converted_within_90_days/test.parquet`
+        - **Linux:** `sha256sum tasks/converted_within_90_days/test.parquet`
+        - **Cross-platform:** `python -c "import hashlib,sys; print(hashlib.sha256(open(sys.argv[1],'rb').read()).hexdigest())" tasks/converted_within_90_days/test.parquet`
 
         This makes the report unambiguous if we regenerate.
       placeholder: "d428c07decc2b1fdf8b5f56a1a63c65799897f1e22b61afd9f5d517f74593f09"
diff --git a/docs/release/break_me_guide.md b/docs/release/break_me_guide.md
index 1b326ea..6548626 100644
--- a/docs/release/break_me_guide.md
+++ b/docs/release/break_me_guide.md
@@ -11,9 +11,9 @@ you can reproduce.
 
 If you find one of these on `leadforge-lead-scoring-v1`,
 file an issue using one of the templates in
-[`.github/ISSUE_TEMPLATE/`](https://github.com/leadforge-dev/leadforge/tree/main/.github/ISSUE_TEMPLATE).
+[`.github/ISSUE_TEMPLATE/`](../../.github/ISSUE_TEMPLATE).
 Accepted findings are logged in
-[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md).
+[`v2_decision_log.md`](v2_decision_log.md).
 
 ## Triage labels
 
@@ -96,7 +96,7 @@ still hand a tree model material lift once interactions with
 other columns are available. The validation report's
 `post_snapshot_aggregates` baseline (HistGBM on the trap
 column alone, see
-[`leadforge/validation/release_quality.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/release_quality.py))
+[`leadforge/validation/release_quality.py`](../../leadforge/validation/release_quality.py))
 gives ~0.55 AUC on intermediate (median across seeds 42–46;
 0.52–0.61 across all tier × seed pairs) — the trap "looks"
 innocuous even when scored by a tree model on its own.
@@ -154,10 +154,11 @@ and you've leaked test labels into the feature. Notebook 02
 (four industries — logistics, healthcare_non_clinical,
 manufacturing, professional_services — encoded by their
 training-split conversion rate, with a global-mean fallback
-for industries not seen in train). The leakage version is a one-line change — using
-`pd.concat([train, test]).groupby('industry')['target'].mean()`
-instead — and we deliberately *don't* show that in the
-notebook because the lesson is the discipline, not the trap.
+for industries not seen in train). The leakage variant is a
+one-liner — `pd.concat([train, test]).groupby('industry')['target'].mean()`
+— and the notebook deliberately doesn't show it, because the
+lesson there is the discipline. This guide shows the leakage
+form (above) so you recognise it during code review.
 
 **How to detect on any dataset.** When mean-target encoding
 shows up in a notebook or pipeline, check three things in
@@ -183,20 +184,30 @@ fallback-to-train-mean handling is in `attach_engineered`.
 The bundle ships a deterministic 70/15/15 split on `lead_id`
 (see `tasks/<task>/task_manifest.json`). That guarantees
 `lead_id` uniqueness across splits — but `account_id` is
-*not* split on. Two leads in the same account can land in
-train and test, and the model can ride strong account-level
-signal across the split boundary in ways that don't generalise
-to a fresh account.
-
-**How to detect on any dataset.** Compute the intersection
-of `account_id` (or whatever the per-entity grouping key is)
-between train and test. If it's non-empty *and* you've
-engineered any account-level features, retrain with
-account-level grouped splitting (e.g. `GroupKFold` on
-`account_id`) and re-read the AUC delta. The delta is the
-amount of "free" lift the random-split was buying you. The
-right framing isn't "remove the leak"; it's *report both
-numbers so the reader knows which is which.*
+*not* split on. On the as-shipped intermediate bundle,
+**518 of 557 test accounts (93 %) also appear in train**;
+the same numbers hold on intro and advanced because the
+splitter is `lead_id`-keyed and tier-invariant. Models can
+ride strong account-level signal across the split boundary
+in ways that don't generalise to a fresh account.
+
+**How to detect on any dataset.**
+
+```python
+import pandas as pd
+train = pd.read_parquet("intermediate/tasks/converted_within_90_days/train.parquet")
+test  = pd.read_parquet("intermediate/tasks/converted_within_90_days/test.parquet")
+overlap = set(train["account_id"]) & set(test["account_id"])
+print(f"shared accounts: {len(overlap)} / {test['account_id'].nunique()}")
+```
+
+If the overlap is non-empty *and* you've engineered any
+account-level features, retrain with account-level grouped
+splitting (e.g. `GroupKFold` on `account_id`) and re-read the
+AUC delta. The delta is the amount of "free" lift the
+random-split was buying you. The right framing isn't "remove
+the leak"; it's *report both numbers so the reader knows
+which is which.*
 
 **Worked example.** Notebook 02 §4.2 builds an account-level
 density feature using *only* train leads' touches — a
@@ -227,10 +238,10 @@ down by 0.04 in the average.
 `estimated_revenue_band`), repeat the cohort-split protocol
 from notebook 04 §7 conditioned on that segment. Report the
 per-segment AUC degradation and the spread across segments.
-A spread larger than your tier-wide cross-seed band
-(`$.tiers.<tier>.spreads.lr_auc`) is a realism flag — the
-simulator is producing a homogeneous world that real
-production cohorts wouldn't be.
+A spread larger than the tier's cross-seed GBM-AUC band
+(`$.tiers.<tier>.spreads.gbm_auc` — same model the cohort-shift
+block uses) is a realism flag: the simulator is producing a
+homogeneous world that real production cohorts wouldn't be.
 
 **Worked example.** Notebook 04 §7 (tier-wide, validator-
 mirrored). The validation report's `cohort_shift.<tier>.auc_degradation`
@@ -261,9 +272,10 @@ so large it suggests the simulator's ACV column has unrealistic
 correlation with P(convert).
 
 **Worked example.** Notebook 04 §5 produces both curves
-side-by-side; the validation report's
-`$.tiers.<tier>.per_seed[*].expected_acv_capture_at_k`
-gives the canonical numbers across seeds.
+side-by-side; the validation report's per-seed scalars live
+under
+`$.tiers.<tier>.per_seed[*].expected_acv_capture_at_k.50`
+(and `.100` for top-100), keyed by string K.
 
 ### 8. Threshold-vs-rank semantics
 
@@ -272,11 +284,9 @@ rank` operating point are not the same thing when probabilities
 have ties. Notebook 04 §6 picks a threshold that "should"
 admit 50 leads and reads back `actually_above` as a defensive
 instrument — on the as-shipped intermediate bundle the realised
-count happens to match capacity, but the readout exists so a
-seed where ties cluster at the operating probability fails
-loud rather than silently inflating the slate. On a calibrated
-LR with continuous scores, ties are rare; on a coarse-grained
-GBM probability output they're routine.
+count matches capacity, but the readout exists so a seed where
+ties cluster at the operating probability fails loud rather
+than silently inflating the slate.
 
 **How to detect on any dataset.** When you set a probability
 threshold for a fixed-capacity decision, always log the
@@ -338,13 +348,14 @@ doesn't ship a per-segment calibration audit; it's a
    under `tables.<name>.sha256` if a table-specific hash is
    the right anchor for the finding.
 2. Pick the issue template that fits — leakage / contamination
-   / metric findings go in [`dataset_breakage_report.yml`](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml);
+   / metric findings go in
+   [`dataset_breakage_report.yml`](../../.github/ISSUE_TEMPLATE/dataset_breakage_report.yml);
    distributional / realism critiques go in
-   [`realism_feedback.yml`](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml).
+   [`realism_feedback.yml`](../../.github/ISSUE_TEMPLATE/realism_feedback.yml).
 3. Suggest a triage label from the table at the top of this
    guide. The maintainer applies the final label.
-4. Watch [`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md)
-   for the disposition. Accepted findings get an entry with
-   a verdict (`accepted-for-v2`, `deferred`, `wont-fix`,
+4. Watch [`v2_decision_log.md`](v2_decision_log.md) for the
+   disposition. Accepted findings get an entry with a verdict
+   (`accepted-for-v2`, `deferred`, `wont-fix`,
    `needs-investigation`) and a pointer to the resulting v2
    work item.
diff --git a/docs/release/v1_release_roadmap.md b/docs/release/v1_release_roadmap.md
index e7b0bb2..520f042 100644
--- a/docs/release/v1_release_roadmap.md
+++ b/docs/release/v1_release_roadmap.md
@@ -44,13 +44,13 @@ A release candidate is v1-ready when **all** of the following hold. Concrete ban
 | 4 | Channel-signal audit + dataset card | M-S | 1 | 3 | not started |
 | 5 | Platform packaging | M | 2 | 4 | not started |
 | 6 | Notebook sequence + adversarial framing | M-L | 3 | 5 | not started |
-| 7 | LLM critique + publish | M | 2 | 6 | not started |
+| 7 | LLM critique + publish | M | 3 | 6 | not started |
 
-**Total: 14 PRs.** Each PR follows the `CLAUDE.md` workflow: branch → commit → update `.agent-plan.md` → PR with type+layer labels → milestone assignment (`dataset: leadforge-lead-scoring-v1`). PR-level decomposition is in the **PR breakdown** section immediately below.
+**Total: 15 PRs.** Each PR follows the `CLAUDE.md` workflow: branch → commit → update `.agent-plan.md` → PR with type+layer labels → milestone assignment (`dataset: leadforge-lead-scoring-v1`). PR-level decomposition is in the **PR breakdown** section immediately below.
 
 ## PR breakdown
 
-First-cut decomposition of the 7 phases into ~14 PRs. The numbering `phase.seq` is a planning ID, not a GitHub PR number. Sizes are estimates; we may merge or split during implementation. Within a phase, PRs are typically sequential (later sub-PRs depend on earlier ones); cross-phase dependencies follow the phase summary above.
+First-cut decomposition of the 7 phases into ~15 PRs. The numbering `phase.seq` is a planning ID, not a GitHub PR number. Sizes are estimates; we may merge or split during implementation. Within a phase, PRs are typically sequential (later sub-PRs depend on earlier ones); cross-phase dependencies follow the phase summary above.
 
 ### Phase 1 — Audit and naming (1 PR)
 
@@ -154,7 +154,7 @@ First-cut decomposition of the 7 phases into ~14 PRs. The numbering `phase.seq`
   - Labels: `type: docs`
   - Size: S (~300 lines)
 
-### Phase 7 — LLM critique + publish (2 PRs)
+### Phase 7 — LLM critique + publish (3 PRs)
 
 - **PR 7.1** — `feat(validation): llm_critique module + prompt + driver`
   - `leadforge/validation/llm_critique.py` — single-provider, env-var creds, skip-cleanly without
@@ -165,18 +165,29 @@ First-cut decomposition of the 7 phases into ~14 PRs. The numbering `phase.seq`
   - Labels: `type: feature`, `layer: validation`
   - Size: M (~500 lines)
 
-- **PR 7.2** — `feat(scripts): publish_kaggle + publish_hf + tag v1 release`
+- **PR 7.2** — `feat(scripts): local Kaggle + HuggingFace mock-page preview` ⚠️ **must land before PR 7.3**
+  - **Goal:** before any real Kaggle/HF publish, the maintainer can render a faithful local preview of how each platform will display the dataset and click through it in a browser. Catch styling, link, embed, and YAML-rendering issues *before* they land on the live page where rollback is expensive (Kaggle and HF both keep cached previews around).
+  - `scripts/preview_kaggle_page.py` — reads `release/kaggle/dataset-metadata.json` + the inlined README + the cover image, renders an offline HTML mock that mimics the public Kaggle dataset page (header, description, schema/columns table, file tree, license footer). Serves on `http://localhost:8765` via `python -m http.server` or a small Flask shim.
+  - `scripts/preview_hf_page.py` — reads `release/huggingface/README.md` (YAML frontmatter + body), renders an offline HTML mock that mimics the HF dataset page (frontmatter pills, configs dropdown, README body, file tree). Serves on `http://localhost:8766`.
+  - Both scripts: `--release-dir`, `--port`, `--variant=public|instructor` (HF only), `--open-browser`. Dry-run / no-network.
+  - Both must round-trip the *exact* artefacts the publish PR will upload — same metadata JSON, same README, same cover image — so the preview is faithful, not a sketch.
+  - Tests: `tests/scripts/test_preview_kaggle_page.py` + `tests/scripts/test_preview_hf_page.py`. Each renders the page once and asserts: required field labels appear, every Markdown link in the source resolves to a non-404 URL pattern, every config block is present, the Kaggle schema table lists every CSV/parquet column.
+  - Pedagogically: this is the staging gate. The release runbook (`docs/release/v1_release_notes.md` in PR 7.3) cites both preview commands as required steps before `kaggle datasets create` / `huggingface-cli upload`.
+  - Labels: `type: feature`, `layer: cli`
+  - Size: M (~600 lines — two HTML templates + two render scripts + two test files)
+
+- **PR 7.3** — `feat(scripts): publish_kaggle + publish_hf + tag v1 release`
   - `scripts/publish_kaggle.py`
   - `scripts/publish_hf.py`
   - `docs/release/v1_release_notes.md`
-  - Dry-run → private/draft → public publish (manual step performed by maintainer with credentials, within the PR or as a follow-up release tag)
+  - Dry-run → private/draft → public publish (manual step performed by maintainer with credentials, within the PR or as a follow-up release tag). The runbook references PR 7.2's preview commands as a required pre-flight.
   - Tag `leadforge-lead-scoring-v1`
   - Labels: `type: feature`, `layer: cli`
   - Size: S (~300 lines code + manual publish step)
 
 ## PR breakdown — totals
 
-- **14 PRs** across 7 phases.
+- **15 PRs** across 7 phases.
 - Estimated total LoC: ~6,500 (excluding regenerated parquet bundles and notebook JSON).
 - All 14 PRs target the `dataset: leadforge-lead-scoring-v1` GitHub milestone.
 - Calendar duration is not committed; depends on iteration cadence and review feedback.
@@ -420,22 +431,29 @@ First-cut decomposition of the 7 phases into ~14 PRs. The numbering `phase.seq`
 - New `docs/release/llm_critique_prompt.md` — the rubric document, structured as the prompt the script feeds.
 - New `scripts/run_llm_critique.py` — driver: builds the input bundle (README.md, dataset card, generation method, manifest, feature dictionary, validation report, first 100 public rows, public/instructor diff summary, public-safe mechanism summary) → calls the critique → writes `release/validation/llm_critique_raw_*.json` and `release/validation/llm_critique_summary.md`.
 - Adjudicate any high-severity findings; resolve in code or document acknowledgment in `v2_decision_log.md` if intentional-and-accepted.
+- **Local mock-page preview (PR 7.2 — must land before publish):** maintainer renders Kaggle and HF dataset pages locally from the actual upload artefacts (the same metadata JSON, README, cover image the publish PR will use) and clicks through them in a browser before any platform upload, so styling / link / YAML-rendering issues are caught before they hit cached previews on the live page.
+  - `scripts/preview_kaggle_page.py` — reads `release/kaggle/dataset-metadata.json` + inlined README + cover image, renders an offline HTML page that mimics the public Kaggle dataset view.
+  - `scripts/preview_hf_page.py` — reads `release/huggingface/README.md` (frontmatter + body), renders the analogous HF view.
+  - Both serve over `python -m http.server` (or a small Flask shim) and accept `--variant=public|instructor` (HF), `--port`, `--open-browser`.
+  - Tests: required field labels appear, every Markdown link resolves to a non-404 URL pattern, every config block is present, the Kaggle schema table lists every CSV/parquet column.
 - New `scripts/publish_kaggle.py` — uses `kagglehub.dataset_upload()` with `version_notes` containing the commit hash and tag.
 - New `scripts/publish_hf.py` — uses `huggingface_hub.HfApi().upload_folder()` with the dataset repo type.
 - Tag the release: `leadforge-lead-scoring-v1`. Tag the leadforge package release if a coordinated package version bump is needed (TBD — likely just a patch bump).
-- `docs/release/v1_release_notes.md` — public-facing release notes.
-- Both publish scripts exercised in **dry-run** before actual upload, then upload to **private/draft** repos for download smoke test, then promote to public.
+- `docs/release/v1_release_notes.md` — public-facing release notes; references the PR 7.2 preview commands as a required pre-flight step.
+- Both publish scripts exercised in **dry-run** before actual upload, **and the local mock-page previews from PR 7.2 reviewed in a browser**, then upload to **private/draft** repos for download smoke test, then promote to public.
 
 **Files touched:**
 - `leadforge/validation/llm_critique.py` (new)
 - `docs/release/llm_critique_prompt.md` (new)
 - `docs/release/v1_release_notes.md` (new)
-- `scripts/run_llm_critique.py`, `scripts/publish_kaggle.py`, `scripts/publish_hf.py` (new)
+- `scripts/run_llm_critique.py`, `scripts/preview_kaggle_page.py`, `scripts/preview_hf_page.py`, `scripts/publish_kaggle.py`, `scripts/publish_hf.py` (new)
+- `tests/scripts/test_preview_kaggle_page.py`, `tests/scripts/test_preview_hf_page.py` (new)
 - `release/validation/llm_critique_raw_*.json`, `release/validation/llm_critique_summary.md` (output artifacts)
 
 **Acceptance:**
 - LLM critique runs successfully with credentials; produces structured findings.
 - No unresolved high-severity findings before tag.
+- Local Kaggle and HF preview pages render against the as-shipped upload artefacts and are reviewed in a browser before any platform upload.
 - Both platform publishes succeed in dry-run.
 - Both private/draft uploads succeed; download smoke test passes from a clean environment.
 - Public Kaggle and HF pages render the dataset; `load_dataset()` from a clean env works.
diff --git a/release/README.md b/release/README.md
index d27b608..d548614 100644
--- a/release/README.md
+++ b/release/README.md
@@ -35,7 +35,7 @@ release/
 │   ├── tables/*.parquet              # 7 snapshot-safe relational tables
 │   └── tasks/converted_within_90_days/{train,valid,test}.parquet
 ├── intermediate_instructor/          # research companion: full-horizon tables + metadata/
-├── notebooks/01_baseline_lead_scoring.ipynb
+├── notebooks/                        # 01 baseline · 02 relational · 03 leakage · 04 calibration
 └── validation/                       # validation_report.{json,md} + figures
 ```
 
diff --git a/scripts/_release_common.py b/scripts/_release_common.py
index 3a59858..4d5f4d9 100644
--- a/scripts/_release_common.py
+++ b/scripts/_release_common.py
@@ -101,7 +101,7 @@ class ValidationError:
 │   ├── tables/*.parquet              # 7 snapshot-safe relational tables
 │   └── tasks/converted_within_90_days/{train,valid,test}.parquet
 ├── intermediate_instructor/          # research companion: full-horizon tables + metadata/
-├── notebooks/01_baseline_lead_scoring.ipynb
+├── notebooks/                        # 01 baseline · 02 relational · 03 leakage · 04 calibration
 └── validation/                       # validation_report.{json,md} + figures
 ```"""
 

From 4f70a16527fd3f9f0864584a2e69214912eaecff Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Fri, 8 May 2026 00:33:17 +0300
Subject: [PATCH 4/4] PR 6.3: address Copilot review on issue templates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three Copilot comments on the issue templates, all accepted:

1. **COPILOT-1** (dataset_breakage_report.yml:9 → realism template
   link). The relative `realism_feedback.yml` link doesn't resolve
   when the description renders in the GitHub Issue Forms chooser.
   Replaced with the "open another template directly" URL form
   (`/issues/new?template=realism_feedback.yml`), which lands a
   misrouted reporter on the realism *form*, not on raw YAML.

2. **COPILOT-2** (dataset_breakage_report.yml:139 → `render: text`).
   `render: text` isn't a documented value for Issue Forms textareas
   (the supported set is language identifiers from GitHub's syntax
   highlighter, plus `markdown`). The intent here is unhighlighted
   text, and omitting `render` achieves that: the submission is then
   treated as ordinary Markdown instead of being wrapped in a code
   block. Dropped the `render` line entirely.

3. **COPILOT-3** (realism_feedback.yml:58 → placeholder mentions
   `retail`). Same hallucination class as the round-1 "15 industries"
   fix. The actual industries are logistics, healthcare_non_clinical,
   manufacturing, professional_services. Rewrote the placeholder to
   use the real names so a reporter glancing at the example doesn't
   think `retail` is a category they should be filing about.

Net: 1260/1260 tests pass + 5 publish-extra-gated skips; ruff + mypy
clean; YAML validates; BUNDLE_SCHEMA_VERSION unchanged at 5.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
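Note on COPILOT-1: the fix swaps a relative file link for the
"open this template directly" URL. A minimal Python sketch of how that
URL is assembled, assuming a hypothetical helper named
`issue_template_url` (it is not part of this patch; the repository slug
and template filename are taken from the diff below):

```python
from urllib.parse import urlencode


def issue_template_url(repo: str, template: str) -> str:
    """Build the GitHub URL that opens a specific issue template.

    `repo` is an "owner/name" slug; `template` is the template's filename
    under .github/ISSUE_TEMPLATE/. Hypothetical helper, for illustration only.
    """
    # urlencode escapes any characters that are unsafe in a query string.
    query = urlencode({"template": template})
    return f"https://github.com/{repo}/issues/new?{query}"


url = issue_template_url("leadforge-dev/leadforge", "realism_feedback.yml")
```

Because the link is absolute, it should resolve the same way wherever the
template description is rendered.
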
 .github/ISSUE_TEMPLATE/dataset_breakage_report.yml | 3 +--
 .github/ISSUE_TEMPLATE/realism_feedback.yml        | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)
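
Note on COPILOT-3: one way to keep this class of mistake from recurring
is a small guard that scans template text for industry names the
dataset does not contain. The sketch below is an assumption, not code
from this repository: `REAL_INDUSTRIES` copies the four names listed in
the commit message, while `SUSPECT_NAMES` and `stray_industry_names` are
hypothetical and would need to be tuned by hand.

```python
import re

# The four industries named in the commit message above.
REAL_INDUSTRIES = frozenset(
    {"logistics", "healthcare_non_clinical", "manufacturing", "professional_services"}
)

# Hypothetical denylist of plausible-sounding names that are NOT in the dataset.
SUSPECT_NAMES = frozenset({"retail", "finance", "education", "hospitality"})


def stray_industry_names(text: str) -> list[str]:
    """Return denylisted industry names that appear in `text`, sorted."""
    # Tokenise on lowercase letters and underscores so that
    # "healthcare_non_clinical" stays a single token.
    tokens = set(re.findall(r"[a-z_]+", text.lower()))
    return sorted(tokens & (SUSPECT_NAMES - REAL_INDUSTRIES))


old_text = "identical (within noise) to the rate on healthcare and retail."
new_text = "the rates on logistics, healthcare_non_clinical, and professional_services."
old_hits = stray_industry_names(old_text)
new_hits = stray_industry_names(new_text)
```

With these inputs the old placeholder wording is flagged for `retail`
and the rewritten wording yields no hits. A denylist only catches names
someone thought to list, so it complements review and does not replace
it.
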

diff --git a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml
index 1b7a2e1..4809c9b 100644
--- a/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml
+++ b/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml
@@ -6,7 +6,7 @@ body:
   - type: markdown
     attributes:
       value: |
-        Thank you for breaking the dataset on purpose. This template is for findings that affect *what's in the bundle* — leakage, split contamination, metric inversions, notebook failures. Distributional / realism critiques (e.g. "industry mix doesn't look like real procurement") belong in the [realism feedback template](realism_feedback.yml) instead.
+        Thank you for breaking the dataset on purpose. This template is for findings that affect *what's in the bundle* — leakage, split contamination, metric inversions, notebook failures. Distributional / realism critiques (e.g. "industry mix doesn't look like real procurement") belong in the [realism feedback template](https://github.com/leadforge-dev/leadforge/issues/new?template=realism_feedback.yml) instead.
 
         The [break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues the patterns this template is shaped around.
 
@@ -136,7 +136,6 @@ body:
         scikit-learn==1.5.0
         pandas==2.2.2
         Python 3.11.9, macOS 14.5
-      render: text
     validations:
       required: false
 
diff --git a/.github/ISSUE_TEMPLATE/realism_feedback.yml b/.github/ISSUE_TEMPLATE/realism_feedback.yml
index 07091f2..a37aa66 100644
--- a/.github/ISSUE_TEMPLATE/realism_feedback.yml
+++ b/.github/ISSUE_TEMPLATE/realism_feedback.yml
@@ -55,7 +55,7 @@ body:
       label: Claim
       description: What does the dataset get wrong, and what would you expect instead? One paragraph.
       placeholder: |
-        The intermediate tier shows a 22% conversion rate on the manufacturing industry slice, identical (within noise) to the rate on healthcare and retail. In real procurement / AP automation, manufacturing typically converts ~1.5x healthcare because manufacturing already has discrete-item AP volume that benefits more from automation. The dataset's conversion rate should differ across industries by 1.3-2x, not be flat.
+        The intermediate tier shows a 22% conversion rate on the manufacturing industry slice, identical (within noise) to the rates on logistics, healthcare_non_clinical, and professional_services. In real procurement / AP automation, manufacturing typically converts ~1.5x healthcare_non_clinical because manufacturing already has discrete-item AP volume that benefits more from automation. The dataset's conversion rate should differ across industries by 1.3-2x, not be flat.
     validations:
       required: true