From 15413f89c14a57f2a4a7a930f4f57cc71a603cbe Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Sat, 9 May 2026 08:40:02 +0300
Subject: [PATCH 1/6] PR 7.2 scaffold: design doc + markdown-it-py [publish]
 dep + .gitignore for preview output
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- docs/release/preview_pages_design.md — 10-decision table covering
  scripts shape, server (stdlib http.server), templates (f-strings),
  Markdown renderer (markdown-it-py via [publish] extra), output dirs,
  CLI shape, audit-artefact-sync, test posture, link-resolution rule,
  out-of-scope.
- pyproject.toml [publish] gains markdown-it-py>=3.0 alongside
  datasets / kaggle. Same gating posture as PR 5.1 / 5.2 — preview
  scripts raise a clean ImportError pointing at this extra when missing.
- .gitignore: release/_preview/ runtime output excluded; the audit-sync
  samples under release/_preview_committed/ are checked in separately.
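
A minimal sketch of the "clean ImportError pointing at this extra" gate,
assuming a small helper. The name require_extra and the exact message
wording are hypothetical; the preview scripts may phrase it differently.

```python
import importlib
from types import ModuleType


def require_extra(module_name: str, extra: str = "publish") -> ModuleType:
    """Import module_name, or raise ImportError naming the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an actionable hint instead of a bare "No module named ...".
        raise ImportError(
            f"{module_name!r} is required for the preview scripts; "
            f'install it with: pip install -e ".[{extra}]"'
        ) from exc


# An installed module imports normally; a missing one yields the hint.
json_module = require_extra("json")
try:
    require_extra("definitely_not_installed_module_xyz")
except ImportError as exc:
    hint = str(exc)
```

A script would call require_extra("markdown_it") once at render time, so
importing the script itself never fails on a default dev install.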

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .gitignore                           |  6 +++
 docs/release/preview_pages_design.md | 59 ++++++++++++++++++++++++++++
 pyproject.toml                       |  6 ++-
 3 files changed, 70 insertions(+), 1 deletion(-)
 create mode 100644 docs/release/preview_pages_design.md
diff --git a/.gitignore b/.gitignore
index a991601..e9893bd 100644
--- a/.gitignore
+++ b/.gitignore
@@ -233,3 +233,9 @@ release/huggingface/*
 !release/huggingface/README.md
 release/huggingface-instructor/*
 !release/huggingface-instructor/README.md
+
+# Generated local preview-page output (PR 7.2) — runtime HTML rendered
+# by scripts/preview_{kaggle,hf}_page.py.  The committed sample HTML
+# under release/_preview_committed/ is the audit-artefact-sync gate
+# and is checked into git separately.
+release/_preview/
diff --git a/docs/release/preview_pages_design.md b/docs/release/preview_pages_design.md
new file mode 100644
index 0000000..0742f59
--- /dev/null
+++ b/docs/release/preview_pages_design.md
@@ -0,0 +1,59 @@
+# PR 7.2 — Local Kaggle / HF preview-page design notes
+
+Working notes for `scripts/preview_kaggle_page.py`,
+`scripts/preview_hf_page.py`, their tests, and the committed
+sample-rendered HTML used as the audit-artefact-sync gate. Captured
+before implementation; kept short on purpose.
+
+The PR's pedagogical role is the *staging gate* before PR 7.3: the
+maintainer renders both platforms locally from the same artefacts the
+publish PR will upload, clicks through them in a browser, and catches
+styling / link / YAML-rendering issues before they hit cached
+previews on the live page.
+
+## Decisions
+
+| # | Decision | Why |
+|---|---|---|
+| 1 | Two scripts, one per platform. Not a unified renderer. | Kaggle and HF have different inputs (`dataset-metadata.json` vs YAML-frontmatter README) and different page structures (schema/columns table vs configs dropdown). One file per platform keeps each renderer locally complete and the diff readable. |
+| 2 | Server: stdlib `http.server.ThreadingHTTPServer` + `webbrowser.open()`. No Flask. | The pages are static HTML over a fixed file tree. A web framework would be a new dep with no benefit; the brief explicitly suggests stdlib. |
+| 3 | Templates: f-string helpers, not Jinja2. | The layout is stable; two pages don't justify a templating engine. f-string helpers keep the renderer in one file and free of a new dep. |
+| 4 | Markdown→HTML via `markdown-it-py` (added to `[publish]` extra alongside `datasets` / `kaggle`). | Faithfulness is the goal: Kaggle and HF both render the README body as Markdown, and hand-rolling a renderer for tables / fenced code / footnotes is brittle. `markdown-it-py` is MIT, pure-Python, CommonMark+GFM. The `[publish]` extra is the right home: this is a publish-pipeline tool and mirrors the PR 5.1 / 5.2 gating posture. A missing dep raises a clean `ImportError` that points at `pip install -e ".[publish]"`. |
+| 5 | Output dir: `release/_preview/<platform>/` (gitignored). | Mirrors `release/_release_quality/` convention. The committed audit-sync samples live at `release/_preview_committed/{kaggle,huggingface_public,huggingface_instructor}.html` so they don't collide with runtime output. |
+| 6 | Cover image served from the preview tree (copied in, not referenced). | Both platforms inline-display the cover image; serving it under the preview root means the rendered HTML's `<img src="dataset-cover-image.png">` works without absolute paths. The committed sample HTML uses the same relative reference — no path drift between the sample and what the local server emits. |
+| 7 | HF `--variant=public\|instructor` reads either `release/huggingface/README.md` or `release/huggingface-instructor/README.md`. Different YAML, different file tree, different name. Kaggle has no instructor variant (Kaggle ships public only). | Matches the publish reality (HF gets a separate instructor companion repo per PR 5.2; Kaggle does not). |
+| 8 | CLI mirrors `validate_release_candidate.py` / `run_llm_critique.py`: free-function `parse_args`, frozen `Config`, `run_preview(config) -> Outcome`, `main(argv) -> int`. Exit codes 0 success / 2 pre-flight error. Flags: `--release-dir`, `--port` (8765 Kaggle / 8766 HF), `--out-dir`, `--variant` (HF only), `--open-browser`, `--no-serve`. | Maintainer muscle memory + small surface. `--no-serve` is the CI / inspection mode (build HTML, exit 0). `--open-browser` pops a tab on startup. |
+| 9 | Audit-artefact-sync. The renderer is pure: `(metadata.json \| README + YAML, cover image filename) -> HTML`. No `now()`, no randomness. Committed HTML at `release/_preview_committed/*.html` must equal a fresh regeneration byte-for-byte. Same pattern as PR 4.1 / 5.1 / 5.2 / 7.1. | Determinism is the gate against silent drift. The committed HTML doubles as a human-inspectable sample for reviewers who don't want to run the script. |
+| 10 | Test posture: in-process. No live HTTP. Each test renders the page once via `render_kaggle_html()` / `render_hf_html()` and asserts against the rendered string with substring + regex. No BeautifulSoup dep (avoidable for the assertion bar we need). The four roadmap-mandated checks: required field labels appear; every Markdown link in the source resolves to a non-404 URL pattern; every config block (HF) round-trips; the Kaggle schema table lists every CSV / parquet column from `resources[].schema.fields`. | Per the brief — no live HTTP, no new test deps unless necessary. Substring assertions on deterministic rendered HTML give the same coverage with less surface. |
+
+## Link-resolution rule (test pin)
+
+Every Markdown link `](URL)` in the README body the renderer ingests
+must satisfy ONE of:
+
+1. Absolute `https://github.com/leadforge-dev/leadforge/...` URL (the
+   rewrite output of `_release_common.py::rewrite_release_links()`).
+2. External absolute URL on a known-OK domain (`https://huggingface.co`,
+   `https://github.com/leadforge-dev/leadforge`, footnote anchors).
+3. Relative path that resolves to a file under the upload tree
+   (e.g. `LICENSE` → `release/<platform>/LICENSE`).
+
+A `](../foo)` link or a `](validation/...)` link in the rendered
+HTML is a regression — those are exactly what the platform packagers'
+rewrite is supposed to canonicalise away. The test fires loud the
+moment the rewrite stops doing its job for the upstream artefact the
+preview renders.
+
+## What this PR does not touch
+
+- `BUNDLE_SCHEMA_VERSION` stays at 5.
+- `release/validation/validation_report.{json,md}` does not regenerate
+  (revert any timestamp drift before commit).
+- PR 7.3 (publish + tag) is a separate PR; the runbook there will cite
+  the two preview commands as a required pre-flight step.
+- No change to the platform packagers (`scripts/package_{kaggle,hf}_release.py`)
+  or `_release_common.py`. The preview reads what the packagers wrote.
+- Live Kaggle / HF API calls — pure local rendering only.
+- Pixel-perfect cloning of the live pages. The bar is "a maintainer
+  clicking through it would notice the same broken link, malformed
+  YAML, or missing config that they'd notice on the live page".
diff --git a/pyproject.toml b/pyproject.toml
index 79c250c..cd31b4f 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -49,10 +49,14 @@ scripts = [
 # this extra (``pip install -e ".[publish]"``) enables the gated
 # ``load_dataset()`` / Kaggle-CLI smoke tests that verify G11.3 (Kaggle
 # package) and G12.3 / G12.4 (HF load_dataset round-trip) without
-# pulling the heavy SDKs into the default dev install.
+# pulling the heavy SDKs into the default dev install.  PR 7.2 adds
+# ``markdown-it-py`` for the local Kaggle / HF preview pages
+# (``scripts/preview_{kaggle,hf}_page.py``) — same publish-extra
+# posture, missing import raises a clean error pointing at this extra.
 publish = [
     "datasets>=2.14",
     "kaggle>=1.6",
+    "markdown-it-py>=3.0",
 ]
 # Optional dependencies for executing the public release notebooks.
 # Installing this extra (``pip install -e ".[notebooks]"``) enables the

From 81b49ad0afcefa82c2215057341bec4402e6ea95 Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Sat, 9 May 2026 09:53:10 +0300
Subject: [PATCH 2/6] PR 7.2: local Kaggle / HF preview pages + tests +
 committed sample HTML
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two scripts + tests + their first rendered-page output committed:

- scripts/preview_kaggle_page.py — reads release/kaggle/dataset-metadata.json
  + cover image, renders an offline HTML page mocking the public Kaggle
  view (header / cover / description / file tree / schema tables /
  sources / footer). Serves on localhost:8765 via stdlib
  ThreadingTCPServer; --no-serve / --open-browser flags.
- scripts/preview_hf_page.py — reads release/huggingface[-instructor]/README.md
  (YAML frontmatter + body), renders the analogous HF view (header pills /
  tag chips / configs dropdown / file tree / README body / footer).
  Serves on localhost:8766; --variant=public|instructor reads the
  matching companion README and writes to a variant-flavoured out_dir.
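
The CLI shape shared by both scripts can be sketched roughly as below.
This is an illustration, not the shipped code: the flag names, the 8765
default port and the 0 / 2 exit codes come from the design notes, while
the default paths and the body of main() are assumptions, and the
render and serve steps are omitted.

```python
from __future__ import annotations

import argparse
from dataclasses import dataclass
from pathlib import Path

DEFAULT_PORT = 8765  # Kaggle default per the design notes; HF uses 8766


@dataclass(frozen=True)
class Config:
    release_dir: Path
    out_dir: Path
    port: int
    open_browser: bool
    serve: bool


def parse_args(argv: list[str] | None = None) -> Config:
    parser = argparse.ArgumentParser(description="Render a local preview page.")
    # Default paths below are assumptions for this sketch.
    parser.add_argument("--release-dir", type=Path, default=Path("release/kaggle"))
    parser.add_argument("--out-dir", type=Path, default=Path("release/_preview/kaggle"))
    parser.add_argument("--port", type=int, default=DEFAULT_PORT)
    parser.add_argument("--open-browser", action="store_true")
    parser.add_argument("--no-serve", action="store_true")
    ns = parser.parse_args(argv)
    return Config(ns.release_dir, ns.out_dir, ns.port, ns.open_browser, not ns.no_serve)


def main(argv: list[str] | None = None) -> int:
    config = parse_args(argv)
    metadata = config.release_dir / "dataset-metadata.json"
    if not metadata.is_file():
        return 2  # pre-flight error: required artefact is missing
    # Rendering and (unless --no-serve) serving would happen here.
    return 0


cfg = parse_args(["--no-serve", "--port", "9000"])
missing_rc = main(["--release-dir", "/nonexistent/preview/dir", "--no-serve"])
```

Keeping Config frozen means the parsed settings cannot change between the
pre-flight check and the render step.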

Both renderers are pure: same input → byte-identical HTML (verified
two-pass against the real release artefacts). Output landing at
release/_preview/<platform>/ is gitignored; committed sample HTML at
release/_preview_committed/{kaggle,huggingface_public,huggingface_instructor}.html
is the audit-artefact-sync gate.
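
The purity claim can be illustrated with a toy renderer and a
byte-for-byte comparison. Everything in this sketch is hypothetical
(the function names, the "title" and "files" keys, the sorting step);
it only shows the shape of the check, not the real renderers.

```python
import html
import json


def render_page(metadata: dict) -> str:
    """Toy pure renderer: output depends only on the metadata argument."""
    title = html.escape(str(metadata.get("title", "")))
    # Sorting is an assumption of this sketch: it removes any dependence
    # on the order in which the input happens to list its files.
    rows = "\n".join(
        f"  <li>{html.escape(name)}</li>" for name in sorted(metadata.get("files", []))
    )
    return f"<h1>{title}</h1>\n<ul>\n{rows}\n</ul>\n"


def is_in_sync(metadata: dict, committed: bytes) -> bool:
    """True when a fresh render equals the committed bytes exactly."""
    return render_page(metadata).encode("utf-8") == committed


sample = {"title": "LeadForge <v1>", "files": ["b.parquet", "a.parquet"]}
first = render_page(sample).encode("utf-8")
# Second pass over a JSON round-trip of the same input.
second = render_page(json.loads(json.dumps(sample))).encode("utf-8")
```

Because no clock or random source is read, any difference between a
fresh render and the committed sample points at a change in the inputs
or in the renderer itself.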

Markdown rendering via markdown-it-py (gfm-like preset, linkify
disabled to avoid the linkify-it-py transitive dep). Missing dep
raises a clean ImportError pointing at pip install -e '.[publish]'.
pyproject.toml ruff per-file-ignores adds E501 for the two scripts —
inlined CSS strings in f-string templates are the product, not source
code that benefits from a 100c wrap.

48 new tests (no live HTTP, no network):
- required field labels (title / subtitle / licence / file count /
  schema column count for Kaggle; pretty_name / licence / configs /
  tags for HF)
- every Markdown link in the source resolves to a non-404 URL pattern
  (no ](../, no ](validation/, only allow-listed external prefixes
  + sibling-relative LICENSE + in-document anchors)
- every configs[] block in the HF YAML round-trips into the rendered
  dropdown
- every CSV / parquet column declared in the Kaggle metadata appears
  in the schema table
- byte-deterministic renderer + audit-sync against the committed sample
- pre-flight error paths (missing artefact, malformed JSON / YAML,
  unknown variant) return rc=2
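
The link check described above could look roughly like this. It is a
sketch under stated assumptions: the regex, the helper name bad_links
and the exact allow-list contents are illustrative, chosen to match the
rules listed in this message rather than copied from the tests.

```python
import re

# Capture the URL part of every Markdown link of the form ](URL).
LINK_RE = re.compile(r"\]\(([^)\s]+)\)")

# Assumed allow-list for this sketch, based on the domains named above.
ALLOWED_PREFIXES = (
    "https://github.com/leadforge-dev/leadforge",
    "https://huggingface.co",
)
ALLOWED_RELATIVE = {"LICENSE"}


def bad_links(markdown: str) -> list[str]:
    """Return every link target that matches none of the allowed patterns."""
    bad = []
    for url in LINK_RE.findall(markdown):
        if url.startswith("#"):  # in-document anchor
            continue
        if url.startswith(ALLOWED_PREFIXES):  # allow-listed external prefix
            continue
        if url in ALLOWED_RELATIVE:  # sibling-relative file in the upload tree
            continue
        bad.append(url)
    return bad


body = (
    "See [card](https://huggingface.co/datasets/x), [lic](LICENSE), "
    "[top](#intro), [old](../docs/a.md), [rep](validation/report.md)."
)
offenders = bad_links(body)
```

A test would then assert that the offenders list is empty for the README
body each renderer ingests.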

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 pyproject.toml                                |    5 +
 .../huggingface_instructor.html               |  284 ++++
 .../huggingface_public.html                   |  480 ++++++
 release/_preview_committed/kaggle.html        | 1303 +++++++++++++++++
 scripts/preview_hf_page.py                    |  572 ++++++++
 scripts/preview_kaggle_page.py                |  608 ++++++++
 tests/scripts/test_preview_hf_page.py         |  444 ++++++
 tests/scripts/test_preview_kaggle_page.py     |  416 ++++++
 8 files changed, 4112 insertions(+)
 create mode 100644 release/_preview_committed/huggingface_instructor.html
 create mode 100644 release/_preview_committed/huggingface_public.html
 create mode 100644 release/_preview_committed/kaggle.html
 create mode 100644 scripts/preview_hf_page.py
 create mode 100644 scripts/preview_kaggle_page.py
 create mode 100644 tests/scripts/test_preview_hf_page.py
 create mode 100644 tests/scripts/test_preview_kaggle_page.py

diff --git a/pyproject.toml b/pyproject.toml
index cd31b4f..90a742b 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -107,6 +107,11 @@ select = ["E", "F", "I", "N", "W", "UP", "B", "C4", "PT", "S"]
 # Line length is a property of the rendered cell, not the .py source,
 # so 100c is the wrong yardstick here.
 "scripts/build_release_notebook_*.py" = ["E501"]
+# Preview-page scripts (PR 7.2) carry inlined CSS + multi-attribute
+# HTML strings inside f-string templates; the rendered HTML is the
+# product, so wrapping the source CSS at 100c is line noise.
+"scripts/preview_kaggle_page.py" = ["E501"]
+"scripts/preview_hf_page.py" = ["E501"]
 
 [tool.mypy]
 python_version = "3.11"
diff --git a/release/_preview_committed/huggingface_instructor.html b/release/_preview_committed/huggingface_instructor.html
new file mode 100644
index 0000000..6296a4b
--- /dev/null
+++ b/release/_preview_committed/huggingface_instructor.html
@@ -0,0 +1,284 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <title>HF preview — LeadForge: Synthetic B2B Lead Scoring (v1) — Instructor companion</title>
+  <style>:root { --bg:#fff; --fg:#1f2937; --muted:#6b7280; --accent:#ff9d00; --border:#e5e7eb; --pill-bg:#f3f4f6; --code-bg:#f9fafb; }
+body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, sans-serif; color: var(--fg); background: var(--bg); margin: 0; padding: 0; line-height: 1.6; }
+.container { max-width: 1100px; margin: 0 auto; padding: 24px 32px; }
+.dataset-header { border-bottom: 1px solid var(--border); padding-bottom: 16px; margin-bottom: 24px; }
+.dataset-header__namespace { color: var(--muted); font-size: 0.85em; font-family: monospace; margin-bottom: 4px; }
+.dataset-header__title { font-size: 1.8em; margin: 0 0 12px 0; }
+.dataset-header__pills { list-style: none; padding: 0; margin: 0; display: flex; flex-wrap: wrap; gap: 8px; }
+.pill { background: var(--pill-bg); border-radius: 12px; padding: 4px 12px; font-size: 0.85em; color: var(--fg); }
+.tags { margin: 0 0 24px 0; }
+.chip { display: inline-block; background: var(--pill-bg); border-radius: 12px; padding: 2px 10px; margin: 2px 4px 2px 0; font-size: 0.85em; color: var(--fg); }
+.section__heading { font-size: 1.3em; border-bottom: 2px solid var(--accent); padding-bottom: 4px; margin-top: 32px; }
+.section__count { color: var(--muted); font-size: 0.7em; font-weight: normal; }
+.config, .file-tree { border: 1px solid var(--border); border-radius: 4px; padding: 8px 12px; margin: 8px 0; }
+.config__name { cursor: pointer; font-weight: 600; }
+.config__count { color: var(--muted); font-weight: normal; font-size: 0.85em; }
+.badge { display: inline-block; padding: 1px 8px; border-radius: 4px; font-size: 0.75em; font-weight: 600; vertical-align: middle; margin-left: 4px; }
+.badge--default { background: var(--accent); color: white; }
+.config__table { width: 100%; border-collapse: collapse; margin-top: 8px; font-size: 0.9em; }
+.config__table th, .config__table td { text-align: left; padding: 6px 8px; border-bottom: 1px solid var(--border); }
+.config__table th { background: var(--pill-bg); font-weight: 600; }
+.files__list { list-style: none; padding-left: 0; margin: 0; }
+.file { padding: 4px 0; border-bottom: 1px dotted var(--border); }
+.file:last-child { border-bottom: none; }
+.file__config { color: var(--muted); font-size: 0.85em; margin-right: 8px; }
+.file__path { color: var(--accent); }
+.readme { margin: 24px 0; }
+.readme code { background: var(--code-bg); padding: 1px 4px; border-radius: 2px; font-size: 0.9em; }
+.readme pre { background: var(--code-bg); padding: 12px; border-radius: 4px; overflow-x: auto; }
+.readme pre code { background: none; padding: 0; }
+.readme table { border-collapse: collapse; margin: 12px 0; }
+.readme th, .readme td { border: 1px solid var(--border); padding: 6px 10px; text-align: left; }
+.readme blockquote { border-left: 3px solid var(--accent); padding-left: 12px; color: var(--muted); margin: 12px 0; }
+.dataset-footer { margin-top: 48px; padding-top: 16px; border-top: 1px solid var(--border); color: var(--muted); font-size: 0.9em; }
+.dataset-footer__note { font-style: italic; margin-top: 8px; }
+</style>
+</head>
+<body>
+<main class="container">
+<header class="dataset-header">
+  <div class="dataset-header__namespace">huggingface.co/datasets</div>
+  <h1 class="dataset-header__title">LeadForge: Synthetic B2B Lead Scoring (v1) — Instructor companion</h1>
+  <ul class="dataset-header__pills">
+    <li class="pill pill--license">License: mit</li>
+    <li class="pill pill--task">Task: tabular-classification</li>
+    <li class="pill pill--size">Size: 1K&lt;n&lt;10K</li>
+    <li class="pill pill--language">Language: en</li>
+  </ul>
+</header>
+<section class="tags">
+  <span class="chip">b2b</span> <span class="chip">crm</span> <span class="chip">datasets</span> <span class="chip">lead-scoring</span> <span class="chip">pandas</span> <span class="chip">synthetic-data</span> <span class="chip">tabular</span>
+</section>
+<section class="configs">
+  <h2 class="section__heading">Configurations / Subsets <span class="section__count">(1 configs)</span></h2>
+  <details class="config" open>
+    <summary class="config__name"><code>intermediate</code> <span class="badge badge--default">default</span> <span class="config__count">(3 splits)</span></summary>
+    <table class="config__table">
+      <thead><tr><th>Split</th><th>Path</th></tr></thead>
+      <tbody>
+      <tr><td>train</td><td><code>intermediate/tasks/converted_within_90_days/train.parquet</code></td></tr>
+      <tr><td>validation</td><td><code>intermediate/tasks/converted_within_90_days/valid.parquet</code></td></tr>
+      <tr><td>test</td><td><code>intermediate/tasks/converted_within_90_days/test.parquet</code></td></tr>
+      </tbody>
+    </table>
+  </details>
+</section>
+<section class="files">
+  <h2 class="section__heading">Files declared in YAML <span class="section__count">(3 files / variant: instructor)</span></h2>
+  <ul class="files__list">
+    <li class="file"><span class="file__config">[intermediate]</span> <code class="file__path">intermediate/tasks/converted_within_90_days/train.parquet</code></li>
+    <li class="file"><span class="file__config">[intermediate]</span> <code class="file__path">intermediate/tasks/converted_within_90_days/valid.parquet</code></li>
+    <li class="file"><span class="file__config">[intermediate]</span> <code class="file__path">intermediate/tasks/converted_within_90_days/test.parquet</code></li>
+  </ul>
+</section>
+<section class="readme">
+<h1>LeadForge: Synthetic B2B Lead Scoring (v1) — Instructor companion</h1>
+<p>This is the <strong>research / instructor companion</strong> to the public
+<a href="https://huggingface.co/datasets/leadforge/leadforge-lead-scoring-v1"><code>leadforge/leadforge-lead-scoring-v1</code></a>
+dataset.  It exposes the <strong>full-horizon</strong> view of a single difficulty
+tier (<code>intermediate</code>) plus the <strong>hidden causal structure</strong> that the
+public dataset deliberately redacts: the world graph (DAG), latent
+trait registry, mechanism summary, and full-horizon relational tables
+including <code>customers</code> and <code>subscriptions</code>.</p>
+<p>It exists for instructors who want to walk students through how the
+public dataset was generated, and for researchers who want to verify
+that the public redactions actually remove the leakage paths the
+dataset advertises.  <strong>It is not a replacement for the public dataset
+in any teaching or modelling context</strong> — students should still train
+on the public bundle.</p>
+<h2>What this companion contains</h2>
+<pre><code>.
+├── intermediate/                     # research_instructor companion: full-horizon
+│   ├── manifest.json                 # provenance + file hashes
+│   ├── dataset_card.md               # auto-rendered per-bundle card
+│   ├── feature_dictionary.csv        # authoritative column spec
+│   ├── tables/*.parquet              # full-horizon tables (incl. customers, subscriptions)
+│   ├── tasks/converted_within_90_days/{train,valid,test}.parquet
+│   └── metadata/                     # world_spec, graph.{graphml,json}, latent_registry, etc.
+├── README.md                         # this file (HF dataset card)
+├── dataset-cover-image.png           # dataset thumbnail
+└── LICENSE
+</code></pre>
+<p>The single <code>intermediate</code> config exposes the same train/valid/test
+parquet splits as the public dataset's <code>intermediate</code> config — same
+seeds, same row counts (3,500 / 750 / 750), same target.  The
+difference lives in the relational tables and metadata:</p>
+<table>
+<thead>
+<tr>
+<th>File</th>
+<th>Public <code>intermediate</code></th>
+<th>Instructor companion</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><code>tables/leads.parquet</code></td>
+<td>redacted (label dropped)</td>
+<td>full (label retained)</td>
+</tr>
+<tr>
+<td><code>tables/opportunities.parquet</code></td>
+<td>snapshot-filtered + redacted</td>
+<td>full-horizon, full columns</td>
+</tr>
+<tr>
+<td><code>tables/customers.parquet</code></td>
+<td>omitted (would leak label)</td>
+<td>included</td>
+</tr>
+<tr>
+<td><code>tables/subscriptions.parquet</code></td>
+<td>omitted (would leak label)</td>
+<td>included</td>
+</tr>
+<tr>
+<td><code>tables/touches.parquet</code> etc.</td>
+<td>filtered to ≤ snapshot day</td>
+<td>full 90-day horizon</td>
+</tr>
+<tr>
+<td><code>metadata/world_spec.json</code></td>
+<td>absent</td>
+<td>included (DGP + recipe)</td>
+</tr>
+<tr>
+<td><code>metadata/graph.{graphml,json}</code></td>
+<td>absent</td>
+<td>included (hidden DAG)</td>
+</tr>
+<tr>
+<td><code>metadata/latent_registry.json</code></td>
+<td>absent</td>
+<td>included (latent traits)</td>
+</tr>
+<tr>
+<td><code>metadata/mechanism_summary.json</code></td>
+<td>absent</td>
+<td>included (per-edge mechanisms)</td>
+</tr>
+</tbody>
+</table>
+<p>The redaction contract is single-sourced in
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py"><code>leadforge/validation/leakage_probes.py</code></a>
+and re-applied by
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/leadforge/render/relational_snapshot_safe.py"><code>leadforge/render/relational_snapshot_safe.py</code></a>
+when the public bundle is built; this companion is the unfiltered
+source view, so the two are always consistent by construction.</p>
+<h2>Quick start</h2>
+<pre><code class="language-python">from datasets import load_dataset
+
+# Loads the same train/valid/test splits as the public 'intermediate'
+# config; differs only in what `tables/` and `metadata/` provide.
+ds = load_dataset(
+    &quot;leadforge/leadforge-lead-scoring-v1-instructor&quot;,
+    name=&quot;intermediate&quot;,
+)
+train = ds[&quot;train&quot;].to_pandas()
+
+# Full-horizon relational tables — includes customers and subscriptions
+# (omitted from the public dataset because their existence reconstructs
+# the conversion label).
+import pandas as pd
+customers = pd.read_parquet(
+    &quot;hf://datasets/leadforge/leadforge-lead-scoring-v1-instructor/intermediate/tables/customers.parquet&quot;
+)
+</code></pre>
+<h2>Intended uses</h2>
+<ul>
+<li>Teaching the <strong>public-vs-instructor split</strong> itself: load both
+datasets side-by-side, show students which columns and tables were
+redacted, and walk through why each was a leakage path.</li>
+<li><strong>Verifying the redaction contract:</strong> train a model on the
+full-horizon tables, train another on the snapshot-safe public
+tables, compare AUC.  The gap is the redaction's effect.</li>
+<li>Teaching <strong>causal structure and DGP transparency</strong> using
+<code>metadata/world_spec.json</code> + <code>metadata/graph.json</code>.</li>
+<li>Reproducing the public dataset from the instructor view via
+<a href="https://github.com/leadforge-dev/leadforge/blob/main"><code>leadforge</code></a> source code.</li>
+</ul>
+<h2>Out-of-scope uses</h2>
+<ul>
+<li><strong>Production lead scoring.</strong>  Same as the public dataset; the
+company, product, and customers are fictional.</li>
+<li><strong>Modelling with the unredacted view as a baseline.</strong>  Models
+trained against the full-horizon tables look strong because they're
+directly seeing post-conversion events.  That number is not a
+baseline; it's the ceiling.</li>
+<li><strong>Demographic / fairness research.</strong>  v1 does not model protected
+attributes.</li>
+</ul>
+<h2>Composition</h2>
+<ul>
+<li><strong>Entities.</strong>  9 relational tables (accounts, contacts, leads,
+touches, sessions, sales_activities, opportunities, customers,
+subscriptions); per-row counts in <code>manifest.json</code>.</li>
+<li><strong>Splits.</strong>  Identical to the public <code>intermediate</code> config: 70/15/15
+train/valid/test, deterministic given seed 42, recorded in
+<code>tasks/converted_within_90_days/task_manifest.json</code>.</li>
+<li><strong>Provenance.</strong>  Recipe <code>b2b_saas_procurement_v1</code>, seed 42, package
+version stamped in <code>manifest.json</code> along with SHA-256 hashes for
+every parquet file.</li>
+<li><strong>Bundle schema version.</strong>  5 (matches the public dataset).</li>
+</ul>
+<h2>Maintenance, license</h2>
+<p>We <em>want</em> the dataset to be broken.  See the
+<a href="https://huggingface.co/datasets/leadforge/leadforge-lead-scoring-v1">public dataset card</a>
+for the adversarial-framing pointers, the issue templates, and the
+break-me guide.  File issues at
+<a href="https://github.com/leadforge-dev/leadforge">leadforge-dev/leadforge</a>;
+PRs welcome.</p>
+<table>
+<thead>
+<tr>
+<th>Field</th>
+<th>Value</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Generator</td>
+<td>leadforge <code>1.0.0+</code></td>
+</tr>
+<tr>
+<td>Recipe</td>
+<td><code>b2b_saas_procurement_v1</code></td>
+</tr>
+<tr>
+<td>Canonical seed</td>
+<td>42</td>
+</tr>
+<tr>
+<td>Bundle schema version</td>
+<td>5</td>
+</tr>
+<tr>
+<td>Format</td>
+<td>Parquet (canonical)</td>
+</tr>
+<tr>
+<td>License</td>
+<td>MIT — see <a href="LICENSE">LICENSE</a></td>
+</tr>
+<tr>
+<td>Public dataset</td>
+<td><a href="https://huggingface.co/datasets/leadforge/leadforge-lead-scoring-v1">link</a></td>
+</tr>
+</tbody>
+</table>
+<p>Verify integrity with <code>leadforge validate &lt;bundle_dir&gt;</code>; every file is
+hashed in <code>manifest.json</code>.</p>
+</section>
+<footer class="dataset-footer">
+  <div class="dataset-footer__license">License: mit</div>
+  <div class="dataset-footer__variant">Variant: <code>instructor</code></div>
+  <div class="dataset-footer__note">Local Hugging Face preview rendered by scripts/preview_hf_page.py — not the live dataset page.</div>
+</footer>
+</main>
+</body>
+</html>
diff --git a/release/_preview_committed/huggingface_public.html b/release/_preview_committed/huggingface_public.html
new file mode 100644
index 0000000..3f9006a
--- /dev/null
+++ b/release/_preview_committed/huggingface_public.html
@@ -0,0 +1,480 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <title>HF preview — LeadForge: Synthetic B2B Lead Scoring (v1)</title>
+  <style>:root { --bg:#fff; --fg:#1f2937; --muted:#6b7280; --accent:#ff9d00; --border:#e5e7eb; --pill-bg:#f3f4f6; --code-bg:#f9fafb; }
+body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, sans-serif; color: var(--fg); background: var(--bg); margin: 0; padding: 0; line-height: 1.6; }
+.container { max-width: 1100px; margin: 0 auto; padding: 24px 32px; }
+.dataset-header { border-bottom: 1px solid var(--border); padding-bottom: 16px; margin-bottom: 24px; }
+.dataset-header__namespace { color: var(--muted); font-size: 0.85em; font-family: monospace; margin-bottom: 4px; }
+.dataset-header__title { font-size: 1.8em; margin: 0 0 12px 0; }
+.dataset-header__pills { list-style: none; padding: 0; margin: 0; display: flex; flex-wrap: wrap; gap: 8px; }
+.pill { background: var(--pill-bg); border-radius: 12px; padding: 4px 12px; font-size: 0.85em; color: var(--fg); }
+.tags { margin: 0 0 24px 0; }
+.chip { display: inline-block; background: var(--pill-bg); border-radius: 12px; padding: 2px 10px; margin: 2px 4px 2px 0; font-size: 0.85em; color: var(--fg); }
+.section__heading { font-size: 1.3em; border-bottom: 2px solid var(--accent); padding-bottom: 4px; margin-top: 32px; }
+.section__count { color: var(--muted); font-size: 0.7em; font-weight: normal; }
+.config, .file-tree { border: 1px solid var(--border); border-radius: 4px; padding: 8px 12px; margin: 8px 0; }
+.config__name { cursor: pointer; font-weight: 600; }
+.config__count { color: var(--muted); font-weight: normal; font-size: 0.85em; }
+.badge { display: inline-block; padding: 1px 8px; border-radius: 4px; font-size: 0.75em; font-weight: 600; vertical-align: middle; margin-left: 4px; }
+.badge--default { background: var(--accent); color: white; }
+.config__table { width: 100%; border-collapse: collapse; margin-top: 8px; font-size: 0.9em; }
+.config__table th, .config__table td { text-align: left; padding: 6px 8px; border-bottom: 1px solid var(--border); }
+.config__table th { background: var(--pill-bg); font-weight: 600; }
+.files__list { list-style: none; padding-left: 0; margin: 0; }
+.file { padding: 4px 0; border-bottom: 1px dotted var(--border); }
+.file:last-child { border-bottom: none; }
+.file__config { color: var(--muted); font-size: 0.85em; margin-right: 8px; }
+.file__path { color: var(--accent); }
+.readme { margin: 24px 0; }
+.readme code { background: var(--code-bg); padding: 1px 4px; border-radius: 2px; font-size: 0.9em; }
+.readme pre { background: var(--code-bg); padding: 12px; border-radius: 4px; overflow-x: auto; }
+.readme pre code { background: none; padding: 0; }
+.readme table { border-collapse: collapse; margin: 12px 0; }
+.readme th, .readme td { border: 1px solid var(--border); padding: 6px 10px; text-align: left; }
+.readme blockquote { border-left: 3px solid var(--accent); padding-left: 12px; color: var(--muted); margin: 12px 0; }
+.dataset-footer { margin-top: 48px; padding-top: 16px; border-top: 1px solid var(--border); color: var(--muted); font-size: 0.9em; }
+.dataset-footer__note { font-style: italic; margin-top: 8px; }
+</style>
+</head>
+<body>
+<main class="container">
+<header class="dataset-header">
+  <div class="dataset-header__namespace">huggingface.co/datasets</div>
+  <h1 class="dataset-header__title">LeadForge: Synthetic B2B Lead Scoring (v1)</h1>
+  <ul class="dataset-header__pills">
+    <li class="pill pill--license">License: mit</li>
+    <li class="pill pill--task">Task: tabular-classification</li>
+    <li class="pill pill--size">Size: 1K&lt;n&lt;10K</li>
+    <li class="pill pill--language">Language: en</li>
+  </ul>
+</header>
+<section class="tags">
+  <span class="chip">b2b</span> <span class="chip">crm</span> <span class="chip">datasets</span> <span class="chip">lead-scoring</span> <span class="chip">pandas</span> <span class="chip">synthetic-data</span> <span class="chip">tabular</span>
+</section>
+<section class="configs">
+  <h2 class="section__heading">Configurations / Subsets <span class="section__count">(3 configs)</span></h2>
+  <details class="config" open>
+    <summary class="config__name"><code>intro</code> <span class="config__count">(3 splits)</span></summary>
+    <table class="config__table">
+      <thead><tr><th>Split</th><th>Path</th></tr></thead>
+      <tbody>
+      <tr><td>train</td><td><code>intro/tasks/converted_within_90_days/train.parquet</code></td></tr>
+      <tr><td>validation</td><td><code>intro/tasks/converted_within_90_days/valid.parquet</code></td></tr>
+      <tr><td>test</td><td><code>intro/tasks/converted_within_90_days/test.parquet</code></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="config" open>
+    <summary class="config__name"><code>intermediate</code> <span class="badge badge--default">default</span> <span class="config__count">(3 splits)</span></summary>
+    <table class="config__table">
+      <thead><tr><th>Split</th><th>Path</th></tr></thead>
+      <tbody>
+      <tr><td>train</td><td><code>intermediate/tasks/converted_within_90_days/train.parquet</code></td></tr>
+      <tr><td>validation</td><td><code>intermediate/tasks/converted_within_90_days/valid.parquet</code></td></tr>
+      <tr><td>test</td><td><code>intermediate/tasks/converted_within_90_days/test.parquet</code></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="config" open>
+    <summary class="config__name"><code>advanced</code> <span class="config__count">(3 splits)</span></summary>
+    <table class="config__table">
+      <thead><tr><th>Split</th><th>Path</th></tr></thead>
+      <tbody>
+      <tr><td>train</td><td><code>advanced/tasks/converted_within_90_days/train.parquet</code></td></tr>
+      <tr><td>validation</td><td><code>advanced/tasks/converted_within_90_days/valid.parquet</code></td></tr>
+      <tr><td>test</td><td><code>advanced/tasks/converted_within_90_days/test.parquet</code></td></tr>
+      </tbody>
+    </table>
+  </details>
+</section>
+<section class="files">
+  <h2 class="section__heading">Files declared in YAML <span class="section__count">(9 files / variant: public)</span></h2>
+  <ul class="files__list">
+    <li class="file"><span class="file__config">[intro]</span> <code class="file__path">intro/tasks/converted_within_90_days/train.parquet</code></li>
+    <li class="file"><span class="file__config">[intro]</span> <code class="file__path">intro/tasks/converted_within_90_days/valid.parquet</code></li>
+    <li class="file"><span class="file__config">[intro]</span> <code class="file__path">intro/tasks/converted_within_90_days/test.parquet</code></li>
+    <li class="file"><span class="file__config">[intermediate]</span> <code class="file__path">intermediate/tasks/converted_within_90_days/train.parquet</code></li>
+    <li class="file"><span class="file__config">[intermediate]</span> <code class="file__path">intermediate/tasks/converted_within_90_days/valid.parquet</code></li>
+    <li class="file"><span class="file__config">[intermediate]</span> <code class="file__path">intermediate/tasks/converted_within_90_days/test.parquet</code></li>
+    <li class="file"><span class="file__config">[advanced]</span> <code class="file__path">advanced/tasks/converted_within_90_days/train.parquet</code></li>
+    <li class="file"><span class="file__config">[advanced]</span> <code class="file__path">advanced/tasks/converted_within_90_days/valid.parquet</code></li>
+    <li class="file"><span class="file__config">[advanced]</span> <code class="file__path">advanced/tasks/converted_within_90_days/test.parquet</code></li>
+  </ul>
+</section>
+<section class="readme">
+<h1>LeadForge: Synthetic B2B Lead Scoring Dataset (<code>leadforge-lead-scoring-v1</code>)</h1>
+<p>A relational, reproducible, three-tier synthetic CRM dataset family for
+teaching lead scoring at scale. Generated by
+<a href="https://github.com/leadforge-dev/leadforge">leadforge</a>, an
+open-source Python framework for synthetic CRM/funnel data. The
+framework version is decoupled from the dataset version: the package
+stays at <code>1.x</code>; the dataset is published under the explicit <code>…-v1</code>
+tag.</p>
+<h2>Why lead scoring matters in 2024–2026</h2>
+<p>Mid-market SaaS vendors entered 2024–2026 with growth slowing and
+customer-acquisition costs rising<sup>[1]</sup>, so predicting <em>which</em> leads
+convert within a fixed window has moved from a marketing nicety to a
+survival skill. This dataset teaches that skill on a relational
+substrate, with the realistic confusions (snapshot-window discipline,
+leakage traps, channel signal weaker than vendor blogs imply) that
+students will hit when they finally get their hands on real CRM data.</p>
+<p><sup>[1]</sup> Macroeconomic framing summarised in
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md"><code>docs/external_review/summaries/gemini_v2_summary.md</code></a>
+(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio
+rose materially in 2024).</p>
+<h2>What's inside</h2>
+<pre><code>.
+├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier
+│   ├── manifest.json                 # provenance + file hashes
+│   ├── dataset_card.md               # auto-rendered per-bundle card
+│   ├── feature_dictionary.csv        # authoritative column spec
+│   ├── lead_scoring.csv              # flat convenience CSV (all splits)
+│   ├── tables/*.parquet              # 7 snapshot-safe relational tables
+│   └── tasks/converted_within_90_days/{train,valid,test}.parquet
+├── README.md                         # this file (HF dataset card)
+├── dataset-cover-image.png           # dataset thumbnail
+└── LICENSE
+</code></pre>
+<p><code>student_public</code> bundles ship the snapshot-safe relational view;
+<code>research_instructor</code> companions ship the full-horizon view plus the
+hidden causal structure (DAG, latent registry, mechanism summary)
+under <code>metadata/</code>. The full layout is documented in each bundle's
+<code>manifest.json</code>.</p>
+<h2>Quick start</h2>
+<pre><code class="language-python">import pandas as pd
+df = pd.read_csv(&quot;intermediate/lead_scoring.csv&quot;)  # flat convenience CSV
+
+# Parquet task splits (recommended)
+train = pd.read_parquet(&quot;intermediate/tasks/converted_within_90_days/train.parquet&quot;)
+test  = pd.read_parquet(&quot;intermediate/tasks/converted_within_90_days/test.parquet&quot;)
+
+# Relational tables (feature engineering — example)
+leads   = pd.read_parquet(&quot;intermediate/tables/leads.parquet&quot;)
+touches = pd.read_parquet(&quot;intermediate/tables/touches.parquet&quot;)
+my_touch_count = (
+    touches.groupby(&quot;lead_id&quot;).size().rename(&quot;my_touch_count&quot;).reset_index()
+)
+features = leads.merge(my_touch_count, on=&quot;lead_id&quot;, how=&quot;left&quot;)
+
+# Reproduce from source
+# pip install leadforge
+# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \
+#                    --mode student_public --difficulty intermediate --out my_bundle
+</code></pre>
+<p>The label <code>converted_within_90_days</code> resolves over a 90-day window;
+engagement features (<code>touch_count</code>, <code>session_count</code>, etc.) are
+computed strictly over events on days <code>[0, 30]</code>. The deliberate
+exception is <code>total_touches_all</code>, the leakage trap — flagged
+<code>leakage_risk=True</code> in <code>feature_dictionary.csv</code>. Drop it from your
+feature set unless you're demonstrating leakage detection.</p>
+<h2>Dataset summary</h2>
+<table>
+<thead>
+<tr>
+<th></th>
+<th>Intro</th>
+<th>Intermediate</th>
+<th>Advanced</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Leads</td>
+<td>5,000</td>
+<td>5,000</td>
+<td>5,000</td>
+</tr>
+<tr>
+<td>Accounts</td>
+<td>1,500</td>
+<td>1,500</td>
+<td>1,500</td>
+</tr>
+<tr>
+<td>Contacts</td>
+<td>4,200</td>
+<td>4,200</td>
+<td>4,200</td>
+</tr>
+<tr>
+<td>Snapshot columns</td>
+<td>32 / 34*</td>
+<td>32 / 34*</td>
+<td>32 / 34*</td>
+</tr>
+<tr>
+<td>Target</td>
+<td><code>converted_within_90_days</code></td>
+<td><code>converted_within_90_days</code></td>
+<td><code>converted_within_90_days</code></td>
+</tr>
+<tr>
+<td>Conversion rate (acceptance band, gate G7.*)</td>
+<td>24–61%</td>
+<td>12–31%</td>
+<td>4–12%</td>
+</tr>
+<tr>
+<td>Conversion rate (observed median, seeds 42–46)</td>
+<td>42.67%</td>
+<td>21.60%</td>
+<td>8.40%</td>
+</tr>
+<tr>
+<td>Signal strength</td>
+<td>0.90</td>
+<td>0.70</td>
+<td>0.50</td>
+</tr>
+<tr>
+<td>Noise scale</td>
+<td>0.10</td>
+<td>0.30</td>
+<td>0.55</td>
+</tr>
+<tr>
+<td>Missing rate</td>
+<td>2%</td>
+<td>8%</td>
+<td>18%</td>
+</tr>
+</tbody>
+</table>
+<p>* Column counts are <code>student_public</code> / <code>research_instructor</code>. Difficulty is
+modulated by the simulation engine (signal strength on latent-trait weights,
+Gaussian noise on float features, MCAR missingness, outlier rate),
+not by post-hoc label flipping. The acceptance band is the recipe
+gate's tolerance window (<code>v1_acceptance_gates_bands.yaml</code> G7.*),
+not the achievable range: observed five-seed spreads sit
+comfortably inside the band.</p>
+<h2>The scenario</h2>
+<p><strong>Veridian Technologies</strong> is a fictional Series B startup (Austin, US)
+selling <strong>Veridian Procure</strong>, a procurement / AP automation SaaS, to
+mid-market firms (200–2,000 employees) in the US and UK. The funnel
+runs through inbound marketing (45%), SDR outbound (35%), and
+partner referrals (20%); four personas drive deals (VP Finance, AP
+Manager, IT Director, Procurement Manager). <strong>Task:</strong> predict whether
+a lead converts (<code>closed_won</code>) within 90 days. ACV bands are
+$18k–$120k. See
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md"><code>docs/release/generation_method.md</code></a>
+for the full data-generating process (DGP) and the deeper &quot;what's modelled /
+approximate / not modelled&quot; breakdown that this README only summarises.</p>
+<h2>Public vs instructor: what's redacted</h2>
+<p>Filtering happens <strong>during rendering</strong>, not during simulation. The
+redaction contract is single-sourced in
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py"><code>leadforge/validation/leakage_probes.py</code></a>;
+the snapshot-safe writer and the validator import the same constants,
+so they cannot drift apart.</p>
+<table>
+<thead>
+<tr>
+<th>Source-of-truth constant</th>
+<th>Public bundle treatment</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><code>BANNED_LEAD_COLUMNS = (&quot;converted_within_90_days&quot;, &quot;conversion_timestamp&quot;)</code></td>
+<td>Dropped from <code>tables/leads.parquet</code></td>
+</tr>
+<tr>
+<td><code>BANNED_OPP_COLUMNS = (&quot;close_outcome&quot;, &quot;closed_at&quot;)</code></td>
+<td>Dropped from <code>tables/opportunities.parquet</code></td>
+</tr>
+<tr>
+<td><code>BANNED_TABLES = (&quot;customers&quot;, &quot;subscriptions&quot;)</code></td>
+<td>Omitted from public bundles</td>
+</tr>
+<tr>
+<td><code>SNAPSHOT_FILTERED_TABLES</code> (touches, sessions, sales_activities, opportunities)</td>
+<td>Filtered per-lead by <code>lead_created_at + snapshot_day</code></td>
+</tr>
+<tr>
+<td>Snapshot redaction (<code>current_stage</code>, <code>is_sql</code>)</td>
+<td>Stripped from <code>tasks/</code> splits and <code>tables/leads.parquet</code></td>
+</tr>
+<tr>
+<td><code>total_touches_all</code> (deliberate trap)</td>
+<td><strong>Retained in both modes</strong>; flagged <code>leakage_risk=True</code></td>
+</tr>
+</tbody>
+</table>
+<p>Each bundle's <code>manifest.json</code> records <code>relational_snapshot_safe</code>,
+<code>redacted_columns</code>, and <code>snapshot_day</code>, so the bundle is
+self-describing.</p>
+<h2>Calibration</h2>
+<p>Every realism / calibration / difficulty claim in this README is
+backed by
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md"><code>validation/validation_report.md</code></a>,
+regenerated by
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py"><code>scripts/validate_release_candidate.py</code></a>
+with bands declared in
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml"><code>docs/release/v1_acceptance_gates_bands.yaml</code></a>.
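+Headline cross-seed medians (seeds 42–46; LR = logistic regression, AP = average precision, P@100 = precision among the 100 top-scored leads):</p>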
+<table>
+<thead>
+<tr>
+<th>Tier</th>
+<th>LR AUC</th>
+<th>AP</th>
+<th>P@100</th>
+<th>Brier</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>intro</td>
+<td>0.879</td>
+<td>0.761</td>
+<td>0.80</td>
+<td>0.130</td>
+</tr>
+<tr>
+<td>intermediate</td>
+<td>0.886</td>
+<td>0.575</td>
+<td>0.59</td>
+<td>0.110</td>
+</tr>
+<tr>
+<td>advanced</td>
+<td>0.886</td>
+<td>0.351</td>
+<td>0.34</td>
+<td>0.061</td>
+</tr>
+</tbody>
+</table>
+<p>AP, P@100, conversion-rate, and lift orderings hold across the
+intended difficulty axis (intro &gt; intermediate &gt; advanced).</p>
+<h2>Intended uses</h2>
+<ul>
+<li>Teaching baseline lead-scoring on a flat snapshot.</li>
+<li>Teaching relational feature engineering against snapshot-safe tables.</li>
+<li>Teaching leakage detection (the <code>total_touches_all</code> trap is
+designed to be discoverable).</li>
+<li>Teaching calibration, lift, P@K, value-aware ranking
+(<code>expected_acv × P(convert)</code>), and cohort-shift evaluation.</li>
+<li>Comparing model families under a controlled DGP.</li>
+</ul>
+<h2>Out-of-scope uses</h2>
+<ul>
+<li><strong>Production lead scoring.</strong> The company, product, and customers are
+fictional.</li>
+<li><strong>Vendor benchmarking / paper baselines.</strong> Difficulty tiers are
+calibrated for pedagogy, not cross-paper comparability.</li>
+<li><strong>Causal-inference research that requires recovery of the true DGP.</strong>
+The instructor companion exposes the hidden graph for teaching; it does not
+provide designed counterfactuals.</li>
+<li><strong>Demographic / fairness research.</strong> v1 does not model protected
+attributes.</li>
+</ul>
+<h2>Known limitations</h2>
+<ul>
+<li><strong>Difficulty signal on raw AUC is flat.</strong> LR AUC is ~0.88 across
+every tier. Difficulty is visible in AP, P@K, Brier, and value
+capture. Treat AUC as a sanity check, not a difficulty signal.</li>
+<li><strong>GBM does not consistently beat LR (gate G7.4.4).</strong> GBM−LR AUC delta
+is slightly negative in every tier (intro −0.0045, intermediate
+−0.0072, advanced −0.0133); v1's snapshot is dominated by linear
+features. v2 will inject non-linear interactions in the simulator.</li>
+<li><strong>Channel signal is weak.</strong> Per
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md"><code>docs/release/channel_signal_audit.md</code></a>,
+out-of-sample univariate AUC of <code>lead_source</code> is ≈0.50–0.52 across
+all tiers and the per-channel rate spread is ≤0.05. The simulator
+does not encode channel-conditional probabilities; channel-conditional
+encoding is post-v1 work.</li>
+<li><strong>Cohort-shift degradation is small.</strong> v1 has no time-of-year drift
+baked in; the cohort-shift gate (G6.4) is informational and will
+bite in v2.</li>
+</ul>
+<h2>Composition</h2>
+<ul>
+<li><strong>Entities.</strong> Accounts, contacts, leads, touches, sessions,
+sales_activities, opportunities (public); plus customers and
+subscriptions (instructor only). Per-table row counts for each bundle live in
+<code>manifest.json</code>.</li>
+<li><strong>Features.</strong> 32 public columns grouped by analytical role in
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md"><code>docs/release/feature_dictionary.md</code></a>;
+the per-bundle <code>feature_dictionary.csv</code> is the authoritative
+machine-readable spec.</li>
+<li><strong>Label.</strong> <code>converted_within_90_days</code> (boolean), derived from simulated
+events and never sampled directly.</li>
+<li><strong>Splits.</strong> 70/15/15 train/valid/test, deterministic given seed;
+recorded in <code>tasks/converted_within_90_days/task_manifest.json</code>.
+<strong>Group-leakage warning:</strong> the splitter is keyed on <code>lead_id</code> only,
+not on <code>account_id</code> or <code>contact_id</code>. On the as-shipped intermediate
+bundle, <strong>518 of 557 test accounts (≈93%) also appear in train</strong>;
+the contact-level overlap is similar in magnitude. A flat baseline
+trained on the random split rides account-level signal across the
+split boundary. For a generalisation-faithful number, retrain with <code>GroupKFold</code>,
+passing <code>account_id</code> (or <code>contact_id</code>) as the groups, and report both; see
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md"><code>break_me_guide.md</code></a> §5 for the
+detection recipe.</li>
+<li><strong>Provenance.</strong> Recipe <code>b2b_saas_procurement_v1</code>, seed 42, package
+version stamped in <code>manifest.json</code>.</li>
+</ul>
+<h2>Maintenance, adversarial framing, license</h2>
+<p>We <em>want</em> the dataset to be broken. The
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md">break-me guide</a> catalogues
+nine adversarial patterns to look for (including leakage, split
+contamination, ranking inversions, calibration drift) with
+worked-example pointers back into the notebooks. Issue
+templates ship under <code>.github/ISSUE_TEMPLATE/</code>: a
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml">breakage report</a>
+form for findings on the bundle itself, and a
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml">realism feedback</a>
+form for distributional critiques. Accepted findings are
+logged in
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md"><code>docs/release/v2_decision_log.md</code></a>.
+File issues at
+<a href="https://github.com/leadforge-dev/leadforge">leadforge-dev/leadforge</a>;
+PRs welcome.</p>
+<table>
+<thead>
+<tr>
+<th>Field</th>
+<th>Value</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Generator</td>
+<td>leadforge <code>1.0.0+</code></td>
+</tr>
+<tr>
+<td>Recipe</td>
+<td><code>b2b_saas_procurement_v1</code></td>
+</tr>
+<tr>
+<td>Canonical seed</td>
+<td>42 (cross-seed sweep: 42–46)</td>
+</tr>
+<tr>
+<td>Bundle schema version</td>
+<td>5</td>
+</tr>
+<tr>
+<td>Format</td>
+<td>Parquet (canonical) + CSV (convenience)</td>
+</tr>
+<tr>
+<td>License</td>
+<td>MIT — see <a href="LICENSE">LICENSE</a></td>
+</tr>
+</tbody>
+</table>
+<p>Verify integrity with <code>leadforge validate &lt;bundle_dir&gt;</code>; every file
+is hashed in <code>manifest.json</code>.</p>
+</section>
+<footer class="dataset-footer">
+  <div class="dataset-footer__license">License: mit</div>
+  <div class="dataset-footer__variant">Variant: <code>public</code></div>
+  <div class="dataset-footer__note">Local Hugging Face preview rendered by scripts/preview_hf_page.py — not the live dataset page.</div>
+</footer>
+</main>
+</body>
+</html>
diff --git a/release/_preview_committed/kaggle.html b/release/_preview_committed/kaggle.html
new file mode 100644
index 0000000..4c61444
--- /dev/null
+++ b/release/_preview_committed/kaggle.html
@@ -0,0 +1,1303 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <title>Kaggle preview — LeadForge: Synthetic B2B Lead Scoring (v1)</title>
+  <style>:root { --bg:#fff; --fg:#202124; --muted:#5f6368; --accent:#20beff; --border:#e0e0e0; --pill-bg:#f1f3f4; }
+body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif; color: var(--fg); background: var(--bg); margin: 0; padding: 0; line-height: 1.5; }
+.container { max-width: 1100px; margin: 0 auto; padding: 24px 32px; }
+.dataset-header { border-bottom: 1px solid var(--border); padding-bottom: 16px; margin-bottom: 24px; }
+.dataset-header__id { color: var(--muted); font-size: 0.85em; font-family: monospace; margin-bottom: 4px; }
+.dataset-header__title { font-size: 1.8em; margin: 0 0 4px 0; }
+.dataset-header__subtitle { color: var(--muted); margin: 0 0 12px 0; }
+.dataset-header__pills { list-style: none; padding: 0; margin: 0; display: flex; flex-wrap: wrap; gap: 8px; }
+.pill { background: var(--pill-bg); border-radius: 12px; padding: 4px 12px; font-size: 0.85em; color: var(--fg); }
+.cover { margin: 0 0 24px 0; border: 1px solid var(--border); border-radius: 4px; overflow: hidden; }
+.cover__image { display: block; max-width: 100%; height: auto; }
+.section__heading { font-size: 1.3em; border-bottom: 2px solid var(--accent); padding-bottom: 4px; margin-top: 32px; }
+.section__count { color: var(--muted); font-size: 0.7em; font-weight: normal; }
+.tier, .schema { border: 1px solid var(--border); border-radius: 4px; padding: 8px 12px; margin: 8px 0; }
+.tier__name, .schema__path { cursor: pointer; font-weight: 600; }
+.tier__count, .schema__count { color: var(--muted); font-weight: normal; font-size: 0.85em; }
+.tier__files { list-style: none; padding: 8px 0 0 0; margin: 0; }
+.file { display: flex; gap: 12px; padding: 4px 0; border-bottom: 1px dotted var(--border); }
+.file:last-child { border-bottom: none; }
+.file__path { color: var(--accent); flex-shrink: 0; }
+.file__desc { color: var(--muted); font-size: 0.9em; }
+.schema__table { width: 100%; border-collapse: collapse; margin-top: 8px; font-size: 0.9em; }
+.schema__table th, .schema__table td { text-align: left; padding: 6px 8px; border-bottom: 1px solid var(--border); vertical-align: top; }
+.schema__table th { background: var(--pill-bg); font-weight: 600; }
+.col__name code { background: none; }
+.col__type { color: var(--muted); font-family: monospace; }
+.description { margin: 24px 0; }
+.description code { background: var(--pill-bg); padding: 1px 4px; border-radius: 2px; font-size: 0.9em; }
+.description pre { background: var(--pill-bg); padding: 12px; border-radius: 4px; overflow-x: auto; }
+.description pre code { background: none; padding: 0; }
+.description table { border-collapse: collapse; margin: 12px 0; }
+.description th, .description td { border: 1px solid var(--border); padding: 6px 10px; text-align: left; }
+.description blockquote { border-left: 3px solid var(--accent); padding-left: 12px; color: var(--muted); margin: 12px 0; }
+.sources__list { padding-left: 20px; }
+.dataset-footer { margin-top: 48px; padding-top: 16px; border-top: 1px solid var(--border); color: var(--muted); font-size: 0.9em; }
+.dataset-footer__keywords { margin-bottom: 8px; }
+.chip { display: inline-block; background: var(--pill-bg); border-radius: 12px; padding: 2px 10px; margin: 2px; font-size: 0.85em; }
+.dataset-footer__note { font-style: italic; margin-top: 8px; }
+</style>
+</head>
+<body>
+<main class="container">
+<header class="dataset-header">
+  <div class="dataset-header__id">leadforge/leadforge-lead-scoring-v1</div>
+  <h1 class="dataset-header__title">LeadForge: Synthetic B2B Lead Scoring (v1)</h1>
+  <p class="dataset-header__subtitle">Three-tier synthetic CRM funnel for leakage-aware lead scoring</p>
+  <ul class="dataset-header__pills">
+    <li class="pill pill--license">License: MIT</li>
+    <li class="pill pill--frequency">Updates: never</li>
+    <li class="pill pill--visibility">Visibility: Private</li>
+  </ul>
+</header>
+<section class="cover">
+  <img class="cover__image" src="dataset-cover-image.png" alt="Dataset cover image">
+</section>
+<section class="description">
+<h1>LeadForge: Synthetic B2B Lead Scoring Dataset (<code>leadforge-lead-scoring-v1</code>)</h1>
+<p>A relational, reproducible, three-tier synthetic CRM dataset family for
+teaching lead scoring at scale. Generated by
+<a href="https://github.com/leadforge-dev/leadforge">leadforge</a>, an
+open-source Python framework for synthetic CRM/funnel data. The
+framework version is decoupled from the dataset version: the package
+stays at <code>1.x</code>; the dataset is published under the explicit <code>…-v1</code>
+tag.</p>
+<h2>Why lead scoring matters in 2024–2026</h2>
+<p>Mid-market SaaS vendors entered 2024–2026 with growth slowing and
+customer-acquisition costs rising<sup>[1]</sup>, so predicting <em>which</em> leads
+convert within a fixed window has moved from a marketing nicety to a
+survival skill. This dataset teaches that skill on a relational
+substrate, with the realistic confusions (snapshot-window discipline,
+leakage traps, channel signal weaker than vendor blogs imply) that
+students will hit when they finally get their hands on real CRM data.</p>
+<p><sup>[1]</sup> Macroeconomic framing summarised in
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md"><code>docs/external_review/summaries/gemini_v2_summary.md</code></a>
+(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio
+rose materially in 2024).</p>
+<h2>What's inside</h2>
+<pre><code>.
+├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier
+│   ├── manifest.json                 # provenance + file hashes
+│   ├── dataset_card.md               # auto-rendered per-bundle card
+│   ├── feature_dictionary.csv        # authoritative column spec
+│   ├── lead_scoring.csv              # flat convenience CSV (all splits)
+│   ├── tables/*.parquet              # 7 snapshot-safe relational tables
+│   └── tasks/converted_within_90_days/{train,valid,test}.parquet
+├── dataset-metadata.json             # Kaggle dataset metadata
+├── dataset-cover-image.png           # Kaggle cover image
+├── README.md                         # Kaggle package README
+└── LICENSE
+</code></pre>
+<p><code>student_public</code> bundles ship the snapshot-safe relational view;
+<code>research_instructor</code> companions ship the full-horizon view plus the
+hidden causal structure (DAG, latent registry, mechanism summary)
+under <code>metadata/</code>. The full layout is documented in each bundle's
+<code>manifest.json</code>.</p>
+<h2>Quick start</h2>
+<pre><code class="language-python">import pandas as pd
+df = pd.read_csv(&quot;intermediate/lead_scoring.csv&quot;)  # flat convenience CSV
+
+# Parquet task splits (recommended)
+train = pd.read_parquet(&quot;intermediate/tasks/converted_within_90_days/train.parquet&quot;)
+test  = pd.read_parquet(&quot;intermediate/tasks/converted_within_90_days/test.parquet&quot;)
+
+# Relational tables (feature engineering — example)
+leads   = pd.read_parquet(&quot;intermediate/tables/leads.parquet&quot;)
+touches = pd.read_parquet(&quot;intermediate/tables/touches.parquet&quot;)
+my_touch_count = (
+    touches.groupby(&quot;lead_id&quot;).size().rename(&quot;my_touch_count&quot;).reset_index()
+)
+features = leads.merge(my_touch_count, on=&quot;lead_id&quot;, how=&quot;left&quot;)
+
+# Reproduce from source
+# pip install leadforge
+# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \
+#                    --mode student_public --difficulty intermediate --out my_bundle
+</code></pre>
+<p>The label <code>converted_within_90_days</code> resolves over a 90-day window;
+engagement features (<code>touch_count</code>, <code>session_count</code>, etc.) are
+computed strictly over events on days <code>[0, 30]</code>. The deliberate
+exception is <code>total_touches_all</code>, the leakage trap — flagged
+<code>leakage_risk=True</code> in <code>feature_dictionary.csv</code>. Drop it from your
+feature set unless you're demonstrating leakage detection.</p>
+<h2>Dataset summary</h2>
+<table>
+<thead>
+<tr>
+<th></th>
+<th>Intro</th>
+<th>Intermediate</th>
+<th>Advanced</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Leads</td>
+<td>5,000</td>
+<td>5,000</td>
+<td>5,000</td>
+</tr>
+<tr>
+<td>Accounts</td>
+<td>1,500</td>
+<td>1,500</td>
+<td>1,500</td>
+</tr>
+<tr>
+<td>Contacts</td>
+<td>4,200</td>
+<td>4,200</td>
+<td>4,200</td>
+</tr>
+<tr>
+<td>Snapshot columns</td>
+<td>32 / 34*</td>
+<td>32 / 34*</td>
+<td>32 / 34*</td>
+</tr>
+<tr>
+<td>Target</td>
+<td><code>converted_within_90_days</code></td>
+<td><code>converted_within_90_days</code></td>
+<td><code>converted_within_90_days</code></td>
+</tr>
+<tr>
+<td>Conversion rate (acceptance band, gate G7.*)</td>
+<td>24–61%</td>
+<td>12–31%</td>
+<td>4–12%</td>
+</tr>
+<tr>
+<td>Conversion rate (observed median, seeds 42–46)</td>
+<td>42.67%</td>
+<td>21.60%</td>
+<td>8.40%</td>
+</tr>
+<tr>
+<td>Signal strength</td>
+<td>0.90</td>
+<td>0.70</td>
+<td>0.50</td>
+</tr>
+<tr>
+<td>Noise scale</td>
+<td>0.10</td>
+<td>0.30</td>
+<td>0.55</td>
+</tr>
+<tr>
+<td>Missing rate</td>
+<td>2%</td>
+<td>8%</td>
+<td>18%</td>
+</tr>
+</tbody>
+</table>
+<p>* Column counts are <code>student_public</code> / <code>research_instructor</code>. Difficulty is
+modulated by the simulation engine (signal strength on latent-trait weights,
+Gaussian noise on float features, MCAR missingness, outlier rate),
+not by post-hoc label flipping. The acceptance band is the recipe
+gate's tolerance window (<code>v1_acceptance_gates_bands.yaml</code> G7.*),
+not the achievable range: observed five-seed spreads sit
+comfortably inside the band.</p>
+<h2>The scenario</h2>
+<p><strong>Veridian Technologies</strong> is a fictional Series B startup (Austin, US)
+selling <strong>Veridian Procure</strong>, a procurement / AP automation SaaS, to
+mid-market firms (200–2,000 employees) in the US and UK. The funnel
+runs through inbound marketing (45%), SDR outbound (35%), and
+partner referrals (20%); four personas drive deals (VP Finance, AP
+Manager, IT Director, Procurement Manager). <strong>Task:</strong> predict whether
+a lead converts (<code>closed_won</code>) within 90 days. ACV bands are
+$18k–$120k. See
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md"><code>docs/release/generation_method.md</code></a>
+for the full data-generating process (DGP) and the deeper &quot;what's modelled /
+approximate / not modelled&quot; breakdown that this README only summarises.</p>
+<h2>Public vs instructor: what's redacted</h2>
+<p>Filtering happens <strong>during rendering</strong>, not during simulation. The
+redaction contract is single-sourced in
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py"><code>leadforge/validation/leakage_probes.py</code></a>;
+the snapshot-safe writer and the validator import the same constants,
+so they cannot drift apart.</p>
+<table>
+<thead>
+<tr>
+<th>Source-of-truth constant</th>
+<th>Public bundle treatment</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><code>BANNED_LEAD_COLUMNS = (&quot;converted_within_90_days&quot;, &quot;conversion_timestamp&quot;)</code></td>
+<td>Dropped from <code>tables/leads.parquet</code></td>
+</tr>
+<tr>
+<td><code>BANNED_OPP_COLUMNS = (&quot;close_outcome&quot;, &quot;closed_at&quot;)</code></td>
+<td>Dropped from <code>tables/opportunities.parquet</code></td>
+</tr>
+<tr>
+<td><code>BANNED_TABLES = (&quot;customers&quot;, &quot;subscriptions&quot;)</code></td>
+<td>Omitted from public bundles</td>
+</tr>
+<tr>
+<td><code>SNAPSHOT_FILTERED_TABLES</code> (touches, sessions, sales_activities, opportunities)</td>
+<td>Filtered per-lead by <code>lead_created_at + snapshot_day</code></td>
+</tr>
+<tr>
+<td>Snapshot redaction (<code>current_stage</code>, <code>is_sql</code>)</td>
+<td>Stripped from <code>tasks/</code> splits and <code>tables/leads.parquet</code></td>
+</tr>
+<tr>
+<td><code>total_touches_all</code> (deliberate trap)</td>
+<td><strong>Retained in both modes</strong>; flagged <code>leakage_risk=True</code></td>
+</tr>
+</tbody>
+</table>
+<p>Each bundle's <code>manifest.json</code> records <code>relational_snapshot_safe</code>,
+<code>redacted_columns</code>, and <code>snapshot_day</code>, so the bundle is
+self-describing.</p>
+<h2>Calibration</h2>
+<p>Every realism / calibration / difficulty claim in this README is
+backed by
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md"><code>validation/validation_report.md</code></a>,
+regenerated by
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py"><code>scripts/validate_release_candidate.py</code></a>
+with bands declared in
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml"><code>docs/release/v1_acceptance_gates_bands.yaml</code></a>.
+Headline cross-seed medians (seeds 42–46):</p>
+<table>
+<thead>
+<tr>
+<th>Tier</th>
+<th>LR AUC</th>
+<th>AP</th>
+<th>P@100</th>
+<th>Brier</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>intro</td>
+<td>0.879</td>
+<td>0.761</td>
+<td>0.80</td>
+<td>0.130</td>
+</tr>
+<tr>
+<td>intermediate</td>
+<td>0.886</td>
+<td>0.575</td>
+<td>0.59</td>
+<td>0.110</td>
+</tr>
+<tr>
+<td>advanced</td>
+<td>0.886</td>
+<td>0.351</td>
+<td>0.34</td>
+<td>0.061</td>
+</tr>
+</tbody>
+</table>
+<p>AP, P@100, conversion-rate, and lift orderings hold across the
+intended difficulty axis (intro &gt; intermediate &gt; advanced).</p>
+<h2>Intended uses</h2>
+<ul>
+<li>Teaching baseline lead-scoring on a flat snapshot.</li>
+<li>Teaching relational feature engineering against snapshot-safe tables.</li>
+<li>Teaching leakage detection (the <code>total_touches_all</code> trap is
+designed to be discoverable).</li>
+<li>Teaching calibration, lift, P@K, value-aware ranking
+(<code>expected_acv × P(convert)</code>), and cohort-shift evaluation.</li>
+<li>Comparing model families under a controlled DGP.</li>
+</ul>
+<h2>Out-of-scope uses</h2>
+<ul>
+<li><strong>Production lead scoring.</strong> The company, product, and customers are
+fictional.</li>
+<li><strong>Vendor benchmarking / paper baselines.</strong> Difficulty tiers are
+calibrated for pedagogy, not cross-paper comparability.</li>
+<li><strong>Causal-inference research that requires recovery of the true DGP.</strong>
+The instructor companion exposes the hidden graph for teaching; it does
+not provide designed counterfactuals.</li>
+<li><strong>Demographic / fairness research.</strong> v1 does not model protected
+attributes.</li>
+</ul>
+<h2>Known limitations</h2>
+<ul>
+<li><strong>Difficulty signal on raw AUC is flat.</strong> LR AUC is ~0.88 across
+every tier. Difficulty is visible in AP, P@K, Brier, and value
+capture. Treat AUC as a sanity check, not a difficulty signal.</li>
+<li><strong>GBM does not consistently beat LR (gate G7.4.4).</strong> GBM−LR AUC delta
+is slightly negative in every tier (intro −0.0045, intermediate
+−0.0072, advanced −0.0133); v1's snapshot is dominated by linear
+features. v2 will inject non-linear interactions in the simulator.</li>
+<li><strong>Channel signal is weak.</strong> Per
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md"><code>docs/release/channel_signal_audit.md</code></a>,
+out-of-sample univariate AUC of <code>lead_source</code> is ≈0.50–0.52 across
+all tiers and the per-channel rate spread is ≤0.05. The simulator
+does not encode channel-conditional probabilities; channel-conditional
+encoding is post-v1 work.</li>
+<li><strong>Cohort-shift degradation is small.</strong> v1 has no time-of-year drift
+baked in; the cohort-shift gate (G6.4) is informational in v1 and is
+expected to become binding in v2.</li>
+</ul>
+<h2>Composition</h2>
+<ul>
+<li><strong>Entities.</strong> Accounts, contacts, leads, touches, sessions,
+sales_activities, opportunities (public); plus customers and
+subscriptions (instructor only). Per-table row counts for each bundle live in
+<code>manifest.json</code>.</li>
+<li><strong>Features.</strong> 32 public columns grouped by analytical role in
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md"><code>docs/release/feature_dictionary.md</code></a>;
+the per-bundle <code>feature_dictionary.csv</code> is the authoritative
+machine-readable spec.</li>
+<li><strong>Label.</strong> <code>converted_within_90_days</code> (boolean), event-derived from
+the simulator. Never sampled directly.</li>
+<li><strong>Splits.</strong> 70/15/15 train/valid/test, deterministic given seed;
+recorded in <code>tasks/converted_within_90_days/task_manifest.json</code>.
+<strong>Group-leakage warning:</strong> the splitter is keyed on <code>lead_id</code> only,
+not on <code>account_id</code> or <code>contact_id</code>. On the as-shipped intermediate
+bundle, <strong>518 of 557 test accounts (≈93 %) also appear in train</strong>;
+the contact-level overlap is similar in magnitude. A flat baseline
+trained on the random split rides account-level signal across the
+split boundary. For a generalisation-faithful number, retrain with
+<code>GroupKFold</code> grouped on <code>account_id</code> (or <code>contact_id</code>) and report
+both the random-split and the grouped scores; see
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md"><code>break_me_guide.md</code></a> §5 for the
+detection recipe.</li>
+<li><strong>Provenance.</strong> Recipe <code>b2b_saas_procurement_v1</code>, seed 42, package
+version stamped in <code>manifest.json</code>.</li>
+</ul>
+<h2>Maintenance, adversarial framing, license</h2>
+<p>We <em>want</em> the dataset to be broken. The
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md">break-me guide</a> catalogues
+nine adversarial patterns to look for (leakage, split
+contamination, ranking inversions, calibration drift) with
+worked-example pointers back into the notebooks. Issue
+templates ship under <code>.github/ISSUE_TEMPLATE/</code>: a
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml">breakage report</a>
+form for findings on the bundle itself, and a
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml">realism feedback</a>
+form for distributional critiques. Accepted findings are
+logged in
+<a href="https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md"><code>docs/release/v2_decision_log.md</code></a>.
+File issues at
+<a href="https://github.com/leadforge-dev/leadforge">leadforge-dev/leadforge</a>;
+PRs welcome.</p>
+<table>
+<thead>
+<tr>
+<th>Field</th>
+<th>Value</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Generator</td>
+<td>leadforge <code>1.0.0+</code></td>
+</tr>
+<tr>
+<td>Recipe</td>
+<td><code>b2b_saas_procurement_v1</code></td>
+</tr>
+<tr>
+<td>Canonical seed</td>
+<td>42 (cross-seed sweep: 42–46)</td>
+</tr>
+<tr>
+<td>Bundle schema version</td>
+<td>5</td>
+</tr>
+<tr>
+<td>Format</td>
+<td>Parquet (canonical) + CSV (convenience)</td>
+</tr>
+<tr>
+<td>License</td>
+<td>MIT — see <a href="LICENSE">LICENSE</a></td>
+</tr>
+</tbody>
+</table>
+<p>Verify integrity with <code>leadforge validate &lt;bundle_dir&gt;</code>; every file
+is hashed in <code>manifest.json</code>.</p>
+</section>
+<section class="files">
+  <h2 class="section__heading">Data Files <span class="section__count">(42 total)</span></h2>
+  <details class="tier" open>
+    <summary class="tier__name">intro/ <span class="tier__count">(14 files)</span></summary>
+    <ul class="tier__files">
+    <li class="file"><code class="file__path">intro/lead_scoring.csv</code><span class="file__desc">Intro tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.</span></li>
+    <li class="file"><code class="file__path">intro/feature_dictionary.csv</code><span class="file__desc">Intro tier feature dictionary (canonical column spec).</span></li>
+    <li class="file"><code class="file__path">intro/tasks/converted_within_90_days/train.parquet</code><span class="file__desc">Intro tier train split for `converted_within_90_days` (3,500 rows).</span></li>
+    <li class="file"><code class="file__path">intro/tasks/converted_within_90_days/valid.parquet</code><span class="file__desc">Intro tier valid split for `converted_within_90_days` (750 rows).</span></li>
+    <li class="file"><code class="file__path">intro/tasks/converted_within_90_days/test.parquet</code><span class="file__desc">Intro tier test split for `converted_within_90_days` (750 rows).</span></li>
+    <li class="file"><code class="file__path">intro/tables/accounts.parquet</code><span class="file__desc">Intro tier `accounts` relational table (1,500 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intro/tables/contacts.parquet</code><span class="file__desc">Intro tier `contacts` relational table (4,200 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intro/tables/leads.parquet</code><span class="file__desc">Intro tier `leads` relational table (5,000 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intro/tables/touches.parquet</code><span class="file__desc">Intro tier `touches` relational table (38,561 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intro/tables/sessions.parquet</code><span class="file__desc">Intro tier `sessions` relational table (10,171 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intro/tables/sales_activities.parquet</code><span class="file__desc">Intro tier `sales_activities` relational table (21,358 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intro/tables/opportunities.parquet</code><span class="file__desc">Intro tier `opportunities` relational table (4,426 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intro/dataset_card.md</code><span class="file__desc">Intro tier auto-rendered dataset card.</span></li>
+    <li class="file"><code class="file__path">intro/manifest.json</code><span class="file__desc">Intro tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).</span></li>
+    </ul>
+  </details>
+  <details class="tier" open>
+    <summary class="tier__name">intermediate/ <span class="tier__count">(14 files)</span></summary>
+    <ul class="tier__files">
+    <li class="file"><code class="file__path">intermediate/lead_scoring.csv</code><span class="file__desc">Intermediate tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.</span></li>
+    <li class="file"><code class="file__path">intermediate/feature_dictionary.csv</code><span class="file__desc">Intermediate tier feature dictionary (canonical column spec).</span></li>
+    <li class="file"><code class="file__path">intermediate/tasks/converted_within_90_days/train.parquet</code><span class="file__desc">Intermediate tier train split for `converted_within_90_days` (3,500 rows).</span></li>
+    <li class="file"><code class="file__path">intermediate/tasks/converted_within_90_days/valid.parquet</code><span class="file__desc">Intermediate tier valid split for `converted_within_90_days` (750 rows).</span></li>
+    <li class="file"><code class="file__path">intermediate/tasks/converted_within_90_days/test.parquet</code><span class="file__desc">Intermediate tier test split for `converted_within_90_days` (750 rows).</span></li>
+    <li class="file"><code class="file__path">intermediate/tables/accounts.parquet</code><span class="file__desc">Intermediate tier `accounts` relational table (1,500 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intermediate/tables/contacts.parquet</code><span class="file__desc">Intermediate tier `contacts` relational table (4,200 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intermediate/tables/leads.parquet</code><span class="file__desc">Intermediate tier `leads` relational table (5,000 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intermediate/tables/touches.parquet</code><span class="file__desc">Intermediate tier `touches` relational table (38,724 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intermediate/tables/sessions.parquet</code><span class="file__desc">Intermediate tier `sessions` relational table (10,012 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intermediate/tables/sales_activities.parquet</code><span class="file__desc">Intermediate tier `sales_activities` relational table (20,679 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intermediate/tables/opportunities.parquet</code><span class="file__desc">Intermediate tier `opportunities` relational table (4,255 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">intermediate/dataset_card.md</code><span class="file__desc">Intermediate tier auto-rendered dataset card.</span></li>
+    <li class="file"><code class="file__path">intermediate/manifest.json</code><span class="file__desc">Intermediate tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).</span></li>
+    </ul>
+  </details>
+  <details class="tier" open>
+    <summary class="tier__name">advanced/ <span class="tier__count">(14 files)</span></summary>
+    <ul class="tier__files">
+    <li class="file"><code class="file__path">advanced/lead_scoring.csv</code><span class="file__desc">Advanced tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.</span></li>
+    <li class="file"><code class="file__path">advanced/feature_dictionary.csv</code><span class="file__desc">Advanced tier feature dictionary (canonical column spec).</span></li>
+    <li class="file"><code class="file__path">advanced/tasks/converted_within_90_days/train.parquet</code><span class="file__desc">Advanced tier train split for `converted_within_90_days` (3,500 rows).</span></li>
+    <li class="file"><code class="file__path">advanced/tasks/converted_within_90_days/valid.parquet</code><span class="file__desc">Advanced tier valid split for `converted_within_90_days` (750 rows).</span></li>
+    <li class="file"><code class="file__path">advanced/tasks/converted_within_90_days/test.parquet</code><span class="file__desc">Advanced tier test split for `converted_within_90_days` (750 rows).</span></li>
+    <li class="file"><code class="file__path">advanced/tables/accounts.parquet</code><span class="file__desc">Advanced tier `accounts` relational table (1,500 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">advanced/tables/contacts.parquet</code><span class="file__desc">Advanced tier `contacts` relational table (4,200 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">advanced/tables/leads.parquet</code><span class="file__desc">Advanced tier `leads` relational table (5,000 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">advanced/tables/touches.parquet</code><span class="file__desc">Advanced tier `touches` relational table (38,208 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">advanced/tables/sessions.parquet</code><span class="file__desc">Advanced tier `sessions` relational table (9,942 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">advanced/tables/sales_activities.parquet</code><span class="file__desc">Advanced tier `sales_activities` relational table (19,995 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">advanced/tables/opportunities.parquet</code><span class="file__desc">Advanced tier `opportunities` relational table (4,004 rows) — snapshot-safe.</span></li>
+    <li class="file"><code class="file__path">advanced/dataset_card.md</code><span class="file__desc">Advanced tier auto-rendered dataset card.</span></li>
+    <li class="file"><code class="file__path">advanced/manifest.json</code><span class="file__desc">Advanced tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).</span></li>
+    </ul>
+  </details>
+</section>
+<section class="schemas">
+  <h2 class="section__heading">Schema / Columns <span class="section__count">(534 columns across 33 tabular files)</span></h2>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intro/lead_scoring.csv</code> <span class="schema__count">(33 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>split</code></td><td class="col__type">string</td><td class="col__desc">Task-split membership: one of `train`, `valid`, `test`. Matches the per-row split assignment in `tasks/converted_within_90_days/`.</td></tr>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc">Opaque account identifier.</td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc">Industry vertical of the buying organization.</td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc">Geographic region of the account&#39;s headquarters.</td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc">Banded employee headcount of the account.</td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc">Banded estimated annual revenue of the account.</td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc">Banded internal process maturity score (latent).</td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc">Opaque contact identifier.</td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc">Functional area of the primary contact (e.g. finance, ops).</td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc">Seniority band of the primary contact.</td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc">Buyer role classification (economic_buyer, champion, etc.).</td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc">Opaque lead identifier.</td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc">ISO-8601 timestamp when the lead was created.</td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc">Origination source of the lead (e.g. inbound_form, sdr_outbound).</td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc">Marketing channel responsible for the first recorded touch.</td></tr>
+      <tr><td class="col__name"><code>touch_count</code></td><td class="col__type">integer</td><td class="col__desc">Total number of marketing/sales touches recorded before snapshot.</td></tr>
+      <tr><td class="col__name"><code>inbound_touch_count</code></td><td class="col__type">integer</td><td class="col__desc">Number of inbound touches before snapshot.</td></tr>
+      <tr><td class="col__name"><code>outbound_touch_count</code></td><td class="col__type">integer</td><td class="col__desc">Number of outbound touches before snapshot.</td></tr>
+      <tr><td class="col__name"><code>session_count</code></td><td class="col__type">integer</td><td class="col__desc">Number of web/trial sessions recorded before snapshot.</td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">integer</td><td class="col__desc">Cumulative pricing page views across all sessions before snapshot.</td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">integer</td><td class="col__desc">Cumulative demo page views across all sessions before snapshot.</td></tr>
+      <tr><td class="col__name"><code>total_session_duration_seconds</code></td><td class="col__type">integer</td><td class="col__desc">Sum of session durations (seconds) before snapshot.</td></tr>
+      <tr><td class="col__name"><code>touches_week_1</code></td><td class="col__type">integer</td><td class="col__desc">Number of touches in the first 7 days after lead creation.</td></tr>
+      <tr><td class="col__name"><code>touches_last_7_days</code></td><td class="col__type">integer</td><td class="col__desc">Number of touches in the last 7 days before snapshot cutoff.</td></tr>
+      <tr><td class="col__name"><code>days_since_first_touch</code></td><td class="col__type">number</td><td class="col__desc">Days between first touch and snapshot cutoff (NaN if no touches).</td></tr>
+      <tr><td class="col__name"><code>activity_count</code></td><td class="col__type">integer</td><td class="col__desc">Number of sales activities logged before snapshot.</td></tr>
+      <tr><td class="col__name"><code>days_since_last_touch</code></td><td class="col__type">number</td><td class="col__desc">Days elapsed between most recent touch and snapshot cutoff.</td></tr>
+      <tr><td class="col__name"><code>opportunity_created</code></td><td class="col__type">boolean</td><td class="col__desc">Whether any opportunity was created by snapshot date (open or closed).</td></tr>
+      <tr><td class="col__name"><code>has_open_opportunity</code></td><td class="col__type">boolean</td><td class="col__desc">Whether an open opportunity existed at snapshot date.</td></tr>
+      <tr><td class="col__name"><code>opportunity_estimated_acv</code></td><td class="col__type">number</td><td class="col__desc">Estimated ACV of the most recent open opportunity (NaN if none).</td></tr>
+      <tr><td class="col__name"><code>expected_acv</code></td><td class="col__type">number</td><td class="col__desc">Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).</td></tr>
+      <tr><td class="col__name"><code>total_touches_all</code></td><td class="col__type">integer</td><td class="col__desc">Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.</td></tr>
+      <tr><td class="col__name"><code>converted_within_90_days</code></td><td class="col__type">boolean</td><td class="col__desc">Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.</td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intro/tasks/converted_within_90_days/train.parquet</code> <span class="schema__count">(32 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>inbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>outbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_session_duration_seconds</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_week_1</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_last_7_days</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_first_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_last_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_created</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>has_open_opportunity</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_estimated_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>expected_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_touches_all</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>converted_within_90_days</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intro/tasks/converted_within_90_days/valid.parquet</code> <span class="schema__count">(32 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>inbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>outbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_session_duration_seconds</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_week_1</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_last_7_days</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_first_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_last_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_created</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>has_open_opportunity</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_estimated_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>expected_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_touches_all</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>converted_within_90_days</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intro/tasks/converted_within_90_days/test.parquet</code> <span class="schema__count">(32 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>inbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>outbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_session_duration_seconds</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_week_1</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_last_7_days</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_first_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_last_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_created</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>has_open_opportunity</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_estimated_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>expected_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_touches_all</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>converted_within_90_days</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intro/tables/accounts.parquet</code> <span class="schema__count">(8 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>company_name</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intro/tables/contacts.parquet</code> <span class="schema__count">(8 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>job_title</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>email_domain_type</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intro/tables/leads.parquet</code> <span class="schema__count">(7 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>owner_rep_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intro/tables/touches.parquet</code> <span class="schema__count">(7 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>touch_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_timestamp</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_type</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_direction</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>campaign_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intro/tables/sessions.parquet</code> <span class="schema__count">(8 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>session_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_timestamp</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_type</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>page_views</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_duration_seconds</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intro/tables/sales_activities.parquet</code> <span class="schema__count">(6 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>activity_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>rep_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_timestamp</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_type</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_outcome</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intro/tables/opportunities.parquet</code> <span class="schema__count">(5 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>opportunity_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>stage</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_acv</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intermediate/lead_scoring.csv</code> <span class="schema__count">(33 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>split</code></td><td class="col__type">string</td><td class="col__desc">Task-split membership: one of <code>train</code>, <code>valid</code>, <code>test</code>. Matches the per-row split assignment in <code>tasks/converted_within_90_days/</code>.</td></tr>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc">Opaque account identifier.</td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc">Industry vertical of the buying organization.</td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc">Geographic region of the account&#39;s headquarters.</td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc">Banded employee headcount of the account.</td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc">Banded estimated annual revenue of the account.</td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc">Banded internal process maturity score (latent).</td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc">Opaque contact identifier.</td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc">Functional area of the primary contact (e.g. finance, ops).</td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc">Seniority band of the primary contact.</td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc">Buyer role classification (economic_buyer, champion, etc.).</td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc">Opaque lead identifier.</td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc">ISO-8601 timestamp when the lead was created.</td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc">Origination source of the lead (e.g. inbound_form, sdr_outbound).</td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc">Marketing channel responsible for the first recorded touch.</td></tr>
+      <tr><td class="col__name"><code>touch_count</code></td><td class="col__type">integer</td><td class="col__desc">Total number of marketing/sales touches recorded before snapshot.</td></tr>
+      <tr><td class="col__name"><code>inbound_touch_count</code></td><td class="col__type">integer</td><td class="col__desc">Number of inbound touches before snapshot.</td></tr>
+      <tr><td class="col__name"><code>outbound_touch_count</code></td><td class="col__type">integer</td><td class="col__desc">Number of outbound touches before snapshot.</td></tr>
+      <tr><td class="col__name"><code>session_count</code></td><td class="col__type">integer</td><td class="col__desc">Number of web/trial sessions recorded before snapshot.</td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">integer</td><td class="col__desc">Cumulative pricing page views across all sessions before snapshot.</td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">integer</td><td class="col__desc">Cumulative demo page views across all sessions before snapshot.</td></tr>
+      <tr><td class="col__name"><code>total_session_duration_seconds</code></td><td class="col__type">integer</td><td class="col__desc">Sum of session durations (seconds) before snapshot.</td></tr>
+      <tr><td class="col__name"><code>touches_week_1</code></td><td class="col__type">integer</td><td class="col__desc">Number of touches in the first 7 days after lead creation.</td></tr>
+      <tr><td class="col__name"><code>touches_last_7_days</code></td><td class="col__type">integer</td><td class="col__desc">Number of touches in the last 7 days before snapshot cutoff.</td></tr>
+      <tr><td class="col__name"><code>days_since_first_touch</code></td><td class="col__type">number</td><td class="col__desc">Days between first touch and snapshot cutoff (NaN if no touches).</td></tr>
+      <tr><td class="col__name"><code>activity_count</code></td><td class="col__type">integer</td><td class="col__desc">Number of sales activities logged before snapshot.</td></tr>
+      <tr><td class="col__name"><code>days_since_last_touch</code></td><td class="col__type">number</td><td class="col__desc">Days elapsed between most recent touch and snapshot cutoff.</td></tr>
+      <tr><td class="col__name"><code>opportunity_created</code></td><td class="col__type">boolean</td><td class="col__desc">Whether any opportunity was created by snapshot date (open or closed).</td></tr>
+      <tr><td class="col__name"><code>has_open_opportunity</code></td><td class="col__type">boolean</td><td class="col__desc">Whether an open opportunity existed at snapshot date.</td></tr>
+      <tr><td class="col__name"><code>opportunity_estimated_acv</code></td><td class="col__type">number</td><td class="col__desc">Estimated ACV of the most recent open opportunity (NaN if none).</td></tr>
+      <tr><td class="col__name"><code>expected_acv</code></td><td class="col__type">number</td><td class="col__desc">Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).</td></tr>
+      <tr><td class="col__name"><code>total_touches_all</code></td><td class="col__type">integer</td><td class="col__desc">Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.</td></tr>
+      <tr><td class="col__name"><code>converted_within_90_days</code></td><td class="col__type">boolean</td><td class="col__desc">Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.</td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intermediate/tasks/converted_within_90_days/train.parquet</code> <span class="schema__count">(32 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>inbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>outbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_session_duration_seconds</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_week_1</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_last_7_days</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_first_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_last_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_created</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>has_open_opportunity</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_estimated_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>expected_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_touches_all</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>converted_within_90_days</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intermediate/tasks/converted_within_90_days/valid.parquet</code> <span class="schema__count">(32 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>inbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>outbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_session_duration_seconds</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_week_1</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_last_7_days</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_first_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_last_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_created</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>has_open_opportunity</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_estimated_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>expected_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_touches_all</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>converted_within_90_days</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intermediate/tasks/converted_within_90_days/test.parquet</code> <span class="schema__count">(32 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>inbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>outbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_session_duration_seconds</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_week_1</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_last_7_days</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_first_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_last_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_created</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>has_open_opportunity</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_estimated_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>expected_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_touches_all</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>converted_within_90_days</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intermediate/tables/accounts.parquet</code> <span class="schema__count">(8 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>company_name</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intermediate/tables/contacts.parquet</code> <span class="schema__count">(8 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>job_title</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>email_domain_type</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intermediate/tables/leads.parquet</code> <span class="schema__count">(7 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>owner_rep_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intermediate/tables/touches.parquet</code> <span class="schema__count">(7 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>touch_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_timestamp</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_type</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_direction</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>campaign_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intermediate/tables/sessions.parquet</code> <span class="schema__count">(8 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>session_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_timestamp</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_type</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>page_views</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_duration_seconds</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intermediate/tables/sales_activities.parquet</code> <span class="schema__count">(6 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>activity_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>rep_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_timestamp</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_type</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_outcome</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>intermediate/tables/opportunities.parquet</code> <span class="schema__count">(5 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>opportunity_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>stage</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_acv</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>advanced/lead_scoring.csv</code> <span class="schema__count">(33 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>split</code></td><td class="col__type">string</td><td class="col__desc">Task-split membership: one of `train`, `valid`, `test`. Matches the per-row split assignment in `tasks/converted_within_90_days/`.</td></tr>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc">Opaque account identifier.</td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc">Industry vertical of the buying organization.</td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc">Geographic region of the account&#39;s headquarters.</td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc">Banded employee headcount of the account.</td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc">Banded estimated annual revenue of the account.</td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc">Banded internal process maturity score (latent).</td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc">Opaque contact identifier.</td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc">Functional area of the primary contact (e.g. finance, ops).</td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc">Seniority band of the primary contact.</td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc">Buyer role classification (economic_buyer, champion, etc.).</td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc">Opaque lead identifier.</td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc">ISO-8601 timestamp when the lead was created.</td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc">Origination source of the lead (e.g. inbound_form, sdr_outbound).</td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc">Marketing channel responsible for the first recorded touch.</td></tr>
+      <tr><td class="col__name"><code>touch_count</code></td><td class="col__type">integer</td><td class="col__desc">Total number of marketing/sales touches recorded before snapshot.</td></tr>
+      <tr><td class="col__name"><code>inbound_touch_count</code></td><td class="col__type">integer</td><td class="col__desc">Number of inbound touches before snapshot.</td></tr>
+      <tr><td class="col__name"><code>outbound_touch_count</code></td><td class="col__type">integer</td><td class="col__desc">Number of outbound touches before snapshot.</td></tr>
+      <tr><td class="col__name"><code>session_count</code></td><td class="col__type">integer</td><td class="col__desc">Number of web/trial sessions recorded before snapshot.</td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">integer</td><td class="col__desc">Cumulative pricing page views across all sessions before snapshot.</td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">integer</td><td class="col__desc">Cumulative demo page views across all sessions before snapshot.</td></tr>
+      <tr><td class="col__name"><code>total_session_duration_seconds</code></td><td class="col__type">integer</td><td class="col__desc">Sum of session durations (seconds) before snapshot.</td></tr>
+      <tr><td class="col__name"><code>touches_week_1</code></td><td class="col__type">integer</td><td class="col__desc">Number of touches in the first 7 days after lead creation.</td></tr>
+      <tr><td class="col__name"><code>touches_last_7_days</code></td><td class="col__type">integer</td><td class="col__desc">Number of touches in the last 7 days before snapshot cutoff.</td></tr>
+      <tr><td class="col__name"><code>days_since_first_touch</code></td><td class="col__type">number</td><td class="col__desc">Days between first touch and snapshot cutoff (NaN if no touches).</td></tr>
+      <tr><td class="col__name"><code>activity_count</code></td><td class="col__type">integer</td><td class="col__desc">Number of sales activities logged before snapshot.</td></tr>
+      <tr><td class="col__name"><code>days_since_last_touch</code></td><td class="col__type">number</td><td class="col__desc">Days elapsed between most recent touch and snapshot cutoff.</td></tr>
+      <tr><td class="col__name"><code>opportunity_created</code></td><td class="col__type">boolean</td><td class="col__desc">Whether any opportunity was created by snapshot date (open or closed).</td></tr>
+      <tr><td class="col__name"><code>has_open_opportunity</code></td><td class="col__type">boolean</td><td class="col__desc">Whether an open opportunity existed at snapshot date.</td></tr>
+      <tr><td class="col__name"><code>opportunity_estimated_acv</code></td><td class="col__type">number</td><td class="col__desc">Estimated ACV of the most recent open opportunity (NaN if none).</td></tr>
+      <tr><td class="col__name"><code>expected_acv</code></td><td class="col__type">number</td><td class="col__desc">Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).</td></tr>
+      <tr><td class="col__name"><code>total_touches_all</code></td><td class="col__type">integer</td><td class="col__desc">Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.</td></tr>
+      <tr><td class="col__name"><code>converted_within_90_days</code></td><td class="col__type">boolean</td><td class="col__desc">Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.</td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>advanced/tasks/converted_within_90_days/train.parquet</code> <span class="schema__count">(32 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>inbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>outbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_session_duration_seconds</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_week_1</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_last_7_days</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_first_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_last_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_created</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>has_open_opportunity</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_estimated_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>expected_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_touches_all</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>converted_within_90_days</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>advanced/tasks/converted_within_90_days/valid.parquet</code> <span class="schema__count">(32 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>inbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>outbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_session_duration_seconds</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_week_1</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_last_7_days</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_first_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_last_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_created</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>has_open_opportunity</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_estimated_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>expected_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_touches_all</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>converted_within_90_days</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>advanced/tasks/converted_within_90_days/test.parquet</code> <span class="schema__count">(32 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>inbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>outbound_touch_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_session_duration_seconds</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_week_1</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touches_last_7_days</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_first_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_count</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>days_since_last_touch</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_created</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>has_open_opportunity</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>opportunity_estimated_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>expected_acv</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>total_touches_all</code></td><td class="col__type">number</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>converted_within_90_days</code></td><td class="col__type">boolean</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>advanced/tables/accounts.parquet</code> <span class="schema__count">(8 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>company_name</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>industry</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>region</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>employee_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_revenue_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>process_maturity_band</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>advanced/tables/contacts.parquet</code> <span class="schema__count">(8 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>job_title</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>role_function</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>seniority</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>buyer_role</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>email_domain_type</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>advanced/tables/leads.parquet</code> <span class="schema__count">(7 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>contact_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>account_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_source</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>first_touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>owner_rep_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>advanced/tables/touches.parquet</code> <span class="schema__count">(7 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>touch_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_timestamp</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_type</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_channel</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>touch_direction</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>campaign_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>advanced/tables/sessions.parquet</code> <span class="schema__count">(8 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>session_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_timestamp</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_type</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>page_views</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>pricing_page_views</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>demo_page_views</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>session_duration_seconds</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>advanced/tables/sales_activities.parquet</code> <span class="schema__count">(6 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>activity_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>rep_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_timestamp</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_type</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>activity_outcome</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+  <details class="schema" open>
+    <summary class="schema__path"><code>advanced/tables/opportunities.parquet</code> <span class="schema__count">(5 columns)</span></summary>
+    <table class="schema__table">
+      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>
+      <tbody>
+      <tr><td class="col__name"><code>opportunity_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>lead_id</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>created_at</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>stage</code></td><td class="col__type">string</td><td class="col__desc"></td></tr>
+      <tr><td class="col__name"><code>estimated_acv</code></td><td class="col__type">integer</td><td class="col__desc"></td></tr>
+      </tbody>
+    </table>
+  </details>
+</section>
+<section class="sources">
+  <h2 class="section__heading">Sources</h2>
+  <ul class="sources__list">
+    <li><a href="https://github.com/leadforge-dev/leadforge" target="_blank" rel="noopener noreferrer">leadforge source repository</a></li>
+    <li><a href="https://github.com/leadforge-dev/leadforge/tree/main/release/validation" target="_blank" rel="noopener noreferrer">v1 release validation report</a></li>
+  </ul>
+</section>
+<footer class="dataset-footer">
+  <div class="dataset-footer__keywords"><span class="chip">b2b</span> <span class="chip">classification</span> <span class="chip">crm</span> <span class="chip">education</span> <span class="chip">lead-scoring</span> <span class="chip">saas</span> <span class="chip">synthetic-data</span> <span class="chip">tabular</span></div>
+  <div class="dataset-footer__license">License: MIT</div>
+  <div class="dataset-footer__note">Local Kaggle preview rendered by scripts/preview_kaggle_page.py — not the live dataset page.</div>
+</footer>
+</main>
+</body>
+</html>
diff --git a/scripts/preview_hf_page.py b/scripts/preview_hf_page.py
new file mode 100644
index 0000000..51dcebd
--- /dev/null
+++ b/scripts/preview_hf_page.py
@@ -0,0 +1,572 @@
+#!/usr/bin/env python3
+"""Render an offline mock of the Hugging Face dataset page.
+
+PR 7.2 — middle PR in Phase 7 (LLM critique + publish).  Reads the
+artefact the publish PR will upload (``release/huggingface/README.md``
+or ``release/huggingface-instructor/README.md``) and renders an HTML
+page that mimics the public HF dataset view: header (pretty_name +
+licence + size pill), tag chips, configs dropdown, file tree, the
+README body, and a footer with the licence and variant note.
+
+Same rationale as ``preview_kaggle_page.py`` — cached previews on
+the live HF page are expensive to roll back, so the publish runbook
+in PR 7.3 cites this script as a required pre-flight.
+
+The rendered HTML is a deterministic function of the input README
+(no ``now()``, no random) — same input → byte-identical HTML.  The
+committed samples at
+``release/_preview_committed/huggingface_{public,instructor}.html``
+are the audit-artefact-sync gate.
+
+Usage::
+
+    # Public variant on http://localhost:8766.
+    python scripts/preview_hf_page.py --open-browser
+
+    # Instructor companion variant (separate input README).
+    python scripts/preview_hf_page.py --variant=instructor
+
+    # Just build the HTML (CI / inspection).
+    python scripts/preview_hf_page.py --no-serve
+
+Exit codes: 0 success / 2 pre-flight error (missing README,
+malformed YAML frontmatter, missing cover image).
+"""
+
+from __future__ import annotations
+
+import argparse
+import http.server
+import re
+import socketserver
+import sys
+import webbrowser
+from collections.abc import Sequence
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Final
+
+import yaml
+
+# Make ``scripts/`` importable regardless of how this file is loaded.
+sys.path.insert(0, str(Path(__file__).resolve().parent))
+
+from _release_common import replace_file  # noqa: E402 — must follow sys.path insert
+
+# ---------------------------------------------------------------------------
+# Defaults
+# ---------------------------------------------------------------------------
+
+DEFAULT_RELEASE_DIR: Final[Path] = Path("release")
+DEFAULT_OUT_DIR_PUBLIC: Final[Path] = Path("release/_preview/huggingface")
+DEFAULT_OUT_DIR_INSTRUCTOR: Final[Path] = Path("release/_preview/huggingface-instructor")
+DEFAULT_PORT: Final[int] = 8766
+
+#: Per-variant relative paths to the README (under ``release_dir``)
+#: and the committed sample HTML (under ``release/_preview_committed/``).
+_VARIANT_README_REL: Final[dict[str, Path]] = {
+    "public": Path("huggingface/README.md"),
+    "instructor": Path("huggingface-instructor/README.md"),
+}
+_VARIANT_SAMPLE_PATH: Final[dict[str, Path]] = {
+    "public": Path("release/_preview_committed/huggingface_public.html"),
+    "instructor": Path("release/_preview_committed/huggingface_instructor.html"),
+}
+VALID_VARIANTS: Final[tuple[str, ...]] = ("public", "instructor")
+
+
+# ---------------------------------------------------------------------------
+# Markdown rendering (gated behind the [publish] extra)
+# ---------------------------------------------------------------------------
+
+
+def _render_markdown(text: str) -> str:
+    """Render ``text`` to HTML using markdown-it-py in GFM-like mode.
+
+    Same posture + dep gating as the Kaggle preview (markdown-it-py
+    via the ``[publish]`` extra; ``linkify`` disabled so the
+    transitive ``linkify-it-py`` dep is not required).  See
+    ``preview_kaggle_page.py`` for the rationale.
+    """
+
+    try:
+        from markdown_it import MarkdownIt
+    except ImportError as exc:  # pragma: no cover — gated by extra
+        raise ImportError(
+            "markdown-it-py is required for the Hugging Face preview page. "
+            "Install the publish extra: pip install -e '.[publish]'"
+        ) from exc
+    md = MarkdownIt("gfm-like").disable("linkify")
+    return md.render(text)
+
+
+# ---------------------------------------------------------------------------
+# Frontmatter parsing
+# ---------------------------------------------------------------------------
+
+#: HF dataset cards open with a ``---`` block of YAML, then the body.
+#: This regex pulls them apart in one shot; ``re.DOTALL`` is essential
+#: because the YAML spans multiple lines.
+_FRONTMATTER_RE: Final[re.Pattern[str]] = re.compile(
+    r"\A---\n(?P<yaml>.*?)\n---\n(?P<body>.*)\Z",
+    re.DOTALL,
+)
+
+
+@dataclass(frozen=True)
+class HuggingFaceDoc:
+    """Parsed HF README — frontmatter dict + body markdown."""
+
+    frontmatter: dict[str, Any]
+    body: str
+
+
+def parse_hf_readme(text: str) -> HuggingFaceDoc:
+    """Split an HF README into YAML frontmatter + Markdown body.
+
+    Raises ``ValueError`` if the document does not open with a
+    ``---``-delimited frontmatter block (every HF dataset card MUST
+    have one — the renderer cannot mock the page without it).
+    """
+
+    match = _FRONTMATTER_RE.match(text)
+    if not match:
+        raise ValueError(
+            "HF README is missing a YAML frontmatter block (expected '---\\n<yaml>\\n---\\n<body>')"
+        )
+    parsed = yaml.safe_load(match.group("yaml")) or {}
+    if not isinstance(parsed, dict):
+        raise ValueError(
+            f"HF README frontmatter is not a YAML mapping (got {type(parsed).__name__})"
+        )
+    return HuggingFaceDoc(frontmatter=parsed, body=match.group("body"))
+
+
+# ---------------------------------------------------------------------------
+# Section renderers — pure, deterministic
+# ---------------------------------------------------------------------------
+
+
+def _escape(value: str) -> str:
+    """HTML-escape a single attribute / text value."""
+
+    return (
+        str(value)
+        .replace("&", "&amp;")
+        .replace("<", "&lt;")
+        .replace(">", "&gt;")
+        .replace('"', "&quot;")
+        .replace("'", "&#39;")
+    )
+
+
+def _render_header(frontmatter: dict[str, Any]) -> str:
+    """Render the page header — pretty_name, licence pill, sizes."""
+
+    pretty_name = _escape(str(frontmatter.get("pretty_name", "")))
+    license_id = _escape(str(frontmatter.get("license", "")))
+    languages = ", ".join(_escape(str(x)) for x in frontmatter.get("language", []) or [])
+    sizes = ", ".join(_escape(str(x)) for x in frontmatter.get("size_categories", []) or [])
+    tasks = ", ".join(_escape(str(x)) for x in frontmatter.get("task_categories", []) or [])
+    return f"""<header class="dataset-header">
+  <div class="dataset-header__namespace">huggingface.co/datasets</div>
+  <h1 class="dataset-header__title">{pretty_name}</h1>
+  <ul class="dataset-header__pills">
+    <li class="pill pill--license">License: {license_id}</li>
+    <li class="pill pill--task">Task: {tasks}</li>
+    <li class="pill pill--size">Size: {sizes}</li>
+    <li class="pill pill--language">Language: {languages}</li>
+  </ul>
+</header>"""
+
+
+def _render_tags(frontmatter: dict[str, Any]) -> str:
+    """Render the tag chip row (mimics HF tag pills under the header)."""
+
+    tags = frontmatter.get("tags", []) or []
+    if not tags:
+        return ""
+    chips = " ".join(f'<span class="chip">{_escape(str(t))}</span>' for t in tags)
+    return f'<section class="tags">\n  {chips}\n</section>'
+
+
+def _render_configs(frontmatter: dict[str, Any]) -> str:
+    """Render the configs dropdown — one entry per ``configs[]`` block.
+
+    Mirrors HF's "Subset" selector at the top of the dataset viewer.
+    Each config lists its data_files (split → path) so the test can
+    assert every config block from the YAML round-trips through to
+    the rendered page.  The default config is flagged.
+    """
+
+    configs = frontmatter.get("configs", []) or []
+    if not configs:
+        return '<section class="configs"><p>No configs declared.</p></section>'
+    blocks: list[str] = []
+    for config in configs:
+        config_name = _escape(str(config.get("config_name", "")))
+        is_default = bool(config.get("default"))
+        default_badge = ' <span class="badge badge--default">default</span>' if is_default else ""
+        data_files = config.get("data_files", []) or []
+        rows = "\n".join(
+            f"      <tr><td>{_escape(str(df.get('split', '')))}</td>"
+            f"<td><code>{_escape(str(df.get('path', '')))}</code></td></tr>"
+            for df in data_files
+        )
+        blocks.append(
+            f'  <details class="config" open>\n'
+            f'    <summary class="config__name"><code>{config_name}</code>{default_badge} '
+            f'<span class="config__count">({len(data_files)} splits)</span>'
+            f"</summary>\n"
+            f'    <table class="config__table">\n'
+            f"      <thead><tr><th>Split</th><th>Path</th></tr></thead>\n"
+            f"      <tbody>\n{rows}\n      </tbody>\n"
+            f"    </table>\n"
+            f"  </details>"
+        )
+    return f"""<section class="configs">
+  <h2 class="section__heading">Configurations / Subsets <span class="section__count">({len(configs)} configs)</span></h2>
+{chr(10).join(blocks)}
+</section>"""
+
+
+def _render_file_tree(frontmatter: dict[str, Any], variant: str) -> str:
+    """Render the file tree.
+
+    HF doesn't ship a structured file inventory in the dataset card
+    YAML the way Kaggle does — ``data_files`` are the only paths
+    declared in the frontmatter.  We list each declared path tagged
+    with its config name.  The tree is therefore narrower than the
+    real dataset (which also has ``manifest.json``, ``tables/``, etc.)
+    but matches what the YAML knows about, which is what the publish
+    runbook is trying to verify.
+    """
+
+    configs = frontmatter.get("configs", []) or []
+    paths: list[tuple[str, str]] = []
+    for config in configs:
+        config_name = str(config.get("config_name", ""))
+        for df in config.get("data_files", []) or []:
+            paths.append((config_name, str(df.get("path", ""))))
+    if not paths:
+        return ""
+    items = "\n".join(
+        f'    <li class="file"><span class="file__config">[{_escape(c)}]</span> '
+        f'<code class="file__path">{_escape(p)}</code></li>'
+        for c, p in paths
+    )
+    return f"""<section class="files">
+  <h2 class="section__heading">Files declared in YAML <span class="section__count">({len(paths)} files / variant: {_escape(variant)})</span></h2>
+  <ul class="files__list">
+{items}
+  </ul>
+</section>"""
+
+
+def _render_readme_body(body_md: str) -> str:
+    """Render the README body (everything after the YAML)."""
+
+    return f'<section class="readme">\n{_render_markdown(body_md)}</section>'
+
+
+def _render_footer(frontmatter: dict[str, Any], variant: str) -> str:
+    """Render the licence + variant note footer."""
+
+    license_id = _escape(str(frontmatter.get("license", "")))
+    return f"""<footer class="dataset-footer">
+  <div class="dataset-footer__license">License: {license_id}</div>
+  <div class="dataset-footer__variant">Variant: <code>{_escape(variant)}</code></div>
+  <div class="dataset-footer__note">Local Hugging Face preview rendered by scripts/preview_hf_page.py — not the live dataset page.</div>
+</footer>"""
+
+
+# ---------------------------------------------------------------------------
+# HTML wrapper + minimal HF-ish CSS
+# ---------------------------------------------------------------------------
+
+#: Inlined for the same reasons as the Kaggle preview — single
+#: self-contained file, simple byte-comparison in the audit-sync test,
+#: works without the server.
+_PAGE_CSS: Final[str] = """\
+:root { --bg:#fff; --fg:#1f2937; --muted:#6b7280; --accent:#ff9d00; --border:#e5e7eb; --pill-bg:#f3f4f6; --code-bg:#f9fafb; }
+body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, sans-serif; color: var(--fg); background: var(--bg); margin: 0; padding: 0; line-height: 1.6; }
+.container { max-width: 1100px; margin: 0 auto; padding: 24px 32px; }
+.dataset-header { border-bottom: 1px solid var(--border); padding-bottom: 16px; margin-bottom: 24px; }
+.dataset-header__namespace { color: var(--muted); font-size: 0.85em; font-family: monospace; margin-bottom: 4px; }
+.dataset-header__title { font-size: 1.8em; margin: 0 0 12px 0; }
+.dataset-header__pills { list-style: none; padding: 0; margin: 0; display: flex; flex-wrap: wrap; gap: 8px; }
+.pill { background: var(--pill-bg); border-radius: 12px; padding: 4px 12px; font-size: 0.85em; color: var(--fg); }
+.tags { margin: 0 0 24px 0; }
+.chip { display: inline-block; background: var(--pill-bg); border-radius: 12px; padding: 2px 10px; margin: 2px 4px 2px 0; font-size: 0.85em; color: var(--fg); }
+.section__heading { font-size: 1.3em; border-bottom: 2px solid var(--accent); padding-bottom: 4px; margin-top: 32px; }
+.section__count { color: var(--muted); font-size: 0.7em; font-weight: normal; }
+.config, .file-tree { border: 1px solid var(--border); border-radius: 4px; padding: 8px 12px; margin: 8px 0; }
+.config__name { cursor: pointer; font-weight: 600; }
+.config__count { color: var(--muted); font-weight: normal; font-size: 0.85em; }
+.badge { display: inline-block; padding: 1px 8px; border-radius: 4px; font-size: 0.75em; font-weight: 600; vertical-align: middle; margin-left: 4px; }
+.badge--default { background: var(--accent); color: white; }
+.config__table { width: 100%; border-collapse: collapse; margin-top: 8px; font-size: 0.9em; }
+.config__table th, .config__table td { text-align: left; padding: 6px 8px; border-bottom: 1px solid var(--border); }
+.config__table th { background: var(--pill-bg); font-weight: 600; }
+.files__list { list-style: none; padding-left: 0; margin: 0; }
+.file { padding: 4px 0; border-bottom: 1px dotted var(--border); }
+.file:last-child { border-bottom: none; }
+.file__config { color: var(--muted); font-size: 0.85em; margin-right: 8px; }
+.file__path { color: var(--accent); }
+.readme { margin: 24px 0; }
+.readme code { background: var(--code-bg); padding: 1px 4px; border-radius: 2px; font-size: 0.9em; }
+.readme pre { background: var(--code-bg); padding: 12px; border-radius: 4px; overflow-x: auto; }
+.readme pre code { background: none; padding: 0; }
+.readme table { border-collapse: collapse; margin: 12px 0; }
+.readme th, .readme td { border: 1px solid var(--border); padding: 6px 10px; text-align: left; }
+.readme blockquote { border-left: 3px solid var(--accent); padding-left: 12px; color: var(--muted); margin: 12px 0; }
+.dataset-footer { margin-top: 48px; padding-top: 16px; border-top: 1px solid var(--border); color: var(--muted); font-size: 0.9em; }
+.dataset-footer__note { font-style: italic; margin-top: 8px; }
+"""
+
+
+def _wrap_html(*, title: str, body: str) -> str:
+    """Wrap the rendered sections in page chrome.
+
+    Order: header → tags → configs → files → readme body → footer.
+    Configs sit above the README because that's the primary affordance
+    on the live HF dataset page (the user picks a subset before
+    reading the body).
+    """
+
+    return f"""<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <title>HF preview — {_escape(title)}</title>
+  <style>{_PAGE_CSS}</style>
+</head>
+<body>
+<main class="container">
+{body}
+</main>
+</body>
+</html>
+"""
+
+
+# ---------------------------------------------------------------------------
+# Top-level renderer
+# ---------------------------------------------------------------------------
+
+
+def render_hf_html(doc: HuggingFaceDoc, *, variant: str) -> str:
+    """Render the full HF preview HTML.
+
+    Pure function: same ``(doc, variant)`` → byte-identical HTML.
+    No I/O, no clock, no random.  Tests rely on this for the
+    audit-artefact-sync gate.
+    """
+
+    body_parts = [
+        _render_header(doc.frontmatter),
+        _render_tags(doc.frontmatter),
+        _render_configs(doc.frontmatter),
+        _render_file_tree(doc.frontmatter, variant=variant),
+        _render_readme_body(doc.body),
+        _render_footer(doc.frontmatter, variant=variant),
+    ]
+    return _wrap_html(
+        title=str(doc.frontmatter.get("pretty_name", "")),
+        body="\n".join(p for p in body_parts if p),
+    )
+
+
+# ---------------------------------------------------------------------------
+# Driver — reads inputs, writes HTML, optionally serves
+# ---------------------------------------------------------------------------
+
+
+@dataclass(frozen=True)
+class PreviewConfig:
+    """Frozen driver config — built from CLI args or test input."""
+
+    release_dir: Path
+    out_dir: Path
+    port: int
+    variant: str
+    open_browser: bool
+    serve: bool
+
+
+@dataclass(frozen=True)
+class PreviewOutcome:
+    """Return value from :func:`run_preview` — used by tests + CLI."""
+
+    html_path: Path
+    cover_path: Path | None
+
+
+def _resolve_cover_image(release_dir: Path, variant: str) -> Path:
+    """Locate the cover image for the variant.
+
+    The HF packager (PR 5.2) copies the cover image into both
+    ``release/huggingface/`` and ``release/huggingface-instructor/``
+    next to each README.  Prefer the variant-tree copy (closest to
+    the artefact the publish PR will upload); fall back to
+    ``release_dir`` for the case where the assembler hasn't been
+    run yet.
+    """
+
+    variant_dir = "huggingface" if variant == "public" else "huggingface-instructor"
+    candidates = [
+        release_dir / variant_dir / "dataset-cover-image.png",
+        release_dir / "dataset-cover-image.png",
+    ]
+    for candidate in candidates:
+        if candidate.is_file():
+            return candidate
+    return candidates[0]
+
+
+def run_preview(config: PreviewConfig) -> PreviewOutcome:
+    """Render the preview HTML, optionally serve it.
+
+    Pre-flight failures (missing README, malformed YAML, missing
+    cover image, unknown variant) raise — the CLI converts to rc=2.
+    """
+
+    if config.variant not in VALID_VARIANTS:
+        raise ValueError(f"unknown --variant {config.variant!r}; expected one of {VALID_VARIANTS}")
+
+    readme_path = config.release_dir / _VARIANT_README_REL[config.variant]
+    if not readme_path.is_file():
+        raise FileNotFoundError(
+            f"HF README not found at {readme_path}; "
+            f"regenerate via scripts/package_hf_release.py "
+            f"--variant={config.variant} first"
+        )
+    doc = parse_hf_readme(readme_path.read_text(encoding="utf-8"))
+
+    cover_src = _resolve_cover_image(config.release_dir, config.variant)
+    if not cover_src.is_file():
+        raise FileNotFoundError(
+            f"cover image not found at {cover_src} (looked in "
+            f"{config.release_dir}/huggingface{'-instructor' if config.variant == 'instructor' else ''}/ "
+            f"and {config.release_dir}/)"
+        )
+
+    config.out_dir.mkdir(parents=True, exist_ok=True)
+    html_path = config.out_dir / "index.html"
+    html_path.write_text(render_hf_html(doc, variant=config.variant), encoding="utf-8")
+
+    cover_dst = config.out_dir / "dataset-cover-image.png"
+    replace_file(cover_src, cover_dst)
+
+    if config.serve:
+        _serve(config.out_dir, config.port, open_browser=config.open_browser)
+
+    return PreviewOutcome(html_path=html_path, cover_path=cover_dst)
+
+
+def _serve(directory: Path, port: int, *, open_browser: bool) -> None:
+    """Start a stdlib HTTP server rooted at ``directory`` and block.
+
+    Same posture as the Kaggle preview — see that module for rationale.
+    """
+
+    handler_factory = _make_handler_factory(directory)
+    url = f"http://localhost:{port}/"
+    print(f"serving {directory} at {url} — Ctrl-C to stop", file=sys.stderr)
+    if open_browser:
+        webbrowser.open(url)
+    with socketserver.ThreadingTCPServer(("", port), handler_factory) as httpd:
+        httpd.serve_forever()
+
+
+def _make_handler_factory(directory: Path) -> type[http.server.SimpleHTTPRequestHandler]:
+    resolved = str(directory.resolve())
+
+    class _Handler(http.server.SimpleHTTPRequestHandler):
+        def __init__(self, *args: Any, **kwargs: Any) -> None:
+            super().__init__(*args, directory=resolved, **kwargs)
+
+    return _Handler
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+
+def parse_args(argv: Sequence[str] | None = None) -> argparse.Namespace:
+    """Parse the CLI.  Free function so tests can build a Namespace."""
+
+    parser = argparse.ArgumentParser(
+        prog="preview_hf_page",
+        description=__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    parser.add_argument(
+        "--release-dir",
+        type=Path,
+        default=DEFAULT_RELEASE_DIR,
+        help="release tree containing huggingface[-instructor]/README.md (default: %(default)s)",
+    )
+    parser.add_argument(
+        "--out-dir",
+        type=Path,
+        default=None,
+        help=(
+            "where to write the rendered preview "
+            "(default: release/_preview/huggingface for variant=public, "
+            "release/_preview/huggingface-instructor for variant=instructor)"
+        ),
+    )
+    parser.add_argument(
+        "--port",
+        type=int,
+        default=DEFAULT_PORT,
+        help="port for the local HTTP server (default: %(default)s)",
+    )
+    parser.add_argument(
+        "--variant",
+        choices=VALID_VARIANTS,
+        default="public",
+        help="public (3-tier) or instructor (companion repo); default: %(default)s",
+    )
+    parser.add_argument(
+        "--open-browser",
+        action="store_true",
+        help="pop a browser tab on the served URL after the page renders",
+    )
+    parser.add_argument(
+        "--no-serve",
+        action="store_true",
+        help="render the HTML and exit; don't start the server (CI / inspection mode)",
+    )
+    return parser.parse_args(argv)
+
+
+def main(argv: Sequence[str] | None = None) -> int:
+    args = parse_args(argv)
+    out_dir: Path = args.out_dir or (
+        DEFAULT_OUT_DIR_PUBLIC if args.variant == "public" else DEFAULT_OUT_DIR_INSTRUCTOR
+    )
+    config = PreviewConfig(
+        release_dir=args.release_dir,
+        out_dir=out_dir,
+        port=args.port,
+        variant=args.variant,
+        open_browser=args.open_browser,
+        serve=not args.no_serve,
+    )
+    try:
+        outcome = run_preview(config)
+    except FileNotFoundError as exc:
+        print(f"error: {exc}", file=sys.stderr)
+        return 2
+    except (ValueError, yaml.YAMLError) as exc:  # malformed YAML must also map to rc=2
+        print(f"error: {exc}", file=sys.stderr)
+        return 2
+    print(f"wrote {outcome.html_path}", file=sys.stderr)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/preview_kaggle_page.py b/scripts/preview_kaggle_page.py
new file mode 100644
index 0000000..ddc4b73
--- /dev/null
+++ b/scripts/preview_kaggle_page.py
@@ -0,0 +1,608 @@
+#!/usr/bin/env python3
+"""Render an offline mock of the Kaggle dataset page.
+
+PR 7.2 — middle PR in Phase 7 (LLM critique + publish).  Reads the
+artefacts the publish PR will upload (``release/kaggle/dataset-metadata.json``
++ ``release/dataset-cover-image.png``) and renders an HTML page that
+mimics the public Kaggle dataset view: header (title / subtitle /
+licence / id pill / update-frequency pill), cover image, rendered
+description (the inlined README body), file tree of declared
+resources, schema/columns tables for every tabular resource, and a
+licence + sources footer.
+
+The page exists for human click-through review BEFORE the maintainer
+runs the real ``kaggle datasets create`` upload (PR 7.3).  Cached
+previews on the live page are expensive to roll back, so the
+publish runbook in PR 7.3 cites this script as a required pre-flight.
+
+The rendered HTML is a deterministic function of the input artefacts
+(no ``now()``, no random) — same metadata + cover-image filename →
+byte-identical HTML.  The committed sample at
+``release/_preview_committed/kaggle.html`` is the audit-artefact-sync
+gate (mirrors PR 4.1 / 5.1 / 5.2 / 7.1).
+
+Usage::
+
+    # Render + serve on http://localhost:8765, pop a browser tab.
+    python scripts/preview_kaggle_page.py --open-browser
+
+    # Just build the HTML (CI / inspection); no server.
+    python scripts/preview_kaggle_page.py --no-serve
+
+Exit codes: 0 success / 2 pre-flight error (missing metadata,
+missing cover image, malformed JSON).
+"""
+
+from __future__ import annotations
+
+import argparse
+import http.server
+import json
+import re
+import socketserver
+import sys
+import webbrowser
+from collections.abc import Sequence
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Final
+
+# Make ``scripts/`` importable regardless of how this file is loaded
+# (CLI entrypoint, ``importlib.util.spec_from_file_location`` from
+# tests).  Mirrors the pattern in ``package_kaggle_release.py``.
+sys.path.insert(0, str(Path(__file__).resolve().parent))
+
+from _release_common import replace_file  # noqa: E402 — must follow sys.path insert
+
+# ---------------------------------------------------------------------------
+# Defaults
+# ---------------------------------------------------------------------------
+
+DEFAULT_RELEASE_DIR: Final[Path] = Path("release")
+DEFAULT_OUT_DIR: Final[Path] = Path("release/_preview/kaggle")
+DEFAULT_PORT: Final[int] = 8765
+
+#: The committed sample HTML used by the audit-artefact-sync test.
+#: Located outside ``release/_preview/`` so it is not gitignored.
+COMMITTED_SAMPLE_PATH: Final[Path] = Path("release/_preview_committed/kaggle.html")
+
+
+# ---------------------------------------------------------------------------
+# Markdown rendering (gated behind the [publish] extra)
+# ---------------------------------------------------------------------------
+
+
+def _render_markdown(text: str) -> str:
+    """Render ``text`` (the inlined README body) to HTML.
+
+    Uses ``markdown-it-py`` in GFM-like mode (tables, fenced code,
+    strikethrough; bare-URL linkify is disabled below), the closest
+    match to Kaggle's description rendering.  The ``[publish]`` extra
+    (alongside ``datasets`` / ``kaggle``) is the install path; a missing
+    dependency raises an ``ImportError`` that names the extra.
+    Footnotes (``[^foo]``) render as literal text, which is faithful
+    enough — Kaggle does not invest in footnote rendering either.
+    """
+
+    try:
+        from markdown_it import MarkdownIt
+    except ImportError as exc:  # pragma: no cover — gated by extra
+        raise ImportError(
+            "markdown-it-py is required for the Kaggle preview page. "
+            "Install the publish extra: pip install -e '.[publish]'"
+        ) from exc
+    # ``gfm-like`` enables linkify by default, which requires the
+    # separate ``linkify-it-py`` package; we explicitly turn it off so
+    # the preview does not pull a transitive dep beyond markdown-it-py.
+    # Tables / fenced code / strikethrough remain on (the bits that
+    # actually matter for faithful Kaggle/HF rendering).
+    md = MarkdownIt("gfm-like").disable("linkify")
+    return md.render(text)
+
+
+# ---------------------------------------------------------------------------
+# Tier inference + file tree
+# ---------------------------------------------------------------------------
+
+#: Kaggle's CLI emits resource paths like ``intro/lead_scoring.csv`` —
+#: the leading path segment is the tier name.  We group resources by
+#: this segment so the rendered file tree mirrors the bundle layout
+#: the user will see on Kaggle.
+_TIER_PATH_RE: Final[re.Pattern[str]] = re.compile(r"^([^/]+)/")
+
+
+def _tier_of(resource_path: str) -> str:
+    """Return the leading path segment of ``resource_path``, or ``""``.
+
+    Used to bucket resources by tier in the file tree.  An empty
+    string indicates a top-level resource (none of these are emitted
+    by the Kaggle packager today, but we tolerate them for forward
+    compatibility).
+    """
+
+    match = _TIER_PATH_RE.match(resource_path)
+    return match.group(1) if match else ""
+
+
+# ---------------------------------------------------------------------------
+# Section renderers — pure, deterministic
+# ---------------------------------------------------------------------------
+
+
+def _render_header(metadata: dict[str, Any]) -> str:
+    """Render the page header: id, title, subtitle, and licence / frequency / visibility pills."""
+
+    title = _escape(metadata["title"])
+    subtitle = _escape(metadata["subtitle"])
+    dataset_id = _escape(metadata["id"])
+    license_name = _escape(metadata["licenses"][0]["name"]) if metadata.get("licenses") else ""
+    update_freq = _escape(metadata.get("expectedUpdateFrequency", ""))
+    visibility = "Private" if metadata.get("isPrivate") else "Public"
+
+    return f"""<header class="dataset-header">
+  <div class="dataset-header__id">{dataset_id}</div>
+  <h1 class="dataset-header__title">{title}</h1>
+  <p class="dataset-header__subtitle">{subtitle}</p>
+  <ul class="dataset-header__pills">
+    <li class="pill pill--license">License: {license_name}</li>
+    <li class="pill pill--frequency">Updates: {update_freq}</li>
+    <li class="pill pill--visibility">Visibility: {visibility}</li>
+  </ul>
+</header>"""
+
+
+def _render_cover(cover_image_filename: str) -> str:
+    """Render the cover-image block.
+
+    The ``src`` is a sibling-relative path so the same HTML works
+    against both the runtime preview tree (where the image was copied
+    in) and the committed sample (used for byte-equality only — the
+    sample is not served).
+    """
+
+    src = _escape(cover_image_filename)
+    return f"""<section class="cover">
+  <img class="cover__image" src="{src}" alt="Dataset cover image">
+</section>"""
+
+
+def _render_description(description_md: str) -> str:
+    """Render the inlined README body as HTML."""
+
+    body = _render_markdown(description_md)
+    return f'<section class="description">\n{body}</section>'
+
+
+def _render_file_tree(resources: list[dict[str, Any]]) -> str:
+    """Render the file tree, grouped by tier (leading path segment).
+
+    Inside each tier, files appear in declaration order, matching
+    the order in which Kaggle renders the resources column.  Each
+    entry is a monospace path followed by the resource description.
+    """
+
+    by_tier: dict[str, list[dict[str, Any]]] = {}
+    for resource in resources:
+        tier = _tier_of(resource["path"])
+        by_tier.setdefault(tier, []).append(resource)
+
+    blocks: list[str] = []
+    for tier, tier_resources in by_tier.items():
+        tier_label = _escape(tier) if tier else "(top-level)"
+        items: list[str] = []
+        for resource in tier_resources:
+            path = _escape(resource["path"])
+            description = _escape(resource.get("description", ""))
+            items.append(
+                f'    <li class="file"><code class="file__path">{path}</code>'
+                f'<span class="file__desc">{description}</span></li>'
+            )
+        blocks.append(
+            f'  <details class="tier" open>\n'
+            f'    <summary class="tier__name">{tier_label}/ '
+            f'<span class="tier__count">({len(tier_resources)} files)</span>'
+            f"</summary>\n"
+            f'    <ul class="tier__files">\n' + "\n".join(items) + "\n    </ul>\n"
+            "  </details>"
+        )
+    file_count = len(resources)
+    return f"""<section class="files">
+  <h2 class="section__heading">Data Files <span class="section__count">({file_count} total)</span></h2>
+{chr(10).join(blocks)}
+</section>"""
+
+
+def _render_schema_tables(resources: list[dict[str, Any]]) -> str:
+    """Render one schema/columns table per tabular resource.
+
+    Mimics Kaggle's "Data Card" expandable per-file column listing.
+    Resources without a ``schema`` (markdown / JSON) are skipped —
+    same posture as Kaggle.  Column count appears in the heading so
+    the test can assert the table is exhaustive without parsing the
+    DOM.
+    """
+
+    blocks: list[str] = []
+    total_columns = 0
+    for resource in resources:
+        schema = resource.get("schema")
+        if not schema:
+            continue
+        fields = schema.get("fields", [])
+        if not fields:
+            continue
+        total_columns += len(fields)
+        path = _escape(resource["path"])
+        rows: list[str] = []
+        for fd in fields:
+            name = _escape(fd.get("name", ""))
+            ftype = _escape(fd.get("type", ""))
+            description = _escape(fd.get("description", ""))
+            rows.append(
+                f"      <tr>"
+                f'<td class="col__name"><code>{name}</code></td>'
+                f'<td class="col__type">{ftype}</td>'
+                f'<td class="col__desc">{description}</td>'
+                f"</tr>"
+            )
+        blocks.append(
+            f'  <details class="schema" open>\n'
+            f'    <summary class="schema__path"><code>{path}</code> '
+            f'<span class="schema__count">({len(fields)} columns)</span>'
+            f"</summary>\n"
+            f'    <table class="schema__table">\n'
+            f"      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>\n"
+            f"      <tbody>\n" + "\n".join(rows) + "\n      </tbody>\n"
+            "    </table>\n"
+            "  </details>"
+        )
+    return f"""<section class="schemas">
+  <h2 class="section__heading">Schema / Columns <span class="section__count">({total_columns} columns across {len(blocks)} tabular files)</span></h2>
+{chr(10).join(blocks)}
+</section>"""
+
+
+def _render_sources(metadata: dict[str, Any]) -> str:
+    """Render the user-specified sources block."""
+
+    sources = metadata.get("userSpecifiedSources", []) or []
+    if not sources:
+        return ""
+    items = "\n".join(
+        f'    <li><a href="{_escape(s["url"])}" target="_blank" rel="noopener noreferrer">'
+        f"{_escape(s['title'])}</a></li>"
+        for s in sources
+    )
+    return f"""<section class="sources">
+  <h2 class="section__heading">Sources</h2>
+  <ul class="sources__list">
+{items}
+  </ul>
+</section>"""
+
+
+def _render_footer(metadata: dict[str, Any]) -> str:
+    """Render the licence + keywords footer."""
+
+    keywords = metadata.get("keywords", []) or []
+    keyword_chips = " ".join(f'<span class="chip">{_escape(k)}</span>' for k in keywords)
+    license_name = _escape(metadata["licenses"][0]["name"]) if metadata.get("licenses") else ""
+    return f"""<footer class="dataset-footer">
+  <div class="dataset-footer__keywords">{keyword_chips}</div>
+  <div class="dataset-footer__license">License: {license_name}</div>
+  <div class="dataset-footer__note">Local Kaggle preview rendered by scripts/preview_kaggle_page.py — not the live dataset page.</div>
+</footer>"""
+
+
+# ---------------------------------------------------------------------------
+# HTML wrapper + minimal Kaggle-ish CSS
+# ---------------------------------------------------------------------------
+
+#: Kept inline rather than served as a separate ``style.css`` so the
+#: rendered HTML is a single self-contained file — easier to inspect,
+#: easier to byte-compare in the audit-artefact-sync test, and works
+#: without a server (open the committed sample directly in a browser).
+_PAGE_CSS: Final[str] = """\
+:root { --bg:#fff; --fg:#202124; --muted:#5f6368; --accent:#20beff; --border:#e0e0e0; --pill-bg:#f1f3f4; }
+body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif; color: var(--fg); background: var(--bg); margin: 0; padding: 0; line-height: 1.5; }
+.container { max-width: 1100px; margin: 0 auto; padding: 24px 32px; }
+.dataset-header { border-bottom: 1px solid var(--border); padding-bottom: 16px; margin-bottom: 24px; }
+.dataset-header__id { color: var(--muted); font-size: 0.85em; font-family: monospace; margin-bottom: 4px; }
+.dataset-header__title { font-size: 1.8em; margin: 0 0 4px 0; }
+.dataset-header__subtitle { color: var(--muted); margin: 0 0 12px 0; }
+.dataset-header__pills { list-style: none; padding: 0; margin: 0; display: flex; flex-wrap: wrap; gap: 8px; }
+.pill { background: var(--pill-bg); border-radius: 12px; padding: 4px 12px; font-size: 0.85em; color: var(--fg); }
+.cover { margin: 0 0 24px 0; border: 1px solid var(--border); border-radius: 4px; overflow: hidden; }
+.cover__image { display: block; max-width: 100%; height: auto; }
+.section__heading { font-size: 1.3em; border-bottom: 2px solid var(--accent); padding-bottom: 4px; margin-top: 32px; }
+.section__count { color: var(--muted); font-size: 0.7em; font-weight: normal; }
+.tier, .schema { border: 1px solid var(--border); border-radius: 4px; padding: 8px 12px; margin: 8px 0; }
+.tier__name, .schema__path { cursor: pointer; font-weight: 600; }
+.tier__count, .schema__count { color: var(--muted); font-weight: normal; font-size: 0.85em; }
+.tier__files { list-style: none; padding: 8px 0 0 0; margin: 0; }
+.file { display: flex; gap: 12px; padding: 4px 0; border-bottom: 1px dotted var(--border); }
+.file:last-child { border-bottom: none; }
+.file__path { color: var(--accent); flex-shrink: 0; }
+.file__desc { color: var(--muted); font-size: 0.9em; }
+.schema__table { width: 100%; border-collapse: collapse; margin-top: 8px; font-size: 0.9em; }
+.schema__table th, .schema__table td { text-align: left; padding: 6px 8px; border-bottom: 1px solid var(--border); vertical-align: top; }
+.schema__table th { background: var(--pill-bg); font-weight: 600; }
+.col__name code { background: none; }
+.col__type { color: var(--muted); font-family: monospace; }
+.description { margin: 24px 0; }
+.description code { background: var(--pill-bg); padding: 1px 4px; border-radius: 2px; font-size: 0.9em; }
+.description pre { background: var(--pill-bg); padding: 12px; border-radius: 4px; overflow-x: auto; }
+.description pre code { background: none; padding: 0; }
+.description table { border-collapse: collapse; margin: 12px 0; }
+.description th, .description td { border: 1px solid var(--border); padding: 6px 10px; text-align: left; }
+.description blockquote { border-left: 3px solid var(--accent); padding-left: 12px; color: var(--muted); margin: 12px 0; }
+.sources__list { padding-left: 20px; }
+.dataset-footer { margin-top: 48px; padding-top: 16px; border-top: 1px solid var(--border); color: var(--muted); font-size: 0.9em; }
+.dataset-footer__keywords { margin-bottom: 8px; }
+.chip { display: inline-block; background: var(--pill-bg); border-radius: 12px; padding: 2px 10px; margin: 2px; font-size: 0.85em; }
+.dataset-footer__note { font-style: italic; margin-top: 8px; }
+"""
+
+
+def _wrap_html(*, title: str, body: str) -> str:
+    """Wrap rendered sections in the page chrome.
+
+    Order: header → cover → description → files → schemas → sources →
+    footer.  Description sits above files because Kaggle leads with
+    the dataset card on the public page.
+    """
+
+    return f"""<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <title>Kaggle preview — {_escape(title)}</title>
+  <style>{_PAGE_CSS}</style>
+</head>
+<body>
+<main class="container">
+{body}
+</main>
+</body>
+</html>
+"""
+
+
+def _escape(value: str) -> str:
+    """HTML-escape a single attribute / text value.
+
+    Inlined rather than importing ``html.escape`` so the renderer's
+    surface stays small and the (well-tested) substitution is local
+    and obvious.
+    """
+
+    return (
+        str(value)
+        .replace("&", "&amp;")
+        .replace("<", "&lt;")
+        .replace(">", "&gt;")
+        .replace('"', "&quot;")
+        .replace("'", "&#39;")
+    )
+
+
+# ---------------------------------------------------------------------------
+# Top-level renderer
+# ---------------------------------------------------------------------------
+
+
+def render_kaggle_html(metadata: dict[str, Any], cover_image_filename: str) -> str:
+    """Render the full Kaggle preview HTML.
+
+    Pure function: same ``(metadata, cover_image_filename)`` →
+    byte-identical HTML.  No I/O, no clock, no random.  Tests rely
+    on this for the audit-artefact-sync gate.
+    """
+
+    body_parts = [
+        _render_header(metadata),
+        _render_cover(cover_image_filename),
+        _render_description(metadata.get("description", "")),
+        _render_file_tree(metadata.get("resources", [])),
+        _render_schema_tables(metadata.get("resources", [])),
+        _render_sources(metadata),
+        _render_footer(metadata),
+    ]
+    return _wrap_html(title=metadata.get("title", ""), body="\n".join(p for p in body_parts if p))
+
+
+# ---------------------------------------------------------------------------
+# Driver — reads inputs, writes HTML, optionally serves
+# ---------------------------------------------------------------------------
+
+
+@dataclass(frozen=True)
+class PreviewConfig:
+    """Frozen driver config.
+
+    Mirrors the ``DriverConfig`` posture in
+    ``scripts/run_llm_critique.py`` — building this from CLI args
+    keeps the test surface a Python-level call rather than an exec.
+    """
+
+    release_dir: Path
+    out_dir: Path
+    port: int
+    open_browser: bool
+    serve: bool
+
+
+@dataclass(frozen=True)
+class PreviewOutcome:
+    """Return value from :func:`run_preview` — used by tests + CLI."""
+
+    html_path: Path
+    cover_path: Path | None
+
+
+def _resolve_cover_image(release_dir: Path, image_name: str) -> Path:
+    """Locate the cover image referenced by the metadata's ``image``.
+
+    The Kaggle packager (PR 5.1) copies the cover image into
+    ``release/kaggle/`` next to ``dataset-metadata.json`` AND leaves
+    the master copy at ``release/dataset-cover-image.png``.  We
+    prefer the kaggle-tree copy (closer to the artefact the publish
+    PR will upload) and fall back to ``release_dir`` for the
+    bare-basename case.  Returning the resolved path here mirrors
+    ``_release_common.resolve_cover_image_path`` so the assembler and
+    inputs cannot disagree.
+    """
+
+    candidates = [
+        release_dir / "kaggle" / image_name,
+        release_dir / image_name,
+    ]
+    for candidate in candidates:
+        if candidate.is_file():
+            return candidate
+    return candidates[0]  # surface the missing-file error against the canonical location
+
+
+def run_preview(config: PreviewConfig) -> PreviewOutcome:
+    """Render the preview HTML, optionally serve it.
+
+    Pre-flight failures (missing metadata, malformed JSON, missing
+    cover image) raise — the CLI converts to rc=2.  Validation
+    discipline mirrors the Phase 5 packagers: build → validate → write.
+    """
+
+    metadata_path = config.release_dir / "kaggle" / "dataset-metadata.json"
+    if not metadata_path.is_file():
+        raise FileNotFoundError(
+            f"Kaggle dataset metadata not found at {metadata_path}; "
+            f"regenerate via scripts/package_kaggle_release.py first"
+        )
+    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
+    if not isinstance(metadata, dict):
+        raise ValueError(f"{metadata_path} is not a JSON object")
+
+    cover_name = metadata.get("image", "")
+    if not cover_name:
+        raise ValueError(f"{metadata_path} declares no 'image' (cover image filename)")
+    cover_src = _resolve_cover_image(config.release_dir, cover_name)
+    if not cover_src.is_file():
+        raise FileNotFoundError(f"cover image declared as {cover_name!r} not found at {cover_src}")
+
+    config.out_dir.mkdir(parents=True, exist_ok=True)
+    html_path = config.out_dir / "index.html"
+    html_path.write_text(render_kaggle_html(metadata, cover_name), encoding="utf-8")
+
+    cover_dst = config.out_dir / cover_name
+    replace_file(cover_src, cover_dst)
+
+    if config.serve:
+        _serve(config.out_dir, config.port, open_browser=config.open_browser)
+
+    return PreviewOutcome(html_path=html_path, cover_path=cover_dst)
+
+
+def _serve(directory: Path, port: int, *, open_browser: bool) -> None:
+    """Start a stdlib HTTP server rooted at ``directory`` and block.
+
+    Uses ``socketserver.ThreadingTCPServer`` so the browser can fetch
+    the cover image alongside the HTML without serialising requests.
+    Blocks on ``serve_forever()``; KeyboardInterrupt (Ctrl-C) is the
+    documented exit path.  No coverage here — tests exercise the
+    pure renderer and ``--no-serve`` path; serving is glue that
+    requires a live socket.
+    """
+
+    handler_factory = _make_handler_factory(directory)
+    url = f"http://localhost:{port}/"
+    print(f"serving {directory} at {url} — Ctrl-C to stop", file=sys.stderr)
+    with socketserver.ThreadingTCPServer(("", port), handler_factory) as httpd:
+        if open_browser:
+            webbrowser.open(url)  # socket is already bound and listening here
+        httpd.serve_forever()
+
+
+def _make_handler_factory(directory: Path) -> type[http.server.SimpleHTTPRequestHandler]:
+    """Build a handler subclass that serves from ``directory``.
+
+    ``SimpleHTTPRequestHandler`` ships a ``directory=`` kwarg in
+    Python 3.7+, but threading the path through ``socketserver``'s
+    ``RequestHandlerClass`` requires either a partial or a subclass.
+    Subclassing keeps the import surface stdlib-only.
+    """
+
+    resolved = str(directory.resolve())
+
+    class _Handler(http.server.SimpleHTTPRequestHandler):
+        def __init__(self, *args: Any, **kwargs: Any) -> None:
+            super().__init__(*args, directory=resolved, **kwargs)
+
+    return _Handler
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+
+def parse_args(argv: Sequence[str] | None = None) -> argparse.Namespace:
+    """Parse the CLI.  Free function so tests can build a Namespace."""
+
+    parser = argparse.ArgumentParser(
+        prog="preview_kaggle_page",
+        description=__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    parser.add_argument(
+        "--release-dir",
+        type=Path,
+        default=DEFAULT_RELEASE_DIR,
+        help="release tree containing kaggle/dataset-metadata.json (default: %(default)s)",
+    )
+    parser.add_argument(
+        "--out-dir",
+        type=Path,
+        default=DEFAULT_OUT_DIR,
+        help="where to write the rendered preview (default: %(default)s)",
+    )
+    parser.add_argument(
+        "--port",
+        type=int,
+        default=DEFAULT_PORT,
+        help="port for the local HTTP server (default: %(default)s)",
+    )
+    parser.add_argument(
+        "--open-browser",
+        action="store_true",
+        help="pop a browser tab on the served URL after the page renders",
+    )
+    parser.add_argument(
+        "--no-serve",
+        action="store_true",
+        help="render the HTML and exit; don't start the server (CI / inspection mode)",
+    )
+    return parser.parse_args(argv)
+
+
+def main(argv: Sequence[str] | None = None) -> int:
+    args = parse_args(argv)
+    config = PreviewConfig(
+        release_dir=args.release_dir,
+        out_dir=args.out_dir,
+        port=args.port,
+        open_browser=args.open_browser,
+        serve=not args.no_serve,
+    )
+    try:
+        outcome = run_preview(config)
+    except FileNotFoundError as exc:
+        print(f"error: {exc}", file=sys.stderr)
+        return 2
+    except ValueError as exc:
+        print(f"error: {exc}", file=sys.stderr)
+        return 2
+    print(f"wrote {outcome.html_path}", file=sys.stderr)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/tests/scripts/test_preview_hf_page.py b/tests/scripts/test_preview_hf_page.py
new file mode 100644
index 0000000..468ca1c
--- /dev/null
+++ b/tests/scripts/test_preview_hf_page.py
@@ -0,0 +1,444 @@
+"""Tests for ``scripts/preview_hf_page.py`` (PR 7.2).
+
+Locks the local Hugging Face preview-page contract:
+
+* required field labels appear in the rendered HTML (pretty_name,
+  licence, configs, tags) — the four roadmap-mandated HF checks;
+* every Markdown link in the README body resolves to a non-404 URL
+  pattern (no ``](../`` survives, no ``](validation/...)``);
+* every ``configs[]`` block in the YAML round-trips through to the
+  rendered configs dropdown;
+* the renderer is byte-deterministic and the committed samples at
+  ``release/_preview_committed/huggingface_{public,instructor}.html``
+  match a fresh regeneration (audit-artefact-sync gate);
+* the ``--variant`` flag wires up the right input README, output
+  dir, and footer label;
+* the driver exits with rc=2 on missing artefacts (no live HTTP).
+
+No network. No live HTTP.
+"""
+
+from __future__ import annotations
+
+import importlib.util
+import re
+import sys
+from pathlib import Path
+
+import pytest
+
+_REPO_ROOT = Path(__file__).resolve().parents[2]
+_SCRIPT_PATH = _REPO_ROOT / "scripts" / "preview_hf_page.py"
+_spec = importlib.util.spec_from_file_location("preview_hf_page", _SCRIPT_PATH)
+assert _spec is not None
+assert _spec.loader is not None
+preview = importlib.util.module_from_spec(_spec)
+sys.modules["preview_hf_page"] = preview
+_spec.loader.exec_module(preview)
+
+
+_RELEASE_DIR = _REPO_ROOT / "release"
+_PUBLIC_README = _RELEASE_DIR / "huggingface" / "README.md"
+_INSTRUCTOR_README = _RELEASE_DIR / "huggingface-instructor" / "README.md"
+_PUBLIC_SAMPLE = _REPO_ROOT / "release" / "_preview_committed" / "huggingface_public.html"
+_INSTRUCTOR_SAMPLE = _REPO_ROOT / "release" / "_preview_committed" / "huggingface_instructor.html"
+_PUBLIC_PRESENT = _PUBLIC_README.exists()
+_INSTRUCTOR_PRESENT = _INSTRUCTOR_README.exists()
+
+# Same allow-list rule as the Kaggle preview tests — see
+# ``test_preview_kaggle_page.py`` for rationale.
+_LINK_OK_PREFIXES = (
+    "https://github.com/leadforge-dev/leadforge",
+    "https://huggingface.co/datasets/leadforge",
+    "https://example.com",
+    "LICENSE",
+    "#",
+)
+
+
+# ---------------------------------------------------------------------------
+# Frontmatter parsing
+# ---------------------------------------------------------------------------
+
+
+def test_parse_hf_readme_extracts_yaml_and_body() -> None:
+    text = "---\npretty_name: Test\nlicense: mit\n---\n# Body\n\nText.\n"
+    doc = preview.parse_hf_readme(text)
+    assert doc.frontmatter == {"pretty_name": "Test", "license": "mit"}
+    assert doc.body == "# Body\n\nText.\n"
+
+
+def test_parse_hf_readme_rejects_missing_frontmatter() -> None:
+    with pytest.raises(ValueError, match="missing a YAML frontmatter"):
+        preview.parse_hf_readme("# No frontmatter here\n")
+
+
+def test_parse_hf_readme_rejects_non_mapping_frontmatter() -> None:
+    with pytest.raises(ValueError, match="not a YAML mapping"):
+        preview.parse_hf_readme("---\n- 1\n- 2\n---\nbody\n")
+
+
+# ---------------------------------------------------------------------------
+# Pure-renderer fixtures
+# ---------------------------------------------------------------------------
+
+
+def _minimal_doc() -> preview.HuggingFaceDoc:
+    """A minimum-viable HF doc exercising every renderer branch."""
+
+    return preview.HuggingFaceDoc(
+        frontmatter={
+            "pretty_name": "TestSet: Mock HF Dataset",
+            "license": "mit",
+            "language": ["en"],
+            "task_categories": ["tabular-classification"],
+            "size_categories": ["1K<n<10K"],
+            "tags": ["b2b", "tabular"],
+            "configs": [
+                {
+                    "config_name": "intro",
+                    "data_files": [
+                        {"split": "train", "path": "intro/train.parquet"},
+                        {"split": "validation", "path": "intro/valid.parquet"},
+                        {"split": "test", "path": "intro/test.parquet"},
+                    ],
+                },
+                {
+                    "config_name": "intermediate",
+                    "default": True,
+                    "data_files": [
+                        {"split": "train", "path": "intermediate/train.parquet"},
+                    ],
+                },
+            ],
+        },
+        body="# Mock\n\nA [link](https://github.com/leadforge-dev/leadforge).\n",
+    )
+
+
+# ---------------------------------------------------------------------------
+# Required field labels (the four roadmap-mandated HF checks)
+# ---------------------------------------------------------------------------
+
+
+def test_render_includes_pretty_name_and_license() -> None:
+    html = preview.render_hf_html(_minimal_doc(), variant="public")
+    assert "TestSet: Mock HF Dataset" in html
+    assert "License: mit" in html
+    assert "Task: tabular-classification" in html
+    assert "Size: 1K&lt;n&lt;10K" in html  # HTML-escaped
+    assert "Language: en" in html
+
+
+def test_render_emits_one_chip_per_tag() -> None:
+    html = preview.render_hf_html(_minimal_doc(), variant="public")
+    assert '<span class="chip">b2b</span>' in html
+    assert '<span class="chip">tabular</span>' in html
+
+
+def test_render_configs_dropdown_lists_every_config() -> None:
+    """The roadmap-mandated round-trip: every configs[] block from the
+    YAML appears in the rendered dropdown."""
+
+    html = preview.render_hf_html(_minimal_doc(), variant="public")
+    assert "<code>intro</code>" in html
+    assert "<code>intermediate</code>" in html
+    assert "(2 configs)" in html
+
+
+def test_render_configs_flags_the_default() -> None:
+    html = preview.render_hf_html(_minimal_doc(), variant="public")
+    # The default badge appears next to the default config.
+    assert '<span class="badge badge--default">default</span>' in html
+    # Exactly one badge instance — no other config gets it.
+    assert html.count("badge badge--default") == 1
+
+
+def test_render_data_files_appear_under_each_config() -> None:
+    html = preview.render_hf_html(_minimal_doc(), variant="public")
+    assert "intro/train.parquet" in html
+    assert "intro/valid.parquet" in html
+    assert "intro/test.parquet" in html
+    assert "intermediate/train.parquet" in html
+
+
+def test_render_includes_variant_in_footer() -> None:
+    public = preview.render_hf_html(_minimal_doc(), variant="public")
+    instructor = preview.render_hf_html(_minimal_doc(), variant="instructor")
+    assert "Variant: <code>public</code>" in public
+    assert "Variant: <code>instructor</code>" in instructor
+    # Variant differences are localised to the footer + file-tree
+    # heading; the rest of the output is identical.
+    public_no_variant = public.replace("public", "VARIANT")
+    instructor_no_variant = instructor.replace("instructor", "VARIANT")
+    assert public_no_variant == instructor_no_variant
+
+
+def test_render_handles_no_configs_gracefully() -> None:
+    """Edge case: a malformed dataset card with no ``configs`` should
+    still render rather than crash."""
+
+    doc = preview.HuggingFaceDoc(
+        frontmatter={"pretty_name": "X", "license": "mit"},
+        body="body\n",
+    )
+    html = preview.render_hf_html(doc, variant="public")
+    assert "No configs declared." in html
+
+
+def test_render_escapes_html_in_field_values() -> None:
+    """Same XSS-safety guard as the Kaggle preview."""
+
+    doc = preview.HuggingFaceDoc(
+        frontmatter={"pretty_name": "<script>x</script>", "license": "mit"},
+        body="body\n",
+    )
+    html = preview.render_hf_html(doc, variant="public")
+    assert "<script>x</script>" not in html
+    assert "&lt;script&gt;x&lt;/script&gt;" in html
+
+
+# ---------------------------------------------------------------------------
+# Markdown link resolution (the leakage / link-rewrite regression guard)
+# ---------------------------------------------------------------------------
+
+_HREF_RE = re.compile(r'href="([^"]+)"')
+
+
+@pytest.mark.skipif(not _PUBLIC_PRESENT, reason="public README not present")
+def test_public_readme_has_no_unrewritten_relative_links() -> None:
+    """Same source-side regression guard as the Kaggle preview."""
+
+    body = _PUBLIC_README.read_text(encoding="utf-8")
+    assert "](../" not in body, "unrewritten parent-relative link in public README"
+    assert "](validation/" not in body, "unrewritten validation-relative link in public README"
+
+
+@pytest.mark.skipif(not _PUBLIC_PRESENT, reason="public README not present")
+def test_public_rendered_links_point_at_known_targets() -> None:
+    """Every rendered href in the public preview points at one of the
+    allow-listed prefixes — anything else would 404 on the live HF
+    page."""
+
+    doc = preview.parse_hf_readme(_PUBLIC_README.read_text(encoding="utf-8"))
+    html = preview.render_hf_html(doc, variant="public")
+    bad: list[str] = []
+    for href in _HREF_RE.findall(html):
+        if any(href.startswith(prefix) for prefix in _LINK_OK_PREFIXES):
+            continue
+        bad.append(href)
+    assert not bad, f"non-allowlisted hrefs would 404 on HF: {bad[:5]}"
+
+
+@pytest.mark.skipif(not _INSTRUCTOR_PRESENT, reason="instructor README not present")
+def test_instructor_rendered_links_point_at_known_targets() -> None:
+    doc = preview.parse_hf_readme(_INSTRUCTOR_README.read_text(encoding="utf-8"))
+    html = preview.render_hf_html(doc, variant="instructor")
+    bad: list[str] = []
+    for href in _HREF_RE.findall(html):
+        if any(href.startswith(prefix) for prefix in _LINK_OK_PREFIXES):
+            continue
+        bad.append(href)
+    assert not bad, f"non-allowlisted hrefs would 404 on HF: {bad[:5]}"
+
+
+@pytest.mark.skipif(not _PUBLIC_PRESENT, reason="public README not present")
+def test_public_yaml_configs_round_trip_into_html() -> None:
+    """Every ``configs[].config_name`` declared in the YAML appears in
+    the rendered HTML — the round-trip the roadmap mandates."""
+
+    doc = preview.parse_hf_readme(_PUBLIC_README.read_text(encoding="utf-8"))
+    html = preview.render_hf_html(doc, variant="public")
+    for config in doc.frontmatter["configs"]:
+        name = config["config_name"]
+        assert f"<code>{name}</code>" in html, (
+            f"config {name!r} declared in YAML but missing from rendered HTML"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Determinism + audit-artefact-sync (against committed samples)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _PUBLIC_PRESENT, reason="public README not present")
+def test_render_is_byte_deterministic() -> None:
+    doc = preview.parse_hf_readme(_PUBLIC_README.read_text(encoding="utf-8"))
+    a = preview.render_hf_html(doc, variant="public")
+    b = preview.render_hf_html(doc, variant="public")
+    assert a == b
+
+
+@pytest.mark.skipif(
+    not (_PUBLIC_PRESENT and _PUBLIC_SAMPLE.exists()),
+    reason="public README or committed sample missing",
+)
+def test_committed_public_sample_matches_fresh_regeneration() -> None:
+    """Audit-sync gate for the public variant.
+
+    Regenerate via::
+
+        python scripts/preview_hf_page.py --no-serve
+        cp release/_preview/huggingface/index.html \\
+            release/_preview_committed/huggingface_public.html
+    """
+
+    doc = preview.parse_hf_readme(_PUBLIC_README.read_text(encoding="utf-8"))
+    fresh = preview.render_hf_html(doc, variant="public")
+    committed = _PUBLIC_SAMPLE.read_text(encoding="utf-8")
+    assert fresh == committed
+
+
+@pytest.mark.skipif(
+    not (_INSTRUCTOR_PRESENT and _INSTRUCTOR_SAMPLE.exists()),
+    reason="instructor README or committed sample missing",
+)
+def test_committed_instructor_sample_matches_fresh_regeneration() -> None:
+    """Audit-sync gate for the instructor variant."""
+
+    doc = preview.parse_hf_readme(_INSTRUCTOR_README.read_text(encoding="utf-8"))
+    fresh = preview.render_hf_html(doc, variant="instructor")
+    committed = _INSTRUCTOR_SAMPLE.read_text(encoding="utf-8")
+    assert fresh == committed
+
+
+# ---------------------------------------------------------------------------
+# Driver — pre-flight error paths (no server start)
+# ---------------------------------------------------------------------------
+
+
+def _make_config(release_dir: Path, out_dir: Path, *, variant: str = "public") -> object:
+    return preview.PreviewConfig(
+        release_dir=release_dir,
+        out_dir=out_dir,
+        port=8766,
+        variant=variant,
+        open_browser=False,
+        serve=False,
+    )
+
+
+def test_run_preview_raises_on_unknown_variant(tmp_path: Path) -> None:
+    fake_release = tmp_path / "release"
+    fake_release.mkdir()
+    config = _make_config(fake_release, tmp_path / "preview", variant="bogus")
+    with pytest.raises(ValueError, match="unknown --variant"):
+        preview.run_preview(config)  # type: ignore[arg-type]
+
+
+def test_run_preview_raises_on_missing_readme(tmp_path: Path) -> None:
+    fake_release = tmp_path / "release"
+    fake_release.mkdir()
+    config = _make_config(fake_release, tmp_path / "preview")
+    with pytest.raises(FileNotFoundError, match="HF README not found"):
+        preview.run_preview(config)  # type: ignore[arg-type]
+
+
+def test_run_preview_raises_on_malformed_readme(tmp_path: Path) -> None:
+    fake_release = tmp_path / "release"
+    (fake_release / "huggingface").mkdir(parents=True)
+    (fake_release / "huggingface" / "README.md").write_text("# No frontmatter\n", encoding="utf-8")
+    config = _make_config(fake_release, tmp_path / "preview")
+    with pytest.raises(ValueError, match="missing a YAML frontmatter"):
+        preview.run_preview(config)  # type: ignore[arg-type]
+
+
+def test_run_preview_raises_on_missing_cover(tmp_path: Path) -> None:
+    fake_release = tmp_path / "release"
+    (fake_release / "huggingface").mkdir(parents=True)
+    (fake_release / "huggingface" / "README.md").write_text(
+        "---\npretty_name: T\nlicense: mit\n---\nbody\n", encoding="utf-8"
+    )
+    config = _make_config(fake_release, tmp_path / "preview")
+    with pytest.raises(FileNotFoundError, match="cover image"):
+        preview.run_preview(config)  # type: ignore[arg-type]
+
+
+def test_run_preview_writes_html_and_copies_cover(tmp_path: Path) -> None:
+    """End-to-end no-serve: HTML lands at out_dir/index.html and the
+    cover image is copied as a real file."""
+
+    fake_release = tmp_path / "release"
+    (fake_release / "huggingface").mkdir(parents=True)
+    (fake_release / "huggingface" / "README.md").write_text(
+        "---\npretty_name: T\nlicense: mit\n---\nbody\n", encoding="utf-8"
+    )
+    cover = fake_release / "huggingface" / "dataset-cover-image.png"
+    cover.write_bytes(b"\x89PNG\r\n\x1a\nfake")
+    out_dir = tmp_path / "preview"
+    outcome = preview.run_preview(_make_config(fake_release, out_dir))  # type: ignore[arg-type]
+    assert outcome.html_path == out_dir / "index.html"
+    assert outcome.html_path.is_file()
+    assert outcome.cover_path is not None
+    assert outcome.cover_path.is_file()
+    assert not outcome.cover_path.is_symlink()
+
+
+def test_run_preview_instructor_variant_uses_companion_paths(tmp_path: Path) -> None:
+    """``--variant=instructor`` reads the companion README and writes
+    to the companion-flavoured out_dir."""
+
+    fake_release = tmp_path / "release"
+    (fake_release / "huggingface-instructor").mkdir(parents=True)
+    (fake_release / "huggingface-instructor" / "README.md").write_text(
+        "---\npretty_name: I\nlicense: mit\n---\nbody\n", encoding="utf-8"
+    )
+    cover = fake_release / "huggingface-instructor" / "dataset-cover-image.png"
+    cover.write_bytes(b"\x89PNG\r\n\x1a\nfake")
+    out_dir = tmp_path / "preview-instructor"
+    outcome = preview.run_preview(
+        _make_config(fake_release, out_dir, variant="instructor")  # type: ignore[arg-type]
+    )
+    assert outcome.html_path.is_file()
+    assert "Variant: <code>instructor</code>" in outcome.html_path.read_text(encoding="utf-8")
+
+
+def test_main_returns_2_on_missing_release(
+    tmp_path: Path, capsys: pytest.CaptureFixture[str]
+) -> None:
+    rc = preview.main(
+        [
+            "--release-dir",
+            str(tmp_path / "missing"),
+            "--out-dir",
+            str(tmp_path / "preview"),
+            "--no-serve",
+        ]
+    )
+    assert rc == 2
+    captured = capsys.readouterr()
+    assert "HF README not found" in captured.err
+
+
+def test_main_default_out_dir_depends_on_variant(tmp_path: Path) -> None:
+    """``--out-dir`` defaults to the variant-flavoured location."""
+
+    args_public = preview.parse_args(["--no-serve"])
+    args_instructor = preview.parse_args(["--no-serve", "--variant=instructor"])
+    assert args_public.out_dir is None  # resolved in main()
+    assert args_instructor.out_dir is None
+    # Sanity: ``main`` resolves the default per variant.
+    rc = preview.main(
+        [
+            "--release-dir",
+            str(tmp_path / "missing"),
+            "--variant=instructor",
+            "--no-serve",
+        ]
+    )
+    assert rc == 2  # missing README; we just want to confirm CLI parsing didn't crash
+
+
+def test_parse_args_defaults() -> None:
+    args = preview.parse_args(["--no-serve"])
+    assert args.release_dir == preview.DEFAULT_RELEASE_DIR
+    assert args.out_dir is None  # variant-resolved in main()
+    assert args.port == preview.DEFAULT_PORT
+    assert args.variant == "public"
+    assert args.open_browser is False
+    assert args.no_serve is True
+
+
+def test_parse_args_rejects_unknown_variant() -> None:
+    with pytest.raises(SystemExit):
+        preview.parse_args(["--variant=bogus"])
diff --git a/tests/scripts/test_preview_kaggle_page.py b/tests/scripts/test_preview_kaggle_page.py
new file mode 100644
index 0000000..b654a5c
--- /dev/null
+++ b/tests/scripts/test_preview_kaggle_page.py
@@ -0,0 +1,416 @@
+"""Tests for ``scripts/preview_kaggle_page.py`` (PR 7.2).
+
+Locks the local Kaggle preview-page contract:
+
+* required field labels appear in the rendered HTML (title, subtitle,
+  licence, file count, schema column count), one of the four
+  roadmap-mandated Kaggle checks;
+* every Markdown link in the inlined description resolves to a
+  non-404 URL pattern (no ``](../`` survives the rewrite, no
+  ``](validation/...)`` lives at a relative path on the upload tree);
+* the Kaggle schema table lists every CSV / parquet column declared
+  in ``dataset-metadata.json::resources[].schema.fields``;
+* the renderer is byte-deterministic and the committed sample at
+  ``release/_preview_committed/kaggle.html`` matches a fresh
+  regeneration (audit-artefact-sync gate, mirrors PR 5.1 / 5.2 / 7.1);
+* the driver exits with rc=2 on missing artefacts (no live HTTP).
+
+No network. No live HTTP. Everything goes through the pure
+``render_kaggle_html()`` or the in-process ``run_preview()`` driver.
+"""
+
+from __future__ import annotations
+
+import importlib.util
+import json
+import re
+import sys
+from pathlib import Path
+
+import pytest
+
+_REPO_ROOT = Path(__file__).resolve().parents[2]
+_SCRIPT_PATH = _REPO_ROOT / "scripts" / "preview_kaggle_page.py"
+_spec = importlib.util.spec_from_file_location("preview_kaggle_page", _SCRIPT_PATH)
+assert _spec is not None
+assert _spec.loader is not None
+preview = importlib.util.module_from_spec(_spec)
+sys.modules["preview_kaggle_page"] = preview
+_spec.loader.exec_module(preview)
+
+
+_RELEASE_DIR = _REPO_ROOT / "release"
+_COMMITTED_METADATA = _RELEASE_DIR / "kaggle" / "dataset-metadata.json"
+_COMMITTED_COVER = _RELEASE_DIR / "dataset-cover-image.png"
+_COMMITTED_SAMPLE = _REPO_ROOT / "release" / "_preview_committed" / "kaggle.html"
+_RELEASE_PRESENT = _COMMITTED_METADATA.exists()
+
+# Allow-listed link prefixes the rendered-link tests accept.  Anything
+# else in the rendered description is a regression: either the source
+# README leaked a relative ``../`` link or the GitHub blob rewrite
+# stopped firing.  The allow-list is intentionally narrow.
+_LINK_OK_PREFIXES = (
+    "https://github.com/leadforge-dev/leadforge",
+    "https://huggingface.co/datasets/leadforge",
+    "https://example.com",  # used by unit tests only
+    "LICENSE",  # sibling-relative, resolves under the upload tree
+    "#",  # in-document anchor (footnotes, etc.)
+)
+
+
+# ---------------------------------------------------------------------------
+# Pure-renderer fixtures
+# ---------------------------------------------------------------------------
+
+
+def _minimal_metadata() -> dict[str, object]:
+    """A minimum-viable metadata payload exercising every renderer
+    branch (header pills, file tree, schema table, sources, footer)."""
+
+    return {
+        "title": "TestSet: Lead Scoring Mock",
+        "id": "testorg/testset-lead-scoring",
+        "subtitle": "A mock metadata payload exercising the renderer.",
+        "description": (
+            "# Mock dataset\n\n"
+            "This is a [test link](https://github.com/leadforge-dev/leadforge).\n\n"
+            "| Col | Notes |\n|---|---|\n| a | b |\n"
+        ),
+        "isPrivate": True,
+        "licenses": [{"name": "MIT"}],
+        "keywords": ["b2b", "tabular"],
+        "collaborators": [],
+        "expectedUpdateFrequency": "never",
+        "userSpecifiedSources": [
+            {"title": "source repo", "url": "https://github.com/leadforge-dev/leadforge"},
+        ],
+        "image": "dataset-cover-image.png",
+        "resources": [
+            {
+                "path": "intro/lead_scoring.csv",
+                "description": "Intro flat CSV.",
+                "schema": {
+                    "fields": [
+                        {"name": "lead_id", "type": "string", "description": "Opaque id."},
+                        {"name": "label", "type": "boolean", "description": "Outcome."},
+                    ]
+                },
+            },
+            {
+                "path": "intro/manifest.json",
+                "description": "Provenance manifest (no schema).",
+            },
+        ],
+    }
+
+
+# ---------------------------------------------------------------------------
+# Required field labels (one of the four roadmap-mandated Kaggle checks)
+# ---------------------------------------------------------------------------
+
+
+def test_render_includes_title_subtitle_id_and_license() -> None:
+    html = preview.render_kaggle_html(_minimal_metadata(), "dataset-cover-image.png")
+    assert "TestSet: Lead Scoring Mock" in html
+    assert "A mock metadata payload exercising the renderer." in html
+    assert "testorg/testset-lead-scoring" in html
+    assert "License: MIT" in html
+    assert "Updates: never" in html
+    assert "Visibility: Private" in html
+
+
+def test_render_includes_visibility_public_when_not_private() -> None:
+    metadata = {**_minimal_metadata(), "isPrivate": False}
+    html = preview.render_kaggle_html(metadata, "dataset-cover-image.png")
+    assert "Visibility: Public" in html
+
+
+def test_render_file_tree_lists_every_resource_path() -> None:
+    """File tree shows every resource path declared in metadata."""
+
+    html = preview.render_kaggle_html(_minimal_metadata(), "dataset-cover-image.png")
+    assert "intro/lead_scoring.csv" in html
+    assert "intro/manifest.json" in html
+    assert "(2 total)" in html  # file count appears in the heading
+
+
+def test_render_schema_table_lists_every_column() -> None:
+    """The schema table lists every column from every tabular resource."""
+
+    html = preview.render_kaggle_html(_minimal_metadata(), "dataset-cover-image.png")
+    assert "<code>lead_id</code>" in html
+    assert "<code>label</code>" in html
+    assert "Opaque id." in html
+    assert "(2 columns)" in html  # per-table column count
+    # Resources without a schema (manifest.json) do not appear in the table.
+    assert "(2 columns across 1 tabular files)" in html
+
+
+def test_render_keywords_appear_as_chips_in_footer() -> None:
+    html = preview.render_kaggle_html(_minimal_metadata(), "dataset-cover-image.png")
+    assert '<span class="chip">b2b</span>' in html
+    assert '<span class="chip">tabular</span>' in html
+
+
+def test_render_sources_block_renders_when_present() -> None:
+    html = preview.render_kaggle_html(_minimal_metadata(), "dataset-cover-image.png")
+    assert "source repo" in html
+    assert 'href="https://github.com/leadforge-dev/leadforge"' in html
+
+
+def test_render_sources_block_omitted_when_empty() -> None:
+    metadata = {**_minimal_metadata(), "userSpecifiedSources": []}
+    html = preview.render_kaggle_html(metadata, "dataset-cover-image.png")
+    assert '<h2 class="section__heading">Sources</h2>' not in html
+
+
+def test_render_escapes_html_in_field_values() -> None:
+    """User-controlled strings are HTML-escaped — guards against XSS
+    if a recipe ever surfaces ``<script>`` in a description."""
+
+    metadata = {**_minimal_metadata(), "title": "evil <script>alert(1)</script>"}
+    html = preview.render_kaggle_html(metadata, "dataset-cover-image.png")
+    assert "<script>alert(1)</script>" not in html
+    assert "&lt;script&gt;" in html
+
+
+# ---------------------------------------------------------------------------
+# Schema-fields exhaustiveness (audit-style, against committed metadata)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _RELEASE_PRESENT, reason="release bundles not present")
+def test_committed_metadata_schema_is_fully_listed() -> None:
+    """The roadmap-mandated check: the Kaggle schema table lists every
+    CSV / parquet column declared in dataset-metadata.json."""
+
+    metadata = json.loads(_COMMITTED_METADATA.read_text(encoding="utf-8"))
+    html = preview.render_kaggle_html(metadata, metadata["image"])
+    for resource in metadata["resources"]:
+        schema = resource.get("schema")
+        if not schema:
+            continue
+        for field in schema["fields"]:
+            name = field["name"]
+            # Every column name appears as a ``<code>`` cell in the table.
+            assert f"<code>{name}</code>" in html, (
+                f"schema column {name!r} from {resource['path']!r} not rendered"
+            )
+
+
+# ---------------------------------------------------------------------------
+# Markdown link resolution (the leakage / link-rewrite regression guard)
+# ---------------------------------------------------------------------------
+
+#: Match ``href="X"`` in the rendered HTML (markdown-it-py emits
+#: double-quoted hrefs).  An unrendered ``](X)`` would slip past this
+#: regex and stay as literal text rather than a real link, so the
+#: source-form test below asserts against those separately.
+_HREF_RE = re.compile(r'href="([^"]+)"')
+
+
+@pytest.mark.skipif(not _RELEASE_PRESENT, reason="release bundles not present")
+def test_committed_metadata_description_has_no_unrewritten_relative_links() -> None:
+    """Source-side regression guard.
+
+    The Kaggle packager runs ``rewrite_release_links()`` on the
+    inlined README; if a future README adds a ``](../foo)`` link or a
+    ``](validation/...)`` link AND someone updates the rewriter to
+    miss it, the rendered description would carry a 404-bound href.
+    Catch it here, before the publish runbook.
+    """
+
+    metadata = json.loads(_COMMITTED_METADATA.read_text(encoding="utf-8"))
+    description = metadata["description"]
+    # Source-form check: no parent-relative or validation-relative
+    # markdown links remain in the inlined description.
+    assert "](../" not in description, (
+        "unrewritten parent-relative markdown link in inlined description"
+    )
+    assert "](validation/" not in description, (
+        "unrewritten validation-relative markdown link in inlined description"
+    )
+
+
+@pytest.mark.skipif(not _RELEASE_PRESENT, reason="release bundles not present")
+def test_committed_metadata_rendered_links_point_at_known_targets() -> None:
+    """Every rendered href in the description body points at one of:
+
+    * a GitHub blob URL (the rewriter's output);
+    * a known external service (huggingface.co/datasets/leadforge);
+    * a sibling-relative path that resolves under the upload tree
+      (LICENSE), or an in-document anchor (#footnote-1 etc.).
+
+    Anything else is a 404 risk on the live page.
+    """
+
+    metadata = json.loads(_COMMITTED_METADATA.read_text(encoding="utf-8"))
+    html = preview.render_kaggle_html(metadata, metadata["image"])
+    bad: list[str] = []
+    for href in _HREF_RE.findall(html):
+        if any(href.startswith(prefix) for prefix in _LINK_OK_PREFIXES):
+            continue
+        bad.append(href)
+    assert not bad, (
+        f"rendered HTML carries non-allowlisted hrefs that would 404 on Kaggle: {bad[:5]}"
+    )
+
+
+# ---------------------------------------------------------------------------
+# Determinism + audit-artefact-sync (against committed sample)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not _RELEASE_PRESENT, reason="release bundles not present")
+def test_render_is_byte_deterministic() -> None:
+    """Two back-to-back renders against the same metadata produce
+    byte-identical HTML — the determinism contract this script relies
+    on for the sync test below."""
+
+    metadata = json.loads(_COMMITTED_METADATA.read_text(encoding="utf-8"))
+    a = preview.render_kaggle_html(metadata, metadata["image"])
+    b = preview.render_kaggle_html(metadata, metadata["image"])
+    assert a == b
+
+
+@pytest.mark.skipif(
+    not (_RELEASE_PRESENT and _COMMITTED_SAMPLE.exists()),
+    reason="release bundles or committed preview sample missing",
+)
+def test_committed_sample_matches_fresh_regeneration() -> None:
+    """The audit-artefact-sync gate.
+
+    A fresh render of the committed Kaggle metadata must equal
+    ``release/_preview_committed/kaggle.html`` byte-for-byte.  If
+    this fails, either the renderer changed or the upstream metadata
+    drifted without re-running the preview script.  Regenerate via::
+
+        python scripts/preview_kaggle_page.py --no-serve
+        cp release/_preview/kaggle/index.html release/_preview_committed/kaggle.html
+    """
+
+    metadata = json.loads(_COMMITTED_METADATA.read_text(encoding="utf-8"))
+    fresh = preview.render_kaggle_html(metadata, metadata["image"])
+    committed = _COMMITTED_SAMPLE.read_text(encoding="utf-8")
+    assert fresh == committed
+
+
+# ---------------------------------------------------------------------------
+# Driver — pre-flight error paths (no server start)
+# ---------------------------------------------------------------------------
+
+
+def test_run_preview_raises_on_missing_metadata(tmp_path: Path) -> None:
+    fake_release = tmp_path / "release"
+    fake_release.mkdir()
+    config = preview.PreviewConfig(
+        release_dir=fake_release,
+        out_dir=tmp_path / "preview",
+        port=8765,
+        open_browser=False,
+        serve=False,
+    )
+    with pytest.raises(FileNotFoundError, match="dataset metadata not found"):
+        preview.run_preview(config)
+
+
+def test_run_preview_raises_on_malformed_metadata(tmp_path: Path) -> None:
+    fake_release = tmp_path / "release"
+    (fake_release / "kaggle").mkdir(parents=True)
+    (fake_release / "kaggle" / "dataset-metadata.json").write_text(
+        '"not-an-object"', encoding="utf-8"
+    )
+    config = preview.PreviewConfig(
+        release_dir=fake_release,
+        out_dir=tmp_path / "preview",
+        port=8765,
+        open_browser=False,
+        serve=False,
+    )
+    with pytest.raises(ValueError, match="not a JSON object"):
+        preview.run_preview(config)
+
+
+def test_run_preview_raises_on_missing_cover_image(tmp_path: Path) -> None:
+    fake_release = tmp_path / "release"
+    (fake_release / "kaggle").mkdir(parents=True)
+    (fake_release / "kaggle" / "dataset-metadata.json").write_text(
+        json.dumps({"image": "missing.png", "resources": []}), encoding="utf-8"
+    )
+    config = preview.PreviewConfig(
+        release_dir=fake_release,
+        out_dir=tmp_path / "preview",
+        port=8765,
+        open_browser=False,
+        serve=False,
+    )
+    with pytest.raises(FileNotFoundError, match="cover image"):
+        preview.run_preview(config)
+
+
+def test_run_preview_writes_html_and_copies_cover(tmp_path: Path) -> None:
+    """End-to-end no-serve path: HTML lands at ``out_dir/index.html``;
+    cover image is copied as a real file (not a symlink)."""
+
+    fake_release = tmp_path / "release"
+    (fake_release / "kaggle").mkdir(parents=True)
+    cover_src = fake_release / "kaggle" / "dataset-cover-image.png"
+    cover_src.write_bytes(b"\x89PNG\r\n\x1a\nfake")
+    (fake_release / "kaggle" / "dataset-metadata.json").write_text(
+        json.dumps(_minimal_metadata()), encoding="utf-8"
+    )
+    out_dir = tmp_path / "preview"
+    outcome = preview.run_preview(
+        preview.PreviewConfig(
+            release_dir=fake_release,
+            out_dir=out_dir,
+            port=8765,
+            open_browser=False,
+            serve=False,
+        )
+    )
+    assert outcome.html_path == out_dir / "index.html"
+    assert outcome.html_path.is_file()
+    assert outcome.cover_path is not None
+    assert outcome.cover_path.is_file()
+    assert not outcome.cover_path.is_symlink()
+    # The HTML references the cover image by sibling-relative name.
+    assert 'src="dataset-cover-image.png"' in outcome.html_path.read_text(encoding="utf-8")
+
+
+def test_main_returns_2_on_missing_release(
+    tmp_path: Path, capsys: pytest.CaptureFixture[str]
+) -> None:
+    rc = preview.main(
+        [
+            "--release-dir",
+            str(tmp_path / "missing"),
+            "--out-dir",
+            str(tmp_path / "preview"),
+            "--no-serve",
+        ]
+    )
+    assert rc == 2
+    captured = capsys.readouterr()
+    assert "dataset metadata not found" in captured.err
+
+
+def test_parse_args_defaults() -> None:
+    """``parse_args`` is a free function so tests can exercise the
+    flag wiring without invoking the full driver."""
+
+    args = preview.parse_args(["--no-serve"])
+    assert args.release_dir == preview.DEFAULT_RELEASE_DIR
+    assert args.out_dir == preview.DEFAULT_OUT_DIR
+    assert args.port == preview.DEFAULT_PORT
+    assert args.open_browser is False
+    assert args.no_serve is True
+
+
+def test_tier_of_extracts_leading_path_segment() -> None:
+    """``_tier_of`` is the load-bearing helper that buckets resources
+    by tier in the file tree — pin its contract."""
+
+    assert preview._tier_of("intro/lead_scoring.csv") == "intro"
+    assert preview._tier_of("intermediate/tasks/converted/train.parquet") == "intermediate"
+    assert preview._tier_of("toplevel.json") == ""

From cd36121b6e09c183697b9488281e4518113ce4ce Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Sat, 9 May 2026 10:03:12 +0300
Subject: [PATCH 3/6] PR 7.2 self-review pass 1: socket reuse, dead code, doc
 accuracy
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three findings folded back from the hostile-reviewer first pass:

- BUG: switch from socketserver.ThreadingTCPServer to
  http.server.ThreadingHTTPServer in both preview scripts. The
  former defaults to allow_reuse_address=False, so Ctrl-C → re-run
  within ~60s would raise OSError "Address already in use"
  (EADDRINUSE: errno 48 on macOS, 98 on Linux) while the socket sat
  in TIME_WAIT. ThreadingHTTPServer inherits allow_reuse_address=True
  from HTTPServer.
- DEAD CODE: drop COMMITTED_SAMPLE_PATH (Kaggle) and
  _VARIANT_SAMPLE_PATH (HF) — defined as module-level constants but
  never read at runtime; tests use their own _REPO_ROOT-rooted
  paths. Also drop the now-unused socketserver import.
- DOC LIE (minor): _resolve_cover_image docstring in the Kaggle
  script claimed "we prefer the kaggle-tree copy" without
  acknowledging that release/kaggle/dataset-cover-image.png is
  gitignored on a fresh checkout. Reworded to call out the lookup
  order and gitignore reality.

Rendered HTML output unchanged; all 48 preview tests still pass;
audit-sync samples remain byte-identical.
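
The socket-reuse finding can be checked without opening a socket. A
minimal sketch, assuming Python 3.9+; reuse_flags is a hypothetical
helper written for this note, not part of either preview script:

```python
import http.server
import socketserver


def reuse_flags() -> dict[str, bool]:
    """Report each server class's ``allow_reuse_address`` default.

    Only class attributes are read; nothing is bound or listened on.
    """
    return {
        "ThreadingTCPServer": bool(
            socketserver.ThreadingTCPServer.allow_reuse_address
        ),
        "ThreadingHTTPServer": bool(
            http.server.ThreadingHTTPServer.allow_reuse_address
        ),
    }


if __name__ == "__main__":
    for name, flag in reuse_flags().items():
        print(f"{name}: allow_reuse_address={flag}")
```

The flag is falsy on ThreadingTCPServer and truthy on
ThreadingHTTPServer, which is why the swap lets a re-run rebind the
port while the old socket is still in TIME_WAIT.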

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 scripts/preview_hf_page.py     | 13 ++++---------
 scripts/preview_kaggle_page.py | 31 +++++++++++++++----------------
 2 files changed, 19 insertions(+), 25 deletions(-)

diff --git a/scripts/preview_hf_page.py b/scripts/preview_hf_page.py
index 51dcebd..e4c6ecc 100644
--- a/scripts/preview_hf_page.py
+++ b/scripts/preview_hf_page.py
@@ -38,7 +38,6 @@
 import argparse
 import http.server
 import re
-import socketserver
 import sys
 import webbrowser
 from collections.abc import Sequence
@@ -62,16 +61,11 @@
 DEFAULT_OUT_DIR_INSTRUCTOR: Final[Path] = Path("release/_preview/huggingface-instructor")
 DEFAULT_PORT: Final[int] = 8766
 
-#: Per-variant relative paths to the README (under ``release_dir``)
-#: and the committed sample HTML (under ``release/_preview_committed/``).
+#: Per-variant relative path to the README (under ``release_dir``).
 _VARIANT_README_REL: Final[dict[str, Path]] = {
     "public": Path("huggingface/README.md"),
     "instructor": Path("huggingface-instructor/README.md"),
 }
-_VARIANT_SAMPLE_PATH: Final[dict[str, Path]] = {
-    "public": Path("release/_preview_committed/huggingface_public.html"),
-    "instructor": Path("release/_preview_committed/huggingface_instructor.html"),
-}
 VALID_VARIANTS: Final[tuple[str, ...]] = ("public", "instructor")
 
 
@@ -467,7 +461,8 @@ def run_preview(config: PreviewConfig) -> PreviewOutcome:
 def _serve(directory: Path, port: int, *, open_browser: bool) -> None:
     """Start a stdlib HTTP server rooted at ``directory`` and block.
 
-    Same posture as the Kaggle preview — see that module for rationale.
+    Same posture as the Kaggle preview — see that module for the
+    ``allow_reuse_address`` rationale.
     """
 
     handler_factory = _make_handler_factory(directory)
@@ -475,7 +470,7 @@ def _serve(directory: Path, port: int, *, open_browser: bool) -> None:
     print(f"serving {directory} at {url} — Ctrl-C to stop", file=sys.stderr)
     if open_browser:
         webbrowser.open(url)
-    with socketserver.ThreadingTCPServer(("", port), handler_factory) as httpd:
+    with http.server.ThreadingHTTPServer(("", port), handler_factory) as httpd:
         httpd.serve_forever()
 
 
diff --git a/scripts/preview_kaggle_page.py b/scripts/preview_kaggle_page.py
index ddc4b73..8e37011 100644
--- a/scripts/preview_kaggle_page.py
+++ b/scripts/preview_kaggle_page.py
@@ -39,7 +39,6 @@
 import http.server
 import json
 import re
-import socketserver
 import sys
 import webbrowser
 from collections.abc import Sequence
@@ -62,10 +61,6 @@
 DEFAULT_OUT_DIR: Final[Path] = Path("release/_preview/kaggle")
 DEFAULT_PORT: Final[int] = 8765
 
-#: The committed sample HTML used by the audit-artefact-sync test.
-#: Located outside ``release/_preview/`` so it is not gitignored.
-COMMITTED_SAMPLE_PATH: Final[Path] = Path("release/_preview_committed/kaggle.html")
-
 
 # ---------------------------------------------------------------------------
 # Markdown rendering (gated behind the [publish] extra)
@@ -443,14 +438,12 @@ class PreviewOutcome:
 def _resolve_cover_image(release_dir: Path, image_name: str) -> Path:
     """Locate the cover image referenced by the metadata's ``image``.
 
-    The Kaggle packager (PR 5.1) copies the cover image into
-    ``release/kaggle/`` next to ``dataset-metadata.json`` AND leaves
-    the master copy at ``release/dataset-cover-image.png``.  We
-    prefer the kaggle-tree copy (closer to the artefact the publish
-    PR will upload) and fall back to ``release_dir`` for the
-    bare-basename case.  Returning the resolved path here mirrors
-    ``_release_common.resolve_cover_image_path`` so the assembler and
-    inputs cannot disagree.
+    Lookup order: ``release/kaggle/<image_name>`` (assembled
+    upload-tree copy, present after the maintainer runs the Kaggle
+    packager — gitignored, so absent on a fresh checkout) →
+    ``release/<image_name>`` (the committed master copy).  Returning
+    the resolved path here mirrors ``_release_common.resolve_cover_image_path``
+    so the assembler and inputs cannot disagree.
     """
 
     candidates = [
@@ -504,9 +497,15 @@ def run_preview(config: PreviewConfig) -> PreviewOutcome:
 def _serve(directory: Path, port: int, *, open_browser: bool) -> None:
     """Start a stdlib HTTP server rooted at ``directory`` and block.
 
-    Uses ``ThreadingHTTPServer`` so the maintainer's browser can fetch
+    Uses ``http.server.ThreadingHTTPServer`` so the browser can fetch
     the cover image alongside the HTML without serialising requests.
-    Block on ``serve_forever()``; KeyboardInterrupt (Ctrl-C) is the
+    ``ThreadingHTTPServer`` (unlike bare ``socketserver.ThreadingTCPServer``)
+    inherits ``allow_reuse_address = True`` from ``HTTPServer``.  This
+    matters because Ctrl-C → re-run within ~60s would otherwise raise
+    ``OSError: [Errno 48] Address already in use`` (macOS; errno 98 on Linux)
+    while the socket sits in TIME_WAIT.
+
+    Blocks on ``serve_forever()``; KeyboardInterrupt (Ctrl-C) is the
     documented exit path.  No coverage here — tests exercise the
     pure renderer and ``--no-serve`` path; serving is glue that
     requires a live socket.
@@ -517,7 +516,7 @@ def _serve(directory: Path, port: int, *, open_browser: bool) -> None:
     print(f"serving {directory} at {url} — Ctrl-C to stop", file=sys.stderr)
     if open_browser:
         webbrowser.open(url)
-    with socketserver.ThreadingTCPServer(("", port), handler_factory) as httpd:
+    with http.server.ThreadingHTTPServer(("", port), handler_factory) as httpd:
         httpd.serve_forever()
 
 

From 6f957dd4ceffb4262e63053eac052036c41e0292 Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Sat, 9 May 2026 10:10:01 +0300
Subject: [PATCH 4/6] PR 7.2: update .agent-plan.md with closed-entry summary

Replaces the open PR 7.2 stub with the dense-summary closed entry
mirroring PR 7.1's structure: every load-bearing decision, both
self-review passes, and the validation-suite numbers inline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .agent-plan.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.agent-plan.md b/.agent-plan.md
index 4f4c290..47c1e6c 100644
--- a/.agent-plan.md
+++ b/.agent-plan.md
@@ -64,7 +64,7 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
 
 ### Phase 7 — LLM critique + publish (3 PRs)
 - [x] PR 7.1: LLM critique module + prompt + driver landed.  `leadforge/validation/llm_critique.py` (new) — single-provider Anthropic critique core via an `LLMCritiqueClient` protocol (no preemptive OpenAI/Gemini stubs); `_AnthropicCritiqueClient` lazy-imports the SDK so the module imports cleanly even on machines without `anthropic` installed (the skip-cleanly path needs to work without the SDK).  `has_anthropic_credentials` / `api_key_or_skip` treat unset and empty-after-strip identically as "absent", explicitly to handle the `env -i` / stale `.envrc` case where the shell sets `ANTHROPIC_API_KEY=""` and the SDK would otherwise 401 instead of cleanly skipping.  Default model `claude-opus-4-7` with `thinking={"type": "adaptive", "display": "summarized"}` (only mode supported on Opus 4.7 — manual `budget_tokens` 400s) and `output_config={"effort": "high"}` (recommended minimum for intelligence-sensitive work per the `claude-api` skill); two prompt-cache breakpoints (rubric + input bundle) per the design doc's caching strategy so the common adjudication-loop workflow hits cache on both layers; streamed via `messages.stream(...).get_final_message()` to dodge the 10-min idle-connection timeout on long adaptive-thinking responses.  
`build_input_bundle` is pure (same `release_dir` → byte-identical bytes → identical `sha256`) and assembles eleven blocks: `release/README.md`, per-tier `dataset_card.md`, `docs/release/generation_method.md`, `manifest.json`, `feature_dictionary.csv`, `validation_report.{md,json}`, the first 100 test-split rows rendered as deterministic CSV, the public/instructor diff summary (live-derived from the `BANNED_LEAD_COLUMNS` / `BANNED_OPP_COLUMNS` / `BANNED_TABLES` / `SNAPSHOT_FILTERED_TABLES` constants in `leakage_probes.py` — single source of truth, auto-stays-in-sync, sync-tested), the public-safe mechanism summary (motif family **names** + difficulty knob **names**, never values — same redaction posture as `student_public`), and the break-me guide verbatim ("avoid re-deriving" the existing nine patterns).  `parse_critique_response` schema-validator pins eleven malformations (missing required field, wrong severity, wrong category, wrong rubric dimension, finding-id collision, findings non-list, top-level non-object, non-JSON, score out of range, defensive code-fence stripping, empty findings list valid) and returns every problem in one error rather than the first one.  Output schema is a frozen dataclass (no pydantic dependency) with the nine-value `category` vocabulary lifted **verbatim** from `break_me_guide.md` so findings route to existing issue-template labels without translation; `rubric_dimension: str` is required on every finding (D1-D14) so reviewers can audit clustering.  Provenance triple (`model` / `effort` / `thinking_mode`) plus per-source-file `bundle_hashes` and the assembled `input_bundle_sha256` are carried on every result for audit-artifact-sync — re-runs on the same RC produce the same bundle hashes.  
`docs/release/llm_critique_prompt.md` (new) — the rubric document the driver feeds to Claude, parseable via `<system_prompt>` / `<user_cue>` section markers with surrounding prose ignored; fourteen rubric dimensions (D1 documentation truthfulness · D2 leakage discipline · D3 realism vs disclosure · D4 difficulty signal · D5 calibration / value-aware ranking · D6 cohort/time-window discipline · D7 notebook integrity · D8 platform packaging hygiene · D9 adversarial-framing completeness · D10 pedagogy of the documented `total_touches_all` trap · D11 effective semantic diversity per recommendation #12 v1 scope · D12 Datasheets-for-Datasets composition · D13 manifest/provenance integrity · D14 out-of-scope guard).  Severity calibration explicitly written to discourage padding the report with low-severity nits and to surface "no high-severity findings" as a positive signal vs "the critique didn't surface any".  `scripts/run_llm_critique.py` (new) — driver mirroring `validate_release_candidate.py`'s posture (free-function `parse_args`, frozen `DriverConfig`, `run_critique(config) -> DriverResult`, `main(argv)` returning an exit code).  Skip-cleanly path triggers BEFORE any I/O — no rubric read, no bundle build, no out-dir creation; tested explicitly with `not (tmp_path / "out").exists()` after the skip.  Three modes alongside the live path: `--dry-run` writes the rendered input bundle to `<out-dir>/llm_critique_input_<ts>.md` for human inspection (different filename from the real raw JSON, can't be confused); `--no-execute` calls `api_key_or_skip` + `build_anthropic_client()` to prove the SDK is installed and creds are present without burning an API call (CI smoke); `--out-tag` suffixes the raw filename so adjudication re-runs don't shadow the canonical run.  Outputs: timestamped `llm_critique_raw_<UTC-iso>.json` (accumulates per run, no clobber) + canonical `llm_critique_summary.md` (overwritten in place so dataset-card links don't rot).  
Exit codes mirror `validate_release_candidate.py`: 0 pass (skip-cleanly counts as pass), 1 high-severity surfaced and unresolved, 2 pre-flight error or schema-validation failure (every problem rendered to stderr, not just the first).  Adjudication is **maintainer-driven** post-exit — resolve in code OR log to `v2_decision_log.md`, then re-run; the next critique's exit code is the gate.  Tests: 61 cases across `tests/validation/test_llm_critique.py` (48) and `tests/scripts/test_run_llm_critique.py` (13), no live API; the protocol is exercised via a small in-process `_CannedClient` fake.  Sync tests pin: every `VALID_CATEGORIES` entry appears in `break_me_guide.md` (vocabulary doesn't drift), `VALID_RUBRIC_DIMENSIONS` is exactly D1-D14, the live-derived public/instructor diff names every banned-column/banned-table constant (live reference, not duplicated string).  Audit-artifact-sync smoke test (`test_real_release_dir_smoke`) builds the input bundle against the actual `release/intermediate/` artefacts and pins determinism on the real input, skipping cleanly when bundles aren't present.  `docs/release/llm_critique_design.md` (new) records the nine load-bearing design calls before implementation so a reviewer can audit the choice (provider abstraction, skip-cleanly, model+caching+thinking, output schema, input-bundle composition, determinism via provenance, CLI flags, test posture, first-run adjudication workflow).  Live first-run deferred to maintainer (no `ANTHROPIC_API_KEY` available to the agent); the dry-run path was exercised against the real release dir end-to-end, producing a 148KB byte-stable input bundle from the actual artefacts.  
Hostile self-review pass before requesting review caught and folded back twelve findings against the diff, including two BLOCKERs (`--no-execute` was performing pre-flight I/O before the credentials check, contradicting the design doc; raw-output filename collision at second-precision contradicted the "append-only history" promise — fixed with microsecond precision and a pinning test) and five HIGHs (silent `release_id` default that defeated the audit-artifact-sync gate; design-doc lies about a never-existing `temperature` field and "malformed timestamp" malformation that's driver-generated; dead `if/else` branches in `_safe_difficulty_knobs`; greedy regex for the rubric section markers so the prompt-injection warning paragraph that legitimately references `</user_cue>` doesn't break the parser).  Prompt-injection mitigation added to the rubric (treat-input-as-data preamble) since the input bundle inlines user-authored content (dataset_card.md, break_me_guide.md).  Schema validator hardened against silent `str()` coercion of finding prose fields (an int "claim" would have landed on disk as the string "5" — now rejected).  Net: 1321/1321 tests pass + 5 publish-extra-gated skips; ruff + mypy clean (83 source files); leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5; validation_report timestamp drift reverted before commit per the brief.  
Second senior-dev review pass after PR #76 was opened caught and folded back 9 more issues, several of which were real bugs the first hostile pass missed: (B1) `--out-tag` suffixed only the raw JSON, leaving `llm_critique_summary.md` clobbered on adjudication runs — fix suffixes both files (`summary_output_path` now takes `tag`); (B2) skip-cleanly silently passed a release-readiness gate, contradicting `v1_release_roadmap.md`'s line-35 acceptance criterion that the critique must actually run — added `--require-execute` flag (default off; release-readiness CI sets it) that converts the skip path into `MissingCredentialsError` exit 2, plus a loud `WARNING — release-readiness gate has NOT been evaluated` stderr line on the regular skip path; (A2) two prompt-cache breakpoints cut to one — system content already sits inside the cached prefix on `messages.create` (system → messages render order), so the second breakpoint bought nothing and burned a slot; (M1) design doc cut from 394 lines to 73 — the 9-decision table replaces the multi-paragraph rationale-per-call shape that read as documentation theater; (M2) rubric cut from 420 lines to ~210 — each dimension now one paragraph instead of 3-6, dropped D14 ("out-of-scope guard") which was meta-instruction not a rubric dimension, made it a "What is NOT yours to audit" appendix at the end; rubric is now D1-D13 and `VALID_RUBRIC_DIMENSIONS` updated in lockstep; (M3) test-split sample replaced 100 raw rows of CSV with `df.describe(include="all")` per-column statistics + a 20-row head — distributional conclusions need statistics not raw rows, and the rendered input bundle dropped from 148KB to 128KB; (M5) streaming-via-`messages.stream` replaced with `messages.create(timeout=600.0)` — no stream events were processed anyway, the contract is just "don't time out on long adaptive-thinking responses" and an explicit timeout is the right way to spell that; (M6) `render_input_bundle_text` free function moved to 
`InputBundle.render()` method — leaky abstraction; the audit-artifact-sync framing was misleading (no committed-artefact diff) and was renamed to "smoke test against the real release dir" / "staleness check vs committed result" throughout the module and design doc.  Net after the second pass: 1323/1323 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5; validation_report timestamp drift reverted again before this commit.  First live critique run executed by the maintainer with a dedicated Anthropic project key (`leadforge-llm-critique-v1-prod`): score 7/10, six findings (1 high, 4 medium, 1 low), exit code 1 as designed for unresolved high-severity findings.  Adjudication: F001 high-severity (93 % `account_id` overlap between train/test documented only in break_me_guide §5, missing from README/dataset_card) — **resolved in code** by adding a "Group-leakage warning" paragraph to `release/README.md` "Splits" subsection citing the 518/557 figure and a `GroupKFold(account_id)` recipe; the parallel disclosure on the auto-rendered `dataset_card.md` is logged as `accepted-for-v2` because the renderer change is out of scope for PR 7.1's no-bundle-regen rule.  F004 medium (break_me_guide pattern 5 covered `account_id` but not `contact_id`, despite contacts being shared across the lead-keyed split at the same magnitude) — **resolved in code** by extending §5 to enumerate both keys and any reusable foreign-key column as group-leakage axes.  F006 low (README "Conversion rate (recipe band)" column header didn't make clear it was a recipe-acceptance window not an observed range) — **resolved in code** by renaming to "(acceptance band, gate G7.\*)" and adding a one-sentence note that observed five-seed spreads sit comfortably inside the band.  
F002 medium (Gaussian noise produces non-physical values: negative ACV, negative day-deltas, day-deltas > snapshot_day=30, undisclosed in dataset card) — `accepted-for-v2`; requires `leadforge/narrative/dataset_card.py` change.  F003 medium (`](../foo)` relative links would 404 on Kaggle/HF) — `wont-fix`: already treated by `scripts/_release_common.py::rewrite_release_links()` which both platform packagers (PR 5.1, 5.2) call at packaging time; the LLM didn't have visibility into the platform packagers and made a wrong inference.  F005 medium (advanced-tier `calibration_max_bin_error = 0.5234` driven by an n=2 high-probability bin, no minimum-bin-count footnote) — `accepted-for-v2`; not a 1-line change, touches `release_quality.py` metric definition and would require regenerating `validation_report.{json,md}` which PR 7.1's brief explicitly forbids.  Three missing-section callouts (Datasheets §Biases, §Privacy, per-bundle group-split warning) and three maintainer questions (noise/windowing interaction, `top_decile_rate` naming, Kaggle/HF docs subtree) all logged to `docs/release/v2_decision_log.md`.  README edits cascaded into the platform packager artefacts; `release/kaggle/dataset-metadata.json` and `release/huggingface/README.md` regenerated cleanly via the existing packagers (`scripts/package_{kaggle,hf}_release.py`).  Critique run output committed to `release/validation/llm_critique_raw_20260508T204359.124834Z.json` + `release/validation/llm_critique_summary.md`.  Final net: 1325/1325 tests pass + 5 publish-extra-gated skips; ruff + mypy clean (83 source files); leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5.  Phase 7 PR 7.1 closed; PR 7.2 (local Kaggle/HF mock-page preview) is next.
-- [ ] **PR 7.2** — local Kaggle + HuggingFace mock-page preview tooling (must land before PR 7.3): `scripts/preview_kaggle_page.py` and `scripts/preview_hf_page.py` render offline HTML mocks of the public Kaggle and HF dataset pages from the *exact* upload artefacts (metadata JSON, README, cover image), serve over `localhost`, and let the maintainer click through both pages in a browser before any platform upload — catches styling / link / YAML-rendering issues before they hit cached previews on the live page. Tests cover required-field presence, link resolution, schema column listing, configs-block round-trip.
+- [x] PR 7.2: local Kaggle + HuggingFace mock-page preview tooling landed.  `scripts/preview_kaggle_page.py` (new) — reads the *exact* artefacts the publish PR will upload (`release/kaggle/dataset-metadata.json` + the inlined README body + the cover image, prefer `release/kaggle/dataset-cover-image.png` then fall back to the gitignore-resilient `release/dataset-cover-image.png` master copy) and renders an offline HTML page mocking the public Kaggle dataset view: header (title / subtitle / id pill / licence / update-frequency / visibility), cover image, rendered description (the inlined README body), file tree of declared resources grouped by tier with per-tier counts, schema/columns table for every tabular resource (`resources[].schema.fields[].name/type/description`) with per-table column counts in the heading, user-specified-sources block (rendered only when present), keywords + licence footer.  Serves on `http://localhost:8765` via stdlib `http.server.ThreadingHTTPServer` (the threading variant inherits `allow_reuse_address=True` from `HTTPServer`, so Ctrl-C → re-run within ~60s does not raise `OSError [Errno 48] Address already in use` while the socket sits in TIME_WAIT — caught and folded back in self-review pass 1, the initial draft used `socketserver.ThreadingTCPServer` which defaults to `False`).  `--no-serve` builds the HTML and exits (CI / inspection mode); `--open-browser` pops a tab on startup; `--port` / `--release-dir` / `--out-dir` round out the surface.  
`scripts/preview_hf_page.py` (new) — reads `release/huggingface/README.md` (or `release/huggingface-instructor/README.md` per `--variant=public|instructor`) and parses YAML frontmatter + Markdown body via a single anchored regex (`r"\A---\n(?P<yaml>.*?)\n---\n(?P<body>.*)\Z"` with `re.DOTALL`); renders the analogous HF view: header pills (pretty_name + license + task_categories + size_categories + language), tag chips, configs dropdown (one details-block per `configs[]` entry with the default config flagged via a single `badge--default` instance, data_files split→path table per config), file tree of declared YAML paths bucketed by config, README body, footer carrying the variant for human visual confirmation.  `--variant` defaults `--out-dir` to `release/_preview/huggingface/` (public) or `release/_preview/huggingface-instructor/` (instructor); the instructor path also reads its README from a different location (`huggingface-instructor/README.md`) and looks for the cover under the variant directory first.  Both scripts share the validation discipline from the Phase 5 packagers: build → validate → write; pre-flight failures (missing metadata, malformed JSON / YAML, unknown variant, missing cover) raise and the CLI converts to rc=2 without touching disk; runtime success exits 0.  Markdown rendering via `markdown-it-py` in `gfm-like` preset (tables / fenced code / strikethrough on; `linkify` explicitly disabled so the optional `linkify-it-py` transitive dep is not required); the dep is added to the `[publish]` extra alongside `datasets` / `kaggle` (mirrors the PR 5.1 / 5.2 gating posture for publish-pipeline tooling), and absent imports raise a clean `ImportError` pointing at `pip install -e ".[publish]"` instead of a cryptic stdlib `ModuleNotFoundError`.  Both renderers are pure: same `(metadata|doc, cover_filename|variant)` → byte-identical HTML (no `now()`, no random, no clock).  
Output landing at `release/_preview/<platform>/index.html` is gitignored (`.gitignore` adds `release/_preview/`); the audit-artefact-sync gate lives at `release/_preview_committed/{kaggle,huggingface_public,huggingface_instructor}.html` (committed alongside the scripts, mirrors the PR 4.1 / 5.1 / 5.2 / 7.1 audit-sync pattern).  HTML is wrapped in a single self-contained file (CSS inlined, no external stylesheet) so each committed sample is human-inspectable directly from `git show` or a browser without a server.  XSS-safety: every user-controlled string passes through a hand-rolled `_escape` (`&`, `<`, `>`, `"`, `'`); kept hand-rolled rather than `html.escape` so the committed samples' `&#39;` (decimal) escapes don't churn against `html.escape`'s `&#x27;` (hex) entity.  Tests: 48 cases across `tests/scripts/test_preview_kaggle_page.py` (20) and `tests/scripts/test_preview_hf_page.py` (28); no live HTTP, no network, no socket open.  The four roadmap-mandated checks per script: required field labels appear in rendered HTML (Kaggle: title / subtitle / id / license / file count / schema column count; HF: pretty_name / license / configs / tags); any Markdown link in the source that resolves to a non-allow-listed URL pattern fails the test (allow-list: `https://github.com/leadforge-dev/leadforge`, `https://huggingface.co/datasets/leadforge`, sibling-relative `LICENSE`, in-document `#` anchors; anything else is a 404 risk on the live page); the Kaggle schema table lists every column declared in `resources[].schema.fields` (iterates the committed metadata, asserts each `<code>{name}</code>` appears); every `configs[]` block in the HF YAML round-trips into the rendered dropdown.  Determinism is double-tested: `test_render_is_byte_deterministic` runs two passes against the real release artefact and pins equality; `test_committed_*_sample_matches_fresh_regeneration` pins the committed HTML against fresh regeneration byte-for-byte (the audit-sync gate).  
Pre-flight error paths exercised end-to-end: missing artefact (`FileNotFoundError`), malformed JSON / YAML (`ValueError`), unknown variant, missing cover image — all return rc=2 via `main()` with informative stderr.  HTML escape coverage: `test_render_escapes_html_in_field_values` asserts a `<script>` payload in the title / pretty_name field is rendered as `&lt;script&gt;`, not as a live tag (XSS guard for any future recipe that surfaces unescaped user content).  `parse_hf_readme` rejects missing-frontmatter and non-mapping-frontmatter inputs explicitly so the renderer never sees half-parsed input.  `pyproject.toml` `[tool.ruff.lint.per-file-ignores]` adds `E501` for both preview scripts — inlined CSS strings inside f-string templates are the rendered product, not source code that benefits from a 100c wrap (mirrors the existing `scripts/build_release_notebook_*.py` ignore for the same reason).  `docs/release/preview_pages_design.md` (new, 59 lines) records the ten load-bearing design calls in the same decision-table shape as `llm_critique_design.md`: two scripts vs unified renderer, stdlib server vs Flask, f-string templates vs Jinja2, `markdown-it-py` via `[publish]` extra (with rationale for why this differs from the PR 5.1 / 5.2 *test* gating — preview scripts' runtime path requires the renderer, not just the smoke test), output-dir convention, cover-image inlining, HF variant flag, CLI shape, audit-sync, test posture (no live HTTP, no BeautifulSoup dep), plus the link-resolution rule (every rendered href must be in the allow-list — guards against the rewrite-stops-firing regression for `](../foo)` and `](validation/...)`).  
Hostile self-review pass 1 caught and folded back three findings: (B1) BUG: `socketserver.ThreadingTCPServer` defaults `allow_reuse_address=False`, so a restart after Ctrl-C would fail to bind for the ~60-second TIME_WAIT window; switched to `http.server.ThreadingHTTPServer`; (D1) DEAD CODE: `COMMITTED_SAMPLE_PATH` (Kaggle) and `_VARIANT_SAMPLE_PATH` (HF) module-level constants defined but never read at runtime (tests use their own `_REPO_ROOT`-rooted paths); deleted both, dropped the now-unused `socketserver` import; (M1) DOC LIE: `_resolve_cover_image` Kaggle docstring claimed "we prefer the kaggle-tree copy" without acknowledging that `release/kaggle/dataset-cover-image.png` is gitignored on a fresh checkout (only the committed master copy at `release/dataset-cover-image.png` is guaranteed present); reworded to call out the lookup order + gitignore reality.  Pass 2 found no significant architectural / scope issues: the ~30 lines of intentional duplication between the two scripts (`_escape`, `_serve`, `_make_handler_factory`, partly-duplicated CSS) are below the threshold where a `_preview_common.py` extraction would pay back; the Phase 5 `_release_common.py` exists for things shared between two callers, and a third caller is not on the horizon.  Net: 1373/1373 tests pass (1325 baseline + 48 new) + 5 publish-extra-gated skips; ruff + mypy clean (83 source files); leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0 (3 tiers, 5 seeds, 0 leakage findings); `BUNDLE_SCHEMA_VERSION` unchanged at 5; validation_report timestamp drift reverted before commit per the brief.  Phase 7 PR 7.2 closed; PR 7.3 (`publish_kaggle.py` + `publish_hf.py` + `docs/release/v1_release_notes.md` + tag `leadforge-lead-scoring-v1`) is next, and its publish runbook will cite the two preview commands as a required pre-flight step before `kaggle datasets create` / `huggingface-cli upload`.
 - [ ] **PR 7.3** — `scripts/{publish_kaggle,publish_hf}.py` (dry-run → local mock-page review → private/draft → public). Tag `leadforge-lead-scoring-v1`; `docs/release/v1_release_notes.md` (cites PR 7.2's preview commands as required pre-flight).
 
 ---

From c51d68ffaade8ba2a925176fe1a31c3260da1fb5 Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Sat, 9 May 2026 11:28:00 +0300
Subject: [PATCH 5/6] PR 7.2 self-review pass 3: BLOCKER + 3 HIGH + 2 MEDIUM +
 LOW polish
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A hostile-reviewer pass surfaced one CI blocker and two real fidelity
bugs the first two passes missed. Net diff: -77 lines, leaner code,
honest framing.

BLOCKER — markdown-it-py was in [publish] only; CI's test job
installs only [dev], so every preview test would have ImportError'd
mid-collection. Moved to [dev] (also kept in [publish] so neither
extra is incomplete). Tests run cleanly without the publish extra.

HIGH — Visibility pill removed from the Kaggle preview header.
Kaggle's public dataset page does not display isPrivate; rendering
"Visibility: Private" misled the maintainer about what public viewers
see. New test asserts neither "Visibility:" nor "pill--visibility"
appears in either private or public rendering.

HIGH — HF "Files declared in YAML" section deleted entirely. The
section surfaced an internal concept (configs[].data_files paths
that the configs dropdown already lists) while omitting most of the
real upload tree (manifest.json, tables/*.parquet, etc.). New test
asserts the misleading section is gone.

HIGH — _serve had zero coverage. Refactored: extracted
make_server(directory, port) -> ThreadingHTTPServer as a non-blocking
seam that lets tests bind on port 0, GET /, and shut down cleanly.
New test_make_server_binds_and_serves_index covers handler-factory
construction, address-reuse posture, threaded serve_forever, and
clean shutdown.
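
For orientation, a minimal sketch of what such a seam can look like.
This is not the code in scripts/_preview_common.py: the handler is
built here with functools.partial over SimpleHTTPRequestHandler
(an assumption standing in for _make_handler_factory), the bind
address is loopback-only, and fetch_index_once is a hypothetical
helper that plays the role of the test body.

```python
import functools
import http.server
import socketserver
import tempfile
import threading
import urllib.request
from pathlib import Path


def make_server(directory: Path, port: int) -> http.server.ThreadingHTTPServer:
    """Build (but do not start) a server rooted at ``directory``."""
    # Assumption: a partial over the stdlib handler stands in for the
    # project's own handler factory.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    # Port 0 asks the OS for any free port; the caller reads it back
    # from ``server_address`` after construction.
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


def fetch_index_once(directory: Path) -> tuple[int, bytes]:
    """Serve in a background thread, GET /, then shut down cleanly."""
    httpd = make_server(directory, 0)
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    try:
        port = httpd.server_address[1]
        url = f"http://127.0.0.1:{port}/"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status, resp.read()
    finally:
        httpd.shutdown()       # stops serve_forever()
        httpd.server_close()   # releases the listening socket
        thread.join(timeout=5)
```

Because construction and serve_forever() are separate steps, a test
can own the lifecycle; the blocking serve path stays a thin wrapper.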

MEDIUM — extracted scripts/_preview_common.py with escape, serve,
make_server, _make_handler_factory. Both scripts import from it.
~80 lines of byte-identical duplication eliminated; pass 2's
"below the threshold for extraction" call was wrong — when the
duplicate is exactly identical, the threshold is much lower.

MEDIUM — design doc reframed. Previous language called this an
"audit-artefact-sync" gate; it's actually a regeneration-discipline
gate (a bug in the renderer propagates to both committed sample and
test, so byte-equality alone catches forgotten regenerations, not
correctness — the structural tests are the real audits). Whole
PR repositioned as a "publication-readiness preview", not a
Kaggle/HF look-alike, since pixel fidelity was always out of scope
and the brand-mismatch dominates everything else.
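
A toy illustration of that argument, with a deliberately broken
renderer. render_buggy and the sample title are hypothetical and
exist only for this sketch; they are not taken from the scripts.

```python
def render_buggy(title: str) -> str:
    """Hypothetical renderer with a deliberate bug: the title is dropped."""
    return "<h1></h1>"


# The committed sample was regenerated with the same buggy renderer.
committed_sample = render_buggy("Lead scoring")
fresh_render = render_buggy("Lead scoring")

# Byte-equality only proves the sample is not stale.
sample_is_current = committed_sample == fresh_render

# A structural check is what notices the missing content.
title_is_rendered = "Lead scoring" in fresh_render
```

Here sample_is_current holds while title_is_rendered does not, which
is why the byte-equality test is framed as regeneration discipline.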

LOW polish:
- PreviewOutcome.cover_path tightened from Path | None to Path
  (always set in practice; the optional type was the kind of
  speculative flexibility that CLAUDE.md flags as bad).
- _TIER_PATH_RE regex replaced by str.split (clearer, equivalent).
- Module docstrings trimmed; rationale lives in the design doc.
- Variant-equality test in HF tests fixed: replace("public", ...)
  was matching "public" inside "publication" in the new footer
  copy; switched to the full Variant: <code>X</code> marker.
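
The substring pitfall behind that last fix, as a small sketch. The
footer string and both helper names are made up for illustration;
only the Variant: <code>X</code> marker comes from the change itself.

```python
def swap_variant_naive(html: str, old: str, new: str) -> str:
    """Replace every occurrence of ``old``, including inside other words."""
    return html.replace(old, new)


def swap_variant_marker(html: str, old: str, new: str) -> str:
    """Replace only the full ``Variant: <code>X</code>`` marker."""
    return html.replace(
        f"Variant: <code>{old}</code>", f"Variant: <code>{new}</code>"
    )


# Hypothetical footer copy containing "publication".
footer = "publication-readiness preview. Variant: <code>public</code>"

naive = swap_variant_naive(footer, "public", "instructor")
anchored = swap_variant_marker(footer, "public", "instructor")
```

The naive swap also rewrites the "public" inside "publication"; the
anchored swap leaves the prose alone and changes only the marker.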

Net: 1375/1375 tests pass + 5 publish-extra-gated skips; ruff +
ruff format + mypy clean (83 source files); leakage probes 0/3 on
every tier; hash determinism PASS 67/67; validate_release_candidate
--no-rebuild exits 0; BUNDLE_SCHEMA_VERSION unchanged at 5;
validation_report timestamp drift reverted before commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 docs/release/preview_pages_design.md          |  57 ++--
 pyproject.toml                                |   9 +
 .../huggingface_instructor.html               |  17 +-
 .../huggingface_public.html                   |  23 +-
 release/_preview_committed/kaggle.html        |   3 +-
 scripts/_preview_common.py                    | 100 ++++++
 scripts/preview_hf_page.py                    | 247 ++++-----------
 scripts/preview_kaggle_page.py                | 296 +++++-------------
 tests/scripts/test_preview_hf_page.py         |  32 +-
 tests/scripts/test_preview_kaggle_page.py     |  61 +++-
 10 files changed, 384 insertions(+), 461 deletions(-)
 create mode 100644 scripts/_preview_common.py

diff --git a/docs/release/preview_pages_design.md b/docs/release/preview_pages_design.md
index 0742f59..7ad607c 100644
--- a/docs/release/preview_pages_design.md
+++ b/docs/release/preview_pages_design.md
@@ -1,30 +1,39 @@
 # PR 7.2 — Local Kaggle / HF preview-page design notes
 
 Working notes for `scripts/preview_kaggle_page.py`,
-`scripts/preview_hf_page.py`, their tests, and the committed
-sample-rendered HTML used as the audit-artefact-sync gate. Captured
-before implementation; kept short on purpose.
+`scripts/preview_hf_page.py`, the shared `scripts/_preview_common.py`,
+their tests, and the committed sample HTML used as the
+regeneration-discipline gate. Captured before implementation; revised
+after self-review pass 3 to be honest about scope. Kept short on
+purpose.
 
 The PR's pedagogical role is the *staging gate* before PR 7.3: the
 maintainer renders both platforms locally from the same artefacts the
 publish PR will upload, clicks through them in a browser, and catches
-styling / link / YAML-rendering issues before they hit cached
+link / config / column-listing issues before they hit cached
 previews on the live page.
 
+This is a **publication-readiness preview**, not a Kaggle / HF
+look-alike. Pixel fidelity is explicitly out of scope; the chrome
+(CSS palette, layout) is approximate. The tool's job is structured
+rendering of the upload artefacts so a maintainer can verify the
+content; visual brand matching is not its job.
+
 ## Decisions
 
 | # | Decision | Why |
 |---|---|---|
-| 1 | Two scripts, one per platform. Not a unified renderer. | Kaggle and HF have different inputs (`dataset-metadata.json` vs YAML-frontmatter README) and different page structures (schema/columns table vs configs dropdown). One file per platform keeps each renderer locally complete and the diff readable. |
-| 2 | Server: stdlib `http.server.ThreadingHTTPServer` + `webbrowser.open()`. No Flask. | The pages are static HTML over a fixed file tree. A web framework would be a new dep with no benefit; the brief explicitly suggests stdlib. |
-| 3 | Templates: f-string helpers, not Jinja2. | Layout is layout-stable; two pages don't justify a templating engine. f-string helpers keep the renderer in one file and free of a new dep. |
-| 4 | Markdown→HTML via `markdown-it-py` (added to `[publish]` extra alongside `datasets` / `kaggle`). | Faithfulness is the goal — Kaggle and HF both render the README body as Markdown, hand-rolling a renderer for tables / fenced code / footnotes is brittle. `markdown-it-py` is MIT, pure-Python, CommonMark+GFM. The `[publish]` extra is the right home: this is a publish-pipeline tool, mirrors the PR 5.1 / 5.2 gating posture. Missing dep raises a clean `ImportError` that points at `pip install -e ".[publish]"`. |
-| 5 | Output dir: `release/_preview/<platform>/` (gitignored). | Mirrors `release/_release_quality/` convention. The committed audit-sync samples live at `release/_preview_committed/{kaggle,huggingface_public,huggingface_instructor}.html` so they don't collide with runtime output. |
-| 6 | Cover image served from the preview tree (copied in, not referenced). | Both platforms inline-display the cover image; serving it under the preview root means the rendered HTML's `<img src="dataset-cover-image.png">` works without absolute paths. The committed sample HTML uses the same relative reference — no path drift between the sample and what the local server emits. |
-| 7 | HF `--variant=public|instructor` reads either `release/huggingface/README.md` or `release/huggingface-instructor/README.md`. Different YAML, different file tree, different name. Kaggle has no instructor variant (Kaggle ships public only). | Matches the publish reality (HF gets a separate instructor companion repo per PR 5.2; Kaggle does not). |
-| 8 | CLI mirrors `validate_release_candidate.py` / `run_llm_critique.py`: free-function `parse_args`, frozen `Config`, `run_preview(config) -> Outcome`, `main(argv) -> int`. Exit codes 0 success / 2 pre-flight error. Flags: `--release-dir`, `--port` (8765 Kaggle / 8766 HF), `--out-dir`, `--variant` (HF only), `--open-browser`, `--no-serve`. | Maintainer muscle memory + small surface. `--no-serve` is the CI / inspection mode (build HTML, exit 0). `--open-browser` pops a tab on startup. |
-| 9 | Audit-artifact-sync. The renderer is pure: `(metadata.json | README + YAML, cover image filename) -> HTML`. No `now()`, no random. Committed HTML at `release/_preview_committed/*.html` must equal a fresh regeneration byte-for-byte. Same pattern as PR 4.1 / 5.1 / 5.2 / 7.1. | Determinism is the gate against silent drift. The committed HTML doubles as a human-inspectable sample for reviewers who don't want to run the script. |
-| 10 | Test posture: in-process. No live HTTP. Each test renders the page once via `render_kaggle_html()` / `render_hf_html()` and asserts against the rendered string with substring + regex. No BeautifulSoup dep (avoidable for the assertion bar we need). The four roadmap-mandated checks: required field labels appear; every Markdown link in the source resolves to a non-404 URL pattern; every config block (HF) round-trips; the Kaggle schema table lists every CSV / parquet column from `resources[].schema.fields`. | Per the brief — no live HTTP, no new test deps unless necessary. Substring assertions on deterministic rendered HTML give the same coverage with less surface. |
+| 1 | Two scripts, one per platform, sharing `scripts/_preview_common.py` for `escape` / `make_server` / `serve`. | Inputs are different (`dataset-metadata.json` vs YAML-frontmatter README) and page structures differ enough that one renderer per platform reads better. The 3-helper shared module replaces what was duplicated byte-for-byte. |
+| 2 | Server: stdlib `http.server.ThreadingHTTPServer` via the shared `make_server(directory, port) -> ThreadingHTTPServer`. | `ThreadingHTTPServer` inherits `allow_reuse_address=True` from `HTTPServer` (bare `socketserver.ThreadingTCPServer` does not, so Ctrl-C → re-run within ~60s would fail to bind while the old socket sits in TIME_WAIT). The `make_server` seam is what lets the tests bind on port 0, GET `/`, and shut down cleanly without forking a subprocess. |
+| 3 | Templates: f-string helpers, not Jinja2. | The layout is stable; two pages don't justify a templating engine. |
+| 4 | Markdown→HTML via `markdown-it-py` (in `[dev]` AND `[publish]`). `gfm-like` preset; `linkify` disabled to avoid the `linkify-it-py` transitive dep. | The dep is small and pure-Python; the renderer is the test surface (not a smoke), so gating it behind `[publish]` would mean CI's `[dev]`-only test job ImportErrors on every render. Listed in both extras so neither path breaks. |
+| 5 | Output dir: `release/_preview/<platform>/` (gitignored). Committed regeneration samples at `release/_preview_committed/{kaggle,huggingface_public,huggingface_instructor}.html`. | Mirrors `release/_release_quality/` convention. The committed samples double as human-inspectable rendered output for code reviewers who don't want to install the dep and run the script. |
+| 6 | Cover image copied into the preview tree (sibling-relative `<img src=>`). | Both platforms inline-display the cover image; serving it under the preview root means the rendered HTML works without absolute paths. |
+| 7 | HF `--variant=public|instructor` reads either `release/huggingface/README.md` or `release/huggingface-instructor/README.md`. Kaggle has no instructor variant. | Matches the PR 5.2 publish reality. |
+| 8 | CLI mirrors `validate_release_candidate.py` / `run_llm_critique.py`: free-function `parse_args`, frozen `Config`, `run_preview(config) -> Outcome`, `main(argv) -> int`. Exit codes 0 / 2. Flags: `--release-dir`, `--port` (8765 Kaggle / 8766 HF), `--out-dir`, `--variant` (HF only), `--open-browser`, `--no-serve`. | Maintainer muscle memory + small surface. `--no-serve` is the CI / inspection mode; `--open-browser` pops a tab on startup. |
+| 9 | The byte-equality test against the committed sample is a **regeneration-discipline gate**, NOT a renderer audit. The renderer-audit work is done by the structural tests (schema-column exhaustiveness, link allow-list, configs round-trip), each of which compares rendered output against an independent source of truth. | A bug in the renderer propagates to both the committed sample (regenerated by the same code) and the test, so byte-equality alone catches "someone forgot to regenerate", not correctness. Calling it "audit-artefact-sync" oversells; the structural tests are the real audits. |
+| 10 | Test posture: in-process. No live HTTP for the rendering tests; `_preview_common.make_server` is exercised by a single port-0 smoke test that GETs `/` and shuts down. The render functions are pure and tested via substring + regex on the rendered string. | No new test deps (no BeautifulSoup). Substring assertions on deterministic rendered HTML give the same coverage with less surface. The smoke test covers the previously-untested server glue. |
+| 11 | Visibility pill (Kaggle) NOT rendered. HF "Files declared in YAML" section NOT rendered. | Both were fidelity bugs caught in self-review pass 3. Kaggle's public page does not display `isPrivate` — showing it would mislead the maintainer about what public viewers see. HF's "Files declared in YAML" surfaced an internal concept (the configs[].data_files paths) that the configs dropdown already lists, while omitting most of the actual upload tree (manifest.json, tables/, feature_dictionary.csv, …). |
 
 ## Link-resolution rule (test pin)
 
@@ -33,16 +42,16 @@ must satisfy ONE of:
 
 1. Absolute `https://github.com/leadforge-dev/leadforge/...` URL (the
    rewrite output of `_release_common.py::rewrite_release_links()`).
-2. External absolute URL on a known-OK domain (`https://huggingface.co`,
-   `https://github.com/leadforge-dev/leadforge`, footnote anchors).
+2. External absolute URL on a known-OK domain (`https://huggingface.co`).
 3. Relative path that resolves to a file under the upload tree
    (e.g. `LICENSE` → `release/<platform>/LICENSE`).
+4. In-document anchor (`#footnote-1` etc.).
 
-A `](../foo)` link or a `](validation/...)` link in the rendered
-HTML is a regression — those are exactly what the platform packagers'
-rewrite is supposed to canonicalise away. The test fires loud the
-moment the rewrite stops doing its job for the upstream artefact the
-preview renders.
+A `](../foo)` or `](validation/...)` link in the rendered HTML is a
+regression — those are exactly what the platform packagers' rewrite
+is supposed to canonicalise away. The test fires the moment the
+rewrite stops doing its job for the upstream artefact the preview
+renders.
 
 ## What this PR does not touch
 
@@ -54,6 +63,6 @@ preview renders.
 - No change to the platform packagers (`scripts/package_{kaggle,hf}_release.py`)
   or `_release_common.py`. The preview reads what the packagers wrote.
 - Live Kaggle / HF API calls — pure local rendering only.
-- Pixel-perfect cloning of the live pages. The bar is "a maintainer
-  clicking through it would notice the same broken link, malformed
-  YAML, or missing config that they'd notice on the live page".
+- Pixel-perfect cloning of the live pages. The bar is structured
+  rendering of the upload artefacts; visual brand matching is out of
+  scope.
diff --git a/pyproject.toml b/pyproject.toml
index 90a742b..871cd2a 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,6 +40,13 @@ dev = [
     "types-pyyaml>=6.0",
     "scikit-learn>=1.3",
     "matplotlib>=3.7",
+    # PR 7.2: the preview-page renderers (scripts/preview_{kaggle,hf}_page.py)
+    # call into markdown-it-py at test time via render_*_html().  Keeping
+    # the dep here as well as in [publish] means CI's "test" job (which
+    # installs only [dev]) does not ImportError mid-test.  pytest.importorskip
+    # would also work, but the rendering tests are the primary coverage of
+    # this PR — gating them off would defeat the purpose.
+    "markdown-it-py>=3.0",
 ]
 scripts = [
     "scikit-learn>=1.3",
@@ -112,6 +119,8 @@ select = ["E", "F", "I", "N", "W", "UP", "B", "C4", "PT", "S"]
 # product, so wrapping the source CSS at 100c is line noise.
 "scripts/preview_kaggle_page.py" = ["E501"]
 "scripts/preview_hf_page.py" = ["E501"]
+# _preview_common is plain Python (no inline HTML / CSS); leaving
+# E501 enabled.
 
 [tool.mypy]
 python_version = "3.11"
diff --git a/release/_preview_committed/huggingface_instructor.html b/release/_preview_committed/huggingface_instructor.html
index 6296a4b..756ecaa 100644
--- a/release/_preview_committed/huggingface_instructor.html
+++ b/release/_preview_committed/huggingface_instructor.html
@@ -15,7 +15,7 @@
 .chip { display: inline-block; background: var(--pill-bg); border-radius: 12px; padding: 2px 10px; margin: 2px 4px 2px 0; font-size: 0.85em; color: var(--fg); }
 .section__heading { font-size: 1.3em; border-bottom: 2px solid var(--accent); padding-bottom: 4px; margin-top: 32px; }
 .section__count { color: var(--muted); font-size: 0.7em; font-weight: normal; }
-.config, .file-tree { border: 1px solid var(--border); border-radius: 4px; padding: 8px 12px; margin: 8px 0; }
+.config { border: 1px solid var(--border); border-radius: 4px; padding: 8px 12px; margin: 8px 0; }
 .config__name { cursor: pointer; font-weight: 600; }
 .config__count { color: var(--muted); font-weight: normal; font-size: 0.85em; }
 .badge { display: inline-block; padding: 1px 8px; border-radius: 4px; font-size: 0.75em; font-weight: 600; vertical-align: middle; margin-left: 4px; }
@@ -23,11 +23,6 @@
 .config__table { width: 100%; border-collapse: collapse; margin-top: 8px; font-size: 0.9em; }
 .config__table th, .config__table td { text-align: left; padding: 6px 8px; border-bottom: 1px solid var(--border); }
 .config__table th { background: var(--pill-bg); font-weight: 600; }
-.files__list { list-style: none; padding-left: 0; margin: 0; }
-.file { padding: 4px 0; border-bottom: 1px dotted var(--border); }
-.file:last-child { border-bottom: none; }
-.file__config { color: var(--muted); font-size: 0.85em; margin-right: 8px; }
-.file__path { color: var(--accent); }
 .readme { margin: 24px 0; }
 .readme code { background: var(--code-bg); padding: 1px 4px; border-radius: 2px; font-size: 0.9em; }
 .readme pre { background: var(--code-bg); padding: 12px; border-radius: 4px; overflow-x: auto; }
@@ -68,14 +63,6 @@ <h2 class="section__heading">Configurations / Subsets <span class="section__coun
     </table>
   </details>
 </section>
-<section class="files">
-  <h2 class="section__heading">Files declared in YAML <span class="section__count">(3 files / variant: instructor)</span></h2>
-  <ul class="files__list">
-    <li class="file"><span class="file__config">[intermediate]</span> <code class="file__path">intermediate/tasks/converted_within_90_days/train.parquet</code></li>
-    <li class="file"><span class="file__config">[intermediate]</span> <code class="file__path">intermediate/tasks/converted_within_90_days/valid.parquet</code></li>
-    <li class="file"><span class="file__config">[intermediate]</span> <code class="file__path">intermediate/tasks/converted_within_90_days/test.parquet</code></li>
-  </ul>
-</section>
 <section class="readme">
 <h1>LeadForge: Synthetic B2B Lead Scoring (v1) — Instructor companion</h1>
 <p>This is the <strong>research / instructor companion</strong> to the public
@@ -277,7 +264,7 @@ <h2>Maintenance, license</h2>
 <footer class="dataset-footer">
   <div class="dataset-footer__license">License: mit</div>
   <div class="dataset-footer__variant">Variant: <code>instructor</code></div>
-  <div class="dataset-footer__note">Local Hugging Face preview rendered by scripts/preview_hf_page.py — not the live dataset page.</div>
+  <div class="dataset-footer__note">Local Hugging Face publication-readiness preview rendered by scripts/preview_hf_page.py — not the live dataset page.</div>
 </footer>
 </main>
 </body>
diff --git a/release/_preview_committed/huggingface_public.html b/release/_preview_committed/huggingface_public.html
index 3f9006a..fd67f54 100644
--- a/release/_preview_committed/huggingface_public.html
+++ b/release/_preview_committed/huggingface_public.html
@@ -15,7 +15,7 @@
 .chip { display: inline-block; background: var(--pill-bg); border-radius: 12px; padding: 2px 10px; margin: 2px 4px 2px 0; font-size: 0.85em; color: var(--fg); }
 .section__heading { font-size: 1.3em; border-bottom: 2px solid var(--accent); padding-bottom: 4px; margin-top: 32px; }
 .section__count { color: var(--muted); font-size: 0.7em; font-weight: normal; }
-.config, .file-tree { border: 1px solid var(--border); border-radius: 4px; padding: 8px 12px; margin: 8px 0; }
+.config { border: 1px solid var(--border); border-radius: 4px; padding: 8px 12px; margin: 8px 0; }
 .config__name { cursor: pointer; font-weight: 600; }
 .config__count { color: var(--muted); font-weight: normal; font-size: 0.85em; }
 .badge { display: inline-block; padding: 1px 8px; border-radius: 4px; font-size: 0.75em; font-weight: 600; vertical-align: middle; margin-left: 4px; }
@@ -23,11 +23,6 @@
 .config__table { width: 100%; border-collapse: collapse; margin-top: 8px; font-size: 0.9em; }
 .config__table th, .config__table td { text-align: left; padding: 6px 8px; border-bottom: 1px solid var(--border); }
 .config__table th { background: var(--pill-bg); font-weight: 600; }
-.files__list { list-style: none; padding-left: 0; margin: 0; }
-.file { padding: 4px 0; border-bottom: 1px dotted var(--border); }
-.file:last-child { border-bottom: none; }
-.file__config { color: var(--muted); font-size: 0.85em; margin-right: 8px; }
-.file__path { color: var(--accent); }
 .readme { margin: 24px 0; }
 .readme code { background: var(--code-bg); padding: 1px 4px; border-radius: 2px; font-size: 0.9em; }
 .readme pre { background: var(--code-bg); padding: 12px; border-radius: 4px; overflow-x: auto; }
@@ -90,20 +85,6 @@ <h2 class="section__heading">Configurations / Subsets <span class="section__coun
     </table>
   </details>
 </section>
-<section class="files">
-  <h2 class="section__heading">Files declared in YAML <span class="section__count">(9 files / variant: public)</span></h2>
-  <ul class="files__list">
-    <li class="file"><span class="file__config">[intro]</span> <code class="file__path">intro/tasks/converted_within_90_days/train.parquet</code></li>
-    <li class="file"><span class="file__config">[intro]</span> <code class="file__path">intro/tasks/converted_within_90_days/valid.parquet</code></li>
-    <li class="file"><span class="file__config">[intro]</span> <code class="file__path">intro/tasks/converted_within_90_days/test.parquet</code></li>
-    <li class="file"><span class="file__config">[intermediate]</span> <code class="file__path">intermediate/tasks/converted_within_90_days/train.parquet</code></li>
-    <li class="file"><span class="file__config">[intermediate]</span> <code class="file__path">intermediate/tasks/converted_within_90_days/valid.parquet</code></li>
-    <li class="file"><span class="file__config">[intermediate]</span> <code class="file__path">intermediate/tasks/converted_within_90_days/test.parquet</code></li>
-    <li class="file"><span class="file__config">[advanced]</span> <code class="file__path">advanced/tasks/converted_within_90_days/train.parquet</code></li>
-    <li class="file"><span class="file__config">[advanced]</span> <code class="file__path">advanced/tasks/converted_within_90_days/valid.parquet</code></li>
-    <li class="file"><span class="file__config">[advanced]</span> <code class="file__path">advanced/tasks/converted_within_90_days/test.parquet</code></li>
-  </ul>
-</section>
 <section class="readme">
 <h1>LeadForge: Synthetic B2B Lead Scoring Dataset (<code>leadforge-lead-scoring-v1</code>)</h1>
 <p>A relational, reproducible, three-tier synthetic CRM dataset family for
@@ -473,7 +454,7 @@ <h2>Maintenance, adversarial framing, license</h2>
 <footer class="dataset-footer">
   <div class="dataset-footer__license">License: mit</div>
   <div class="dataset-footer__variant">Variant: <code>public</code></div>
-  <div class="dataset-footer__note">Local Hugging Face preview rendered by scripts/preview_hf_page.py — not the live dataset page.</div>
+  <div class="dataset-footer__note">Local Hugging Face publication-readiness preview rendered by scripts/preview_hf_page.py — not the live dataset page.</div>
 </footer>
 </main>
 </body>
diff --git a/release/_preview_committed/kaggle.html b/release/_preview_committed/kaggle.html
index 4c61444..d0ee29a 100644
--- a/release/_preview_committed/kaggle.html
+++ b/release/_preview_committed/kaggle.html
@@ -52,7 +52,6 @@ <h1 class="dataset-header__title">LeadForge: Synthetic B2B Lead Scoring (v1)</h1
   <ul class="dataset-header__pills">
     <li class="pill pill--license">License: MIT</li>
     <li class="pill pill--frequency">Updates: never</li>
-    <li class="pill pill--visibility">Visibility: Private</li>
   </ul>
 </header>
 <section class="cover">
@@ -1296,7 +1295,7 @@ <h2 class="section__heading">Sources</h2>
 <footer class="dataset-footer">
   <div class="dataset-footer__keywords"><span class="chip">b2b</span> <span class="chip">classification</span> <span class="chip">crm</span> <span class="chip">education</span> <span class="chip">lead-scoring</span> <span class="chip">saas</span> <span class="chip">synthetic-data</span> <span class="chip">tabular</span></div>
   <div class="dataset-footer__license">License: MIT</div>
-  <div class="dataset-footer__note">Local Kaggle preview rendered by scripts/preview_kaggle_page.py — not the live dataset page.</div>
+  <div class="dataset-footer__note">Local Kaggle publication-readiness preview rendered by scripts/preview_kaggle_page.py — not the live dataset page.</div>
 </footer>
 </main>
 </body>
diff --git a/scripts/_preview_common.py b/scripts/_preview_common.py
new file mode 100644
index 0000000..06a2e6d
--- /dev/null
+++ b/scripts/_preview_common.py
@@ -0,0 +1,100 @@
+"""Shared primitives for the local Kaggle / HF preview-page scripts.
+
+PR 7.2 — both ``scripts/preview_kaggle_page.py`` and
+``scripts/preview_hf_page.py`` need to:
+
+* HTML-escape user-controlled strings the same way (and emit the
+  same entity form so committed sample HTML doesn't churn between
+  scripts);
+* construct an ``http.server.ThreadingHTTPServer`` rooted at a
+  preview-output directory (chosen for ``allow_reuse_address=True``
+  inheritance from ``HTTPServer``);
+* start serving + optionally pop a browser tab.
+
+Splitting ``make_server`` away from ``serve`` is what lets the test
+suite stand the server up on port 0 in a thread, GET ``/``, and
+shut down cleanly — the alternative (calling ``serve_forever``
+directly) would require subprocess management and a real port
+allocation race.
+"""
+
+from __future__ import annotations
+
+import http.server
+import sys
+import webbrowser
+from pathlib import Path
+from typing import Any
+
+
+def escape(value: str) -> str:
+    """HTML-escape a single attribute / text value.
+
+    Hand-rolled rather than using ``html.escape`` so the committed
+    sample HTML uses the decimal ``&#39;`` entity for ``'`` (matching
+    what the preview scripts emitted at PR-open time) — switching to
+    ``html.escape``'s ``&#x27;`` would force a regen of every
+    committed sample with no observable rendering difference.
+    """
+
+    return (
+        str(value)
+        .replace("&", "&amp;")
+        .replace("<", "&lt;")
+        .replace(">", "&gt;")
+        .replace('"', "&quot;")
+        .replace("'", "&#39;")
+    )
+
+
+def _make_handler_factory(directory: Path) -> type[http.server.SimpleHTTPRequestHandler]:
+    """Build a handler subclass that serves from ``directory``.
+
+    ``SimpleHTTPRequestHandler`` accepts a ``directory=`` kwarg in
+    Python 3.7+, but threading the path through ``ThreadingHTTPServer``'s
+    ``RequestHandlerClass`` requires either a ``functools.partial`` or
+    a subclass; subclassing keeps the return value a real handler class.
+    """
+
+    resolved = str(directory.resolve())
+
+    class _Handler(http.server.SimpleHTTPRequestHandler):
+        def __init__(self, *args: Any, **kwargs: Any) -> None:
+            super().__init__(*args, directory=resolved, **kwargs)
+
+    return _Handler
+
+
+def make_server(directory: Path, port: int) -> http.server.ThreadingHTTPServer:
+    """Build (don't start) an HTTP server rooted at ``directory``.
+
+    ``ThreadingHTTPServer`` (unlike bare ``socketserver.ThreadingTCPServer``)
+    inherits ``allow_reuse_address = True`` from ``HTTPServer`` —
+    matters because Ctrl-C → re-run within ~60s would otherwise raise
+    ``OSError: [Errno 48] Address already in use`` (errno 98 on Linux)
+    while the socket sits in TIME_WAIT.
+
+    Pass ``port=0`` to let the kernel pick a free port; the bound
+    port is then on ``server.server_address[1]``.  This is the seam
+    that makes ``serve`` testable (test starts the server in a
+    thread, fetches one URL, shuts down).
+    """
+
+    return http.server.ThreadingHTTPServer(("", port), _make_handler_factory(directory))
+
+
+def serve(directory: Path, port: int, *, open_browser: bool) -> None:
+    """Start the HTTP server rooted at ``directory`` and block.
+
+    Blocks on ``serve_forever()``; KeyboardInterrupt (Ctrl-C) is the
+    documented exit path.  Untested by unit tests because it blocks;
+    ``make_server`` is the testable seam.
+    """
+
+    httpd = make_server(directory, port)
+    bound_port = httpd.server_address[1]
+    url = f"http://localhost:{bound_port}/"
+    print(f"serving {directory} at {url} — Ctrl-C to stop", file=sys.stderr)
+    if open_browser:
+        webbrowser.open(url)
+    httpd.serve_forever()
diff --git a/scripts/preview_hf_page.py b/scripts/preview_hf_page.py
index e4c6ecc..4da7669 100644
--- a/scripts/preview_hf_page.py
+++ b/scripts/preview_hf_page.py
@@ -1,45 +1,36 @@
 #!/usr/bin/env python3
-"""Render an offline mock of the Hugging Face dataset page.
+"""Local publication-readiness preview for the Hugging Face dataset page.
 
-PR 7.2 — middle PR in Phase 7 (LLM critique + publish).  Reads the
-artefact the publish PR will upload (``release/huggingface/README.md``
-or ``release/huggingface-instructor/README.md``) and renders an HTML
-page that mimics the public HF dataset view: header (pretty_name +
-licence + size pill), tag chips, configs dropdown, file tree, the
-README body, and a footer with sources.
+PR 7.2.  Reads the artefact the publish PR will upload
+(``release/huggingface/README.md`` or ``release/huggingface-instructor/README.md``
+per ``--variant=public|instructor``), parses the YAML frontmatter +
+Markdown body, renders an offline HTML page that surfaces the
+published structure (header pills, tag chips, configs dropdown,
+README body, footer), and optionally serves it on
+``http://localhost:8766``.
 
-Same rationale as ``preview_kaggle_page.py`` — cached previews on
-the live HF page are expensive to roll back, so the publish runbook
-in PR 7.3 cites this script as a required pre-flight.
+This is a *publication-readiness* preview — structured rendering of
+the upload artefact that helps catch link / config / YAML-rendering
+issues before the real ``huggingface-cli upload``.  It is
+deliberately NOT an HF look-alike: pixel fidelity is out of scope
+and the chrome is approximate.
 
-The rendered HTML is a deterministic function of the input README
-(no ``now()``, no random) — same input → byte-identical HTML.  The
-committed samples at
-``release/_preview_committed/huggingface_{public,instructor}.html``
-are the audit-artefact-sync gate.
+Design rationale + decision log: ``docs/release/preview_pages_design.md``.
 
 Usage::
 
-    # Public variant on http://localhost:8766.
-    python scripts/preview_hf_page.py --open-browser
+    python scripts/preview_hf_page.py --open-browser              # public variant
+    python scripts/preview_hf_page.py --variant=instructor        # companion repo
+    python scripts/preview_hf_page.py --no-serve                  # build only
 
-    # Instructor companion variant (separate input README).
-    python scripts/preview_hf_page.py --variant=instructor
-
-    # Just build the HTML (CI / inspection).
-    python scripts/preview_hf_page.py --no-serve
-
-Exit codes: 0 success / 2 pre-flight error (missing README,
-malformed YAML frontmatter, missing cover image).
+Exit codes: 0 success / 2 pre-flight error.
 """
 
 from __future__ import annotations
 
 import argparse
-import http.server
 import re
 import sys
-import webbrowser
 from collections.abc import Sequence
 from dataclasses import dataclass
 from pathlib import Path
@@ -50,7 +41,8 @@
 # Make ``scripts/`` importable regardless of how this file is loaded.
 sys.path.insert(0, str(Path(__file__).resolve().parent))
 
-from _release_common import replace_file  # noqa: E402 — must follow sys.path insert
+from _preview_common import escape, serve  # noqa: E402 — must follow sys.path insert
+from _release_common import replace_file  # noqa: E402
 
 # ---------------------------------------------------------------------------
 # Defaults
@@ -70,28 +62,20 @@
 
 
 # ---------------------------------------------------------------------------
-# Markdown rendering (gated behind the [publish] extra)
+# Markdown rendering (markdown-it-py is in [dev] AND [publish])
 # ---------------------------------------------------------------------------
 
 
 def _render_markdown(text: str) -> str:
-    """Render ``text`` to HTML using markdown-it-py in GFM-like mode.
-
-    Same posture + dep gating as the Kaggle preview (markdown-it-py
-    via the ``[publish]`` extra; ``linkify`` disabled so the
-    transitive ``linkify-it-py`` dep is not required).  See
-    ``preview_kaggle_page.py`` for the rationale.
-    """
+    """Render ``text`` to HTML.  See preview_kaggle_page._render_markdown."""
 
     try:
         from markdown_it import MarkdownIt
-    except ImportError as exc:  # pragma: no cover — gated by extra
+    except ImportError as exc:  # pragma: no cover — dep is in [dev]
         raise ImportError(
-            "markdown-it-py is required for the Hugging Face preview page. "
-            "Install the publish extra: pip install -e '.[publish]'"
+            "markdown-it-py is required.  pip install -e '.[dev]' (or [publish])."
         ) from exc
-    md = MarkdownIt("gfm-like").disable("linkify")
-    return md.render(text)
+    return MarkdownIt("gfm-like").disable("linkify").render(text)
 
 
 # ---------------------------------------------------------------------------
@@ -99,8 +83,7 @@ def _render_markdown(text: str) -> str:
 # ---------------------------------------------------------------------------
 
 #: HF dataset cards open with a ``---`` block of YAML, then the body.
-#: This regex pulls them apart in one shot; ``re.DOTALL`` is essential
-#: because the YAML spans multiple lines.
+#: ``re.DOTALL`` matters because the YAML spans multiple lines.
 _FRONTMATTER_RE: Final[re.Pattern[str]] = re.compile(
     r"\A---\n(?P<yaml>.*?)\n---\n(?P<body>.*)\Z",
     re.DOTALL,
@@ -119,8 +102,8 @@ def parse_hf_readme(text: str) -> HuggingFaceDoc:
     """Split an HF README into YAML frontmatter + Markdown body.
 
     Raises ``ValueError`` if the document does not open with a
-    ``---``-delimited frontmatter block (every HF dataset card MUST
-    have one — the renderer cannot mock the page without it).
+    ``---``-delimited frontmatter block, or if the YAML is not a
+    mapping (every HF dataset card MUST satisfy both).
     """
 
     match = _FRONTMATTER_RE.match(text)
@@ -141,27 +124,14 @@ def parse_hf_readme(text: str) -> HuggingFaceDoc:
 # ---------------------------------------------------------------------------
 
 
-def _escape(value: str) -> str:
-    """HTML-escape a single attribute / text value."""
-
-    return (
-        str(value)
-        .replace("&", "&amp;")
-        .replace("<", "&lt;")
-        .replace(">", "&gt;")
-        .replace('"', "&quot;")
-        .replace("'", "&#39;")
-    )
-
-
 def _render_header(frontmatter: dict[str, Any]) -> str:
-    """Render the page header — pretty_name, licence pill, sizes."""
+    """Render the page header — pretty_name + licence / task / size pills."""
 
-    pretty_name = _escape(str(frontmatter.get("pretty_name", "")))
-    license_id = _escape(str(frontmatter.get("license", "")))
-    languages = ", ".join(_escape(str(x)) for x in frontmatter.get("language", []) or [])
-    sizes = ", ".join(_escape(str(x)) for x in frontmatter.get("size_categories", []) or [])
-    tasks = ", ".join(_escape(str(x)) for x in frontmatter.get("task_categories", []) or [])
+    pretty_name = escape(str(frontmatter.get("pretty_name", "")))
+    license_id = escape(str(frontmatter.get("license", "")))
+    languages = ", ".join(escape(str(x)) for x in frontmatter.get("language", []) or [])
+    sizes = ", ".join(escape(str(x)) for x in frontmatter.get("size_categories", []) or [])
+    tasks = ", ".join(escape(str(x)) for x in frontmatter.get("task_categories", []) or [])
     return f"""<header class="dataset-header">
   <div class="dataset-header__namespace">huggingface.co/datasets</div>
   <h1 class="dataset-header__title">{pretty_name}</h1>
@@ -175,22 +145,22 @@ def _render_header(frontmatter: dict[str, Any]) -> str:
 
 
 def _render_tags(frontmatter: dict[str, Any]) -> str:
-    """Render the tag chip row (mimics HF tag pills under the header)."""
+    """Render the tag chip row (omitted when no tags)."""
 
     tags = frontmatter.get("tags", []) or []
     if not tags:
         return ""
-    chips = " ".join(f'<span class="chip">{_escape(str(t))}</span>' for t in tags)
+    chips = " ".join(f'<span class="chip">{escape(str(t))}</span>' for t in tags)
     return f'<section class="tags">\n  {chips}\n</section>'
 
 
 def _render_configs(frontmatter: dict[str, Any]) -> str:
     """Render the configs dropdown — one entry per ``configs[]`` block.
 
-    Mirrors HF's "Subset" selector at the top of the dataset viewer.
-    Each config lists its data_files (split → path) so the test can
-    assert every config block from the YAML round-trips through to
-    the rendered page.  The default config is flagged.
+    This is the load-bearing inventory of what the YAML declares: each
+    config plus its train/validation/test data_files.  It corresponds
+    to HF's "Subset" selector at the top of the dataset viewer.  The
+    default config is flagged with a single ``badge--default`` instance.
     """
 
     configs = frontmatter.get("configs", []) or []
@@ -198,13 +168,13 @@ def _render_configs(frontmatter: dict[str, Any]) -> str:
         return '<section class="configs"><p>No configs declared.</p></section>'
     blocks: list[str] = []
     for config in configs:
-        config_name = _escape(str(config.get("config_name", "")))
+        config_name = escape(str(config.get("config_name", "")))
         is_default = bool(config.get("default"))
         default_badge = ' <span class="badge badge--default">default</span>' if is_default else ""
         data_files = config.get("data_files", []) or []
         rows = "\n".join(
-            f"      <tr><td>{_escape(str(df.get('split', '')))}</td>"
-            f"<td><code>{_escape(str(df.get('path', '')))}</code></td></tr>"
+            f"      <tr><td>{escape(str(df.get('split', '')))}</td>"
+            f"<td><code>{escape(str(df.get('path', '')))}</code></td></tr>"
             for df in data_files
         )
         blocks.append(
@@ -224,39 +194,6 @@ def _render_configs(frontmatter: dict[str, Any]) -> str:
 </section>"""
 
 
-def _render_file_tree(frontmatter: dict[str, Any], variant: str) -> str:
-    """Render the file tree.
-
-    HF doesn't ship a structured file inventory in the dataset card
-    YAML the way Kaggle does — ``data_files`` are the only paths
-    declared in the frontmatter.  We list each declared path under
-    its config heading.  The tree is therefore narrower than the
-    real dataset (which also has ``manifest.json``, ``tables/``, etc.)
-    but matches what the YAML knows about, which is what the publish
-    runbook is trying to verify.
-    """
-
-    configs = frontmatter.get("configs", []) or []
-    paths: list[tuple[str, str]] = []
-    for config in configs:
-        config_name = str(config.get("config_name", ""))
-        for df in config.get("data_files", []) or []:
-            paths.append((config_name, str(df.get("path", ""))))
-    if not paths:
-        return ""
-    items = "\n".join(
-        f'    <li class="file"><span class="file__config">[{_escape(c)}]</span> '
-        f'<code class="file__path">{_escape(p)}</code></li>'
-        for c, p in paths
-    )
-    return f"""<section class="files">
-  <h2 class="section__heading">Files declared in YAML <span class="section__count">({len(paths)} files / variant: {_escape(variant)})</span></h2>
-  <ul class="files__list">
-{items}
-  </ul>
-</section>"""
-
-
 def _render_readme_body(body_md: str) -> str:
     """Render the README body (everything after the YAML)."""
 
@@ -266,21 +203,18 @@ def _render_readme_body(body_md: str) -> str:
 def _render_footer(frontmatter: dict[str, Any], variant: str) -> str:
     """Render the licence + variant note footer."""
 
-    license_id = _escape(str(frontmatter.get("license", "")))
+    license_id = escape(str(frontmatter.get("license", "")))
     return f"""<footer class="dataset-footer">
   <div class="dataset-footer__license">License: {license_id}</div>
-  <div class="dataset-footer__variant">Variant: <code>{_escape(variant)}</code></div>
-  <div class="dataset-footer__note">Local Hugging Face preview rendered by scripts/preview_hf_page.py — not the live dataset page.</div>
+  <div class="dataset-footer__variant">Variant: <code>{escape(variant)}</code></div>
+  <div class="dataset-footer__note">Local Hugging Face publication-readiness preview rendered by scripts/preview_hf_page.py — not the live dataset page.</div>
 </footer>"""
 
 
 # ---------------------------------------------------------------------------
-# HTML wrapper + minimal HF-ish CSS
+# HTML wrapper + minimal CSS
 # ---------------------------------------------------------------------------
 
-#: Inlined for the same reasons as the Kaggle preview — single
-#: self-contained file, simple byte-comparison in the audit-sync test,
-#: works without the server.
 _PAGE_CSS: Final[str] = """\
 :root { --bg:#fff; --fg:#1f2937; --muted:#6b7280; --accent:#ff9d00; --border:#e5e7eb; --pill-bg:#f3f4f6; --code-bg:#f9fafb; }
 body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, sans-serif; color: var(--fg); background: var(--bg); margin: 0; padding: 0; line-height: 1.6; }
@@ -294,7 +228,7 @@ def _render_footer(frontmatter: dict[str, Any], variant: str) -> str:
 .chip { display: inline-block; background: var(--pill-bg); border-radius: 12px; padding: 2px 10px; margin: 2px 4px 2px 0; font-size: 0.85em; color: var(--fg); }
 .section__heading { font-size: 1.3em; border-bottom: 2px solid var(--accent); padding-bottom: 4px; margin-top: 32px; }
 .section__count { color: var(--muted); font-size: 0.7em; font-weight: normal; }
-.config, .file-tree { border: 1px solid var(--border); border-radius: 4px; padding: 8px 12px; margin: 8px 0; }
+.config { border: 1px solid var(--border); border-radius: 4px; padding: 8px 12px; margin: 8px 0; }
 .config__name { cursor: pointer; font-weight: 600; }
 .config__count { color: var(--muted); font-weight: normal; font-size: 0.85em; }
 .badge { display: inline-block; padding: 1px 8px; border-radius: 4px; font-size: 0.75em; font-weight: 600; vertical-align: middle; margin-left: 4px; }
@@ -302,11 +236,6 @@ def _render_footer(frontmatter: dict[str, Any], variant: str) -> str:
 .config__table { width: 100%; border-collapse: collapse; margin-top: 8px; font-size: 0.9em; }
 .config__table th, .config__table td { text-align: left; padding: 6px 8px; border-bottom: 1px solid var(--border); }
 .config__table th { background: var(--pill-bg); font-weight: 600; }
-.files__list { list-style: none; padding-left: 0; margin: 0; }
-.file { padding: 4px 0; border-bottom: 1px dotted var(--border); }
-.file:last-child { border-bottom: none; }
-.file__config { color: var(--muted); font-size: 0.85em; margin-right: 8px; }
-.file__path { color: var(--accent); }
 .readme { margin: 24px 0; }
 .readme code { background: var(--code-bg); padding: 1px 4px; border-radius: 2px; font-size: 0.9em; }
 .readme pre { background: var(--code-bg); padding: 12px; border-radius: 4px; overflow-x: auto; }
@@ -320,19 +249,11 @@ def _render_footer(frontmatter: dict[str, Any], variant: str) -> str:
 
 
 def _wrap_html(*, title: str, body: str) -> str:
-    """Wrap the rendered sections in page chrome.
-
-    Order: header → tags → configs → files → readme body → footer.
-    Configs sit above the README because that's the primary affordance
-    on the live HF dataset page (the user picks a subset before
-    reading the body).
-    """
-
     return f"""<!DOCTYPE html>
 <html lang="en">
 <head>
   <meta charset="utf-8">
-  <title>HF preview — {_escape(title)}</title>
+  <title>HF preview — {escape(title)}</title>
   <style>{_PAGE_CSS}</style>
 </head>
 <body>
@@ -352,16 +273,14 @@ def _wrap_html(*, title: str, body: str) -> str:
 def render_hf_html(doc: HuggingFaceDoc, *, variant: str) -> str:
     """Render the full HF preview HTML.
 
-    Pure function: same ``(doc, variant)`` → byte-identical HTML.
-    No I/O, no clock, no random.  Tests rely on this for the
-    audit-artefact-sync gate.
+    Pure: same ``(doc, variant)`` → byte-identical HTML.  No I/O,
+    no clock, no random.
     """
 
     body_parts = [
         _render_header(doc.frontmatter),
         _render_tags(doc.frontmatter),
         _render_configs(doc.frontmatter),
-        _render_file_tree(doc.frontmatter, variant=variant),
         _render_readme_body(doc.body),
         _render_footer(doc.frontmatter, variant=variant),
     ]
@@ -372,7 +291,7 @@ def render_hf_html(doc: HuggingFaceDoc, *, variant: str) -> str:
 
 
 # ---------------------------------------------------------------------------
-# Driver — reads inputs, writes HTML, optionally serves
+# Driver
 # ---------------------------------------------------------------------------
 
 
@@ -390,39 +309,39 @@ class PreviewConfig:
 
 @dataclass(frozen=True)
 class PreviewOutcome:
-    """Return value from :func:`run_preview` — used by tests + CLI."""
+    """Return value from :func:`run_preview`.
+
+    ``cover_path`` is always set on success — the driver always
+    copies the cover into the preview tree.
+    """
 
     html_path: Path
-    cover_path: Path | None
+    cover_path: Path
 
 
 def _resolve_cover_image(release_dir: Path, variant: str) -> Path:
     """Locate the cover image for the variant.
 
-    The HF packager (PR 5.2) copies the cover image into both
-    ``release/huggingface/`` and ``release/huggingface-instructor/``
-    next to each README.  Prefer the variant-tree copy (closest to
-    the artefact the publish PR will upload); fall back to
-    ``release_dir`` for the case where the assembler hasn't been
-    run yet.
+    Lookup order: variant-specific upload tree (assembled by the HF
+    packager — gitignored, absent on a fresh checkout) → committed
+    master copy under ``release_dir``.
     """
 
     variant_dir = "huggingface" if variant == "public" else "huggingface-instructor"
-    candidates = [
+    for candidate in (
         release_dir / variant_dir / "dataset-cover-image.png",
         release_dir / "dataset-cover-image.png",
-    ]
-    for candidate in candidates:
+    ):
         if candidate.is_file():
             return candidate
-    return candidates[0]
+    return release_dir / variant_dir / "dataset-cover-image.png"
 
 
 def run_preview(config: PreviewConfig) -> PreviewOutcome:
     """Render the preview HTML, optionally serve it.
 
     Pre-flight failures (missing README, malformed YAML, missing
-    cover image, unknown variant) raise — the CLI converts to rc=2.
+    cover, unknown variant) raise; the CLI converts to rc=2.
     """
 
     if config.variant not in VALID_VARIANTS:
@@ -432,17 +351,15 @@ def run_preview(config: PreviewConfig) -> PreviewOutcome:
     if not readme_path.is_file():
         raise FileNotFoundError(
             f"HF README not found at {readme_path}; "
-            f"regenerate via scripts/package_hf_release.py "
-            f"--variant={config.variant} first"
+            f"regenerate via scripts/package_hf_release.py --variant={config.variant} first"
         )
     doc = parse_hf_readme(readme_path.read_text(encoding="utf-8"))
 
     cover_src = _resolve_cover_image(config.release_dir, config.variant)
     if not cover_src.is_file():
         raise FileNotFoundError(
-            f"cover image not found at {cover_src} (looked in "
-            f"{config.release_dir}/huggingface{'-instructor' if config.variant == 'instructor' else ''}/ "
-            f"and {config.release_dir}/)"
+            f"cover image not found at {cover_src} "
+            f"(looked in the {config.variant} upload tree and {config.release_dir}/)"
         )
 
     config.out_dir.mkdir(parents=True, exist_ok=True)
@@ -453,37 +370,11 @@ def run_preview(config: PreviewConfig) -> PreviewOutcome:
     replace_file(cover_src, cover_dst)
 
     if config.serve:
-        _serve(config.out_dir, config.port, open_browser=config.open_browser)
+        serve(config.out_dir, config.port, open_browser=config.open_browser)
 
     return PreviewOutcome(html_path=html_path, cover_path=cover_dst)
 
 
-def _serve(directory: Path, port: int, *, open_browser: bool) -> None:
-    """Start a stdlib HTTP server rooted at ``directory`` and block.
-
-    Same posture as the Kaggle preview — see that module for the
-    ``allow_reuse_address`` rationale.
-    """
-
-    handler_factory = _make_handler_factory(directory)
-    url = f"http://localhost:{port}/"
-    print(f"serving {directory} at {url} — Ctrl-C to stop", file=sys.stderr)
-    if open_browser:
-        webbrowser.open(url)
-    with http.server.ThreadingHTTPServer(("", port), handler_factory) as httpd:
-        httpd.serve_forever()
-
-
-def _make_handler_factory(directory: Path) -> type[http.server.SimpleHTTPRequestHandler]:
-    resolved = str(directory.resolve())
-
-    class _Handler(http.server.SimpleHTTPRequestHandler):
-        def __init__(self, *args: Any, **kwargs: Any) -> None:
-            super().__init__(*args, directory=resolved, **kwargs)
-
-    return _Handler
-
-
 # ---------------------------------------------------------------------------
 # CLI
 # ---------------------------------------------------------------------------
diff --git a/scripts/preview_kaggle_page.py b/scripts/preview_kaggle_page.py
index 8e37011..a48c0cb 100644
--- a/scripts/preview_kaggle_page.py
+++ b/scripts/preview_kaggle_page.py
@@ -1,57 +1,44 @@
 #!/usr/bin/env python3
-"""Render an offline mock of the Kaggle dataset page.
-
-PR 7.2 — middle PR in Phase 7 (LLM critique + publish).  Reads the
-artefacts the publish PR will upload (``release/kaggle/dataset-metadata.json``
-+ ``release/dataset-cover-image.png``) and renders an HTML page that
-mimics the public Kaggle dataset view: header (title / subtitle /
-licence / id pill / update-frequency pill), cover image, rendered
-description (the inlined README body), file tree of declared
-resources, schema/columns tables for every tabular resource, and a
-licence + sources footer.
-
-The page exists for human click-through review BEFORE the maintainer
-runs the real ``kaggle datasets create`` upload (PR 7.3).  Cached
-previews on the live page are expensive to roll back, so the
-publish runbook in PR 7.3 cites this script as a required pre-flight.
-
-The rendered HTML is a deterministic function of the input artefacts
-(no ``now()``, no random) — same metadata + cover-image filename →
-byte-identical HTML.  The committed sample at
-``release/_preview_committed/kaggle.html`` is the audit-artefact-sync
-gate (mirrors PR 4.1 / 5.1 / 5.2 / 7.1).
+"""Local publication-readiness preview for the Kaggle dataset page.
 
-Usage::
+PR 7.2.  Reads the artefacts the publish PR will upload
+(``release/kaggle/dataset-metadata.json`` + cover image), renders an
+offline HTML page that surfaces the published structure (header,
+cover, description, file tree, schema tables, sources, footer), and
+optionally serves it on ``http://localhost:8765``.
+
+This is a *publication-readiness* preview — structured rendering of
+the upload artefacts that helps catch link / config / column-listing
+issues before the real ``kaggle datasets create`` upload.  It is
+deliberately NOT a Kaggle look-alike: pixel fidelity is out of scope
+and the chrome (CSS palette, layout) is approximate.
 
-    # Render + serve on http://localhost:8765, pop a browser tab.
-    python scripts/preview_kaggle_page.py --open-browser
+Design rationale + decision log: ``docs/release/preview_pages_design.md``.
 
-    # Just build the HTML (CI / inspection); no server.
-    python scripts/preview_kaggle_page.py --no-serve
+Usage::
 
-Exit codes: 0 success / 2 pre-flight error (missing metadata,
-missing cover image, malformed JSON).
+    python scripts/preview_kaggle_page.py --open-browser  # serve + browser
+    python scripts/preview_kaggle_page.py --no-serve      # build only
+
+Exit codes: 0 success / 2 pre-flight error.
 """
 
 from __future__ import annotations
 
 import argparse
-import http.server
 import json
-import re
 import sys
-import webbrowser
 from collections.abc import Sequence
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Any, Final
 
 # Make ``scripts/`` importable regardless of how this file is loaded
-# (CLI entrypoint, ``importlib.util.spec_from_file_location`` from
-# tests).  Mirrors the pattern in ``package_kaggle_release.py``.
+# (CLI entrypoint, ``importlib.util.spec_from_file_location`` from tests).
 sys.path.insert(0, str(Path(__file__).resolve().parent))
 
-from _release_common import replace_file  # noqa: E402 — must follow sys.path insert
+from _preview_common import escape, serve  # noqa: E402 — must follow sys.path insert
+from _release_common import replace_file  # noqa: E402
 
 # ---------------------------------------------------------------------------
 # Defaults
@@ -63,60 +50,41 @@
 
 
 # ---------------------------------------------------------------------------
-# Markdown rendering (gated behind the [publish] extra)
+# Markdown rendering (markdown-it-py is in [dev] AND [publish])
 # ---------------------------------------------------------------------------
 
 
 def _render_markdown(text: str) -> str:
     """Render ``text`` (the inlined README body) to HTML.
 
-    Uses ``markdown-it-py`` in GFM-like mode (tables, fenced code,
-    autolink, strikethrough) — closest match to how Kaggle renders
-    its description block.  The ``[publish]`` extra (alongside
-    ``datasets`` / ``kaggle``) is the install path; absent dep
-    raises a clear instruction rather than a cryptic ``ImportError``.
-    Footnotes (``[^foo]``) render as literal text, which is faithful
-    enough — Kaggle does not invest in footnote rendering either.
+    ``gfm-like`` preset gives tables / fenced code / strikethrough;
+    ``linkify`` is explicitly disabled so the optional
+    ``linkify-it-py`` transitive dep is not required.
     """
 
     try:
         from markdown_it import MarkdownIt
-    except ImportError as exc:  # pragma: no cover — gated by extra
+    except ImportError as exc:  # pragma: no cover — dep is in [dev]
         raise ImportError(
-            "markdown-it-py is required for the Kaggle preview page. "
-            "Install the publish extra: pip install -e '.[publish]'"
+            "markdown-it-py is required.  pip install -e '.[dev]' (or [publish])."
         ) from exc
-    # ``gfm-like`` enables linkify by default, which requires the
-    # separate ``linkify-it-py`` package; we explicitly turn it off so
-    # the preview does not pull a transitive dep beyond markdown-it-py.
-    # Tables / fenced code / strikethrough remain on (the bits that
-    # actually matter for faithful Kaggle/HF rendering).
-    md = MarkdownIt("gfm-like").disable("linkify")
-    return md.render(text)
+    return MarkdownIt("gfm-like").disable("linkify").render(text)
 
 
 # ---------------------------------------------------------------------------
-# Tier inference + file tree
+# Tier inference
 # ---------------------------------------------------------------------------
 
-#: Kaggle's CLI emits resource paths like ``intro/lead_scoring.csv`` —
-#: the leading path segment is the tier name.  We group resources by
-#: this segment so the rendered file tree mirrors the bundle layout
-#: the user will see on Kaggle.
-_TIER_PATH_RE: Final[re.Pattern[str]] = re.compile(r"^([^/]+)/")
-
 
 def _tier_of(resource_path: str) -> str:
     """Return the leading path segment of ``resource_path``, or ``""``.
 
-    Used to bucket resources by tier in the file tree.  An empty
-    string indicates a top-level resource (none of these are emitted
-    by the Kaggle packager today, but we tolerate them for forward
-    compatibility).
+    Used to bucket resources by tier in the file tree.  An empty string
+    means a top-level resource (none today; kept for forward compatibility).
     """
 
-    match = _TIER_PATH_RE.match(resource_path)
-    return match.group(1) if match else ""
+    parts = resource_path.split("/", 1)
+    return parts[0] if len(parts) > 1 else ""
 
 
 # ---------------------------------------------------------------------------
@@ -125,14 +93,18 @@ def _tier_of(resource_path: str) -> str:
 
 
 def _render_header(metadata: dict[str, Any]) -> str:
-    """Render the page header — title, subtitle, id pill, licence pill."""
+    """Render the page header — title, subtitle, id, licence, frequency.
 
-    title = _escape(metadata["title"])
-    subtitle = _escape(metadata["subtitle"])
-    dataset_id = _escape(metadata["id"])
-    license_name = _escape(metadata["licenses"][0]["name"]) if metadata.get("licenses") else ""
-    update_freq = _escape(metadata.get("expectedUpdateFrequency", ""))
-    visibility = "Private" if metadata.get("isPrivate") else "Public"
+    Visibility is intentionally NOT rendered: Kaggle's public dataset
+    page does not display ``isPrivate``, so showing it here would
+    misrepresent what public viewers see.
+    """
+
+    title = escape(metadata["title"])
+    subtitle = escape(metadata["subtitle"])
+    dataset_id = escape(metadata["id"])
+    license_name = escape(metadata["licenses"][0]["name"]) if metadata.get("licenses") else ""
+    update_freq = escape(metadata.get("expectedUpdateFrequency", ""))
 
     return f"""<header class="dataset-header">
   <div class="dataset-header__id">{dataset_id}</div>
@@ -141,7 +113,6 @@ def _render_header(metadata: dict[str, Any]) -> str:
   <ul class="dataset-header__pills">
     <li class="pill pill--license">License: {license_name}</li>
     <li class="pill pill--frequency">Updates: {update_freq}</li>
-    <li class="pill pill--visibility">Visibility: {visibility}</li>
   </ul>
 </header>"""
 
@@ -149,13 +120,12 @@ def _render_header(metadata: dict[str, Any]) -> str:
 def _render_cover(cover_image_filename: str) -> str:
     """Render the cover-image block.
 
-    The ``src`` is a sibling-relative path so the same HTML works
-    against both the runtime preview tree (where the image was copied
-    in) and the committed sample (used for byte-equality only — the
-    sample is not served).
+    Sibling-relative ``src`` so the same HTML works against both the
+    runtime preview tree (where the image was copied in) and the
+    committed sample (which is byte-compared, not served).
     """
 
-    src = _escape(cover_image_filename)
+    src = escape(cover_image_filename)
     return f"""<section class="cover">
   <img class="cover__image" src="{src}" alt="Dataset cover image">
 </section>"""
@@ -164,30 +134,23 @@ def _render_cover(cover_image_filename: str) -> str:
 def _render_description(description_md: str) -> str:
     """Render the inlined README body as HTML."""
 
-    body = _render_markdown(description_md)
-    return f'<section class="description">\n{body}</section>'
+    return f'<section class="description">\n{_render_markdown(description_md)}</section>'
 
 
 def _render_file_tree(resources: list[dict[str, Any]]) -> str:
-    """Render the file tree, grouped by tier (leading path segment).
-
-    Inside each tier, files appear in declaration order — matches the
-    order Kaggle renders the resources column.  Each entry is a
-    monospace path + the resource description.
-    """
+    """Render the file tree, grouped by tier (leading path segment)."""
 
     by_tier: dict[str, list[dict[str, Any]]] = {}
     for resource in resources:
-        tier = _tier_of(resource["path"])
-        by_tier.setdefault(tier, []).append(resource)
+        by_tier.setdefault(_tier_of(resource["path"]), []).append(resource)
 
     blocks: list[str] = []
     for tier, tier_resources in by_tier.items():
-        tier_label = _escape(tier) if tier else "(top-level)"
+        tier_label = escape(tier) if tier else "(top-level)"
         items: list[str] = []
         for resource in tier_resources:
-            path = _escape(resource["path"])
-            description = _escape(resource.get("description", ""))
+            path = escape(resource["path"])
+            description = escape(resource.get("description", ""))
             items.append(
                 f'    <li class="file"><code class="file__path">{path}</code>'
                 f'<span class="file__desc">{description}</span></li>'
@@ -208,14 +171,7 @@ def _render_file_tree(resources: list[dict[str, Any]]) -> str:
 
 
 def _render_schema_tables(resources: list[dict[str, Any]]) -> str:
-    """Render one schema/columns table per tabular resource.
-
-    Mimics Kaggle's "Data Card" expandable per-file column listing.
-    Resources without a ``schema`` (markdown / JSON) are skipped —
-    same posture as Kaggle.  Column count appears in the heading so
-    the test can assert the table is exhaustive without parsing the
-    DOM.
-    """
+    """Render one schema/columns table per tabular resource."""
 
     blocks: list[str] = []
     total_columns = 0
@@ -227,12 +183,12 @@ def _render_schema_tables(resources: list[dict[str, Any]]) -> str:
         if not fields:
             continue
         total_columns += len(fields)
-        path = _escape(resource["path"])
+        path = escape(resource["path"])
         rows: list[str] = []
         for fd in fields:
-            name = _escape(fd.get("name", ""))
-            ftype = _escape(fd.get("type", ""))
-            description = _escape(fd.get("description", ""))
+            name = escape(fd.get("name", ""))
+            ftype = escape(fd.get("type", ""))
+            description = escape(fd.get("description", ""))
             rows.append(
                 f"      <tr>"
                 f'<td class="col__name"><code>{name}</code></td>'
@@ -258,14 +214,14 @@ def _render_schema_tables(resources: list[dict[str, Any]]) -> str:
 
 
 def _render_sources(metadata: dict[str, Any]) -> str:
-    """Render the user-specified sources block."""
+    """Render the user-specified sources block (omitted when empty)."""
 
     sources = metadata.get("userSpecifiedSources", []) or []
     if not sources:
         return ""
     items = "\n".join(
-        f'    <li><a href="{_escape(s["url"])}" target="_blank" rel="noopener noreferrer">'
-        f"{_escape(s['title'])}</a></li>"
+        f'    <li><a href="{escape(s["url"])}" target="_blank" rel="noopener noreferrer">'
+        f"{escape(s['title'])}</a></li>"
         for s in sources
     )
     return f"""<section class="sources">
@@ -280,23 +236,22 @@ def _render_footer(metadata: dict[str, Any]) -> str:
     """Render the licence + keywords footer."""
 
     keywords = metadata.get("keywords", []) or []
-    keyword_chips = " ".join(f'<span class="chip">{_escape(k)}</span>' for k in keywords)
-    license_name = _escape(metadata["licenses"][0]["name"]) if metadata.get("licenses") else ""
+    keyword_chips = " ".join(f'<span class="chip">{escape(k)}</span>' for k in keywords)
+    license_name = escape(metadata["licenses"][0]["name"]) if metadata.get("licenses") else ""
     return f"""<footer class="dataset-footer">
   <div class="dataset-footer__keywords">{keyword_chips}</div>
   <div class="dataset-footer__license">License: {license_name}</div>
-  <div class="dataset-footer__note">Local Kaggle preview rendered by scripts/preview_kaggle_page.py — not the live dataset page.</div>
+  <div class="dataset-footer__note">Local Kaggle publication-readiness preview rendered by scripts/preview_kaggle_page.py — not the live dataset page.</div>
 </footer>"""
 
 
 # ---------------------------------------------------------------------------
-# HTML wrapper + minimal Kaggle-ish CSS
+# HTML wrapper + minimal CSS
 # ---------------------------------------------------------------------------
 
-#: Kept inline rather than served as a separate ``style.css`` so the
-#: rendered HTML is a single self-contained file — easier to inspect,
-#: easier to byte-compare in the audit-artefact-sync test, and works
-#: without a server (open the committed sample directly in a browser).
+#: Inlined for a single self-contained HTML file (easier inspection,
+#: simpler byte-compare in the regeneration-discipline test, works
+#: without a server).  Palette is approximate, not branded.
 _PAGE_CSS: Final[str] = """\
 :root { --bg:#fff; --fg:#202124; --muted:#5f6368; --accent:#20beff; --border:#e0e0e0; --pill-bg:#f1f3f4; }
 body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif; color: var(--fg); background: var(--bg); margin: 0; padding: 0; line-height: 1.5; }
@@ -340,18 +295,11 @@ def _render_footer(metadata: dict[str, Any]) -> str:
 
 
 def _wrap_html(*, title: str, body: str) -> str:
-    """Wrap rendered sections in the page chrome.
-
-    Order: header → cover → description → files → schemas → sources →
-    footer.  Description sits above files because Kaggle leads with
-    the dataset card on the public page.
-    """
-
     return f"""<!DOCTYPE html>
 <html lang="en">
 <head>
   <meta charset="utf-8">
-  <title>Kaggle preview — {_escape(title)}</title>
+  <title>Kaggle preview — {escape(title)}</title>
   <style>{_PAGE_CSS}</style>
 </head>
 <body>
@@ -363,24 +311,6 @@ def _wrap_html(*, title: str, body: str) -> str:
 """
 
 
-def _escape(value: str) -> str:
-    """HTML-escape a single attribute / text value.
-
-    Inlined rather than importing ``html.escape`` so the renderer's
-    surface stays small and the (well-tested) substitution is local
-    and obvious.
-    """
-
-    return (
-        str(value)
-        .replace("&", "&amp;")
-        .replace("<", "&lt;")
-        .replace(">", "&gt;")
-        .replace('"', "&quot;")
-        .replace("'", "&#39;")
-    )
-
-
 # ---------------------------------------------------------------------------
 # Top-level renderer
 # ---------------------------------------------------------------------------
@@ -389,9 +319,8 @@ def _escape(value: str) -> str:
 def render_kaggle_html(metadata: dict[str, Any], cover_image_filename: str) -> str:
     """Render the full Kaggle preview HTML.
 
-    Pure function: same ``(metadata, cover_image_filename)`` →
-    byte-identical HTML.  No I/O, no clock, no random.  Tests rely
-    on this for the audit-artefact-sync gate.
+    Pure: same ``(metadata, cover_image_filename)`` → byte-identical
+    HTML.  No I/O, no clock, no random.
     """
 
     body_parts = [
@@ -407,18 +336,13 @@ def render_kaggle_html(metadata: dict[str, Any], cover_image_filename: str) -> s
 
 
 # ---------------------------------------------------------------------------
-# Driver — reads inputs, writes HTML, optionally serves
+# Driver
 # ---------------------------------------------------------------------------
 
 
 @dataclass(frozen=True)
 class PreviewConfig:
-    """Frozen driver config.
-
-    Mirrors the ``DriverConfig`` posture in
-    ``scripts/run_llm_critique.py`` — building this from CLI args
-    keeps the test surface a Python-level call rather than an exec.
-    """
+    """Frozen driver config — built from CLI args or test input."""
 
     release_dir: Path
     out_dir: Path
@@ -429,10 +353,14 @@ class PreviewConfig:
 
 @dataclass(frozen=True)
 class PreviewOutcome:
-    """Return value from :func:`run_preview` — used by tests + CLI."""
+    """Return value from :func:`run_preview`.
+
+    ``cover_path`` is always set on success — the driver always
+    copies the cover into the preview tree.
+    """
 
     html_path: Path
-    cover_path: Path | None
+    cover_path: Path
 
 
 def _resolve_cover_image(release_dir: Path, image_name: str) -> Path:
@@ -441,27 +369,21 @@ def _resolve_cover_image(release_dir: Path, image_name: str) -> Path:
     Lookup order: ``release/kaggle/<image_name>`` (assembled
     upload-tree copy, present after the maintainer runs the Kaggle
     packager — gitignored, so absent on a fresh checkout) →
-    ``release/<image_name>`` (the committed master copy).  Returning
-    the resolved path here mirrors ``_release_common.resolve_cover_image_path``
-    so the assembler and inputs cannot disagree.
+    ``release/<image_name>`` (the committed master copy).
     """
 
-    candidates = [
-        release_dir / "kaggle" / image_name,
-        release_dir / image_name,
-    ]
-    for candidate in candidates:
+    for candidate in (release_dir / "kaggle" / image_name, release_dir / image_name):
         if candidate.is_file():
             return candidate
-    return candidates[0]  # surface the missing-file error against the canonical location
+    return release_dir / "kaggle" / image_name  # surface the missing-file error here
 
 
 def run_preview(config: PreviewConfig) -> PreviewOutcome:
     """Render the preview HTML, optionally serve it.
 
-    Pre-flight failures (missing metadata, malformed JSON, missing
-    cover image) raise — the CLI converts to rc=2.  Validation
-    discipline mirrors the Phase 5 packagers: build → validate → write.
+    Validation discipline: build → validate → write.  Pre-flight
+    failures (missing metadata, malformed JSON, missing cover) raise;
+    the CLI converts to rc=2.
     """
 
     metadata_path = config.release_dir / "kaggle" / "dataset-metadata.json"
@@ -489,55 +411,11 @@ def run_preview(config: PreviewConfig) -> PreviewOutcome:
     replace_file(cover_src, cover_dst)
 
     if config.serve:
-        _serve(config.out_dir, config.port, open_browser=config.open_browser)
+        serve(config.out_dir, config.port, open_browser=config.open_browser)
 
     return PreviewOutcome(html_path=html_path, cover_path=cover_dst)
 
 
-def _serve(directory: Path, port: int, *, open_browser: bool) -> None:
-    """Start a stdlib HTTP server rooted at ``directory`` and block.
-
-    Uses ``http.server.ThreadingHTTPServer`` so the browser can fetch
-    the cover image alongside the HTML without serialising requests.
-    ``ThreadingHTTPServer`` (unlike bare ``socketserver.ThreadingTCPServer``)
-    inherits ``allow_reuse_address = True`` from ``HTTPServer`` —
-    matters because Ctrl-C → re-run within ~60s would otherwise
-    raise ``OSError: [Errno 48] Address already in use`` while the
-    socket sits in TIME_WAIT.
-
-    Blocks on ``serve_forever()``; KeyboardInterrupt (Ctrl-C) is the
-    documented exit path.  No coverage here — tests exercise the
-    pure renderer and ``--no-serve`` path; serving is glue that
-    requires a live socket.
-    """
-
-    handler_factory = _make_handler_factory(directory)
-    url = f"http://localhost:{port}/"
-    print(f"serving {directory} at {url} — Ctrl-C to stop", file=sys.stderr)
-    if open_browser:
-        webbrowser.open(url)
-    with http.server.ThreadingHTTPServer(("", port), handler_factory) as httpd:
-        httpd.serve_forever()
-
-
-def _make_handler_factory(directory: Path) -> type[http.server.SimpleHTTPRequestHandler]:
-    """Build a handler subclass that serves from ``directory``.
-
-    ``SimpleHTTPRequestHandler`` ships a ``directory=`` kwarg in
-    Python 3.7+, but threading the path through ``socketserver``'s
-    ``RequestHandlerClass`` requires either a partial or a subclass.
-    Subclassing keeps the import surface stdlib-only.
-    """
-
-    resolved = str(directory.resolve())
-
-    class _Handler(http.server.SimpleHTTPRequestHandler):
-        def __init__(self, *args: Any, **kwargs: Any) -> None:
-            super().__init__(*args, directory=resolved, **kwargs)
-
-    return _Handler
-
-
 # ---------------------------------------------------------------------------
 # CLI
 # ---------------------------------------------------------------------------
diff --git a/tests/scripts/test_preview_hf_page.py b/tests/scripts/test_preview_hf_page.py
index 468ca1c..3db2d4b 100644
--- a/tests/scripts/test_preview_hf_page.py
+++ b/tests/scripts/test_preview_hf_page.py
@@ -162,16 +162,37 @@ def test_render_data_files_appear_under_each_config() -> None:
     assert "intermediate/train.parquet" in html
 
 
+def test_render_does_not_emit_files_declared_section() -> None:
+    """Real HF doesn't surface a "files declared in YAML" section —
+    showing one would be an internal-concept leak that omits the bulk
+    of the actual upload tree (manifest.json, tables/*.parquet, etc.).
+    The configs dropdown already lists every YAML-declared path; a
+    parallel files section would be misleading duplicate noise.
+    Folded back from self-review pass 3.
+    """
+
+    html = preview.render_hf_html(_minimal_doc(), variant="public")
+    assert "Files declared" not in html
+    assert "files / variant:" not in html  # legacy heading text
+    assert 'class="files"' not in html
+
+
 def test_render_includes_variant_in_footer() -> None:
     public = preview.render_hf_html(_minimal_doc(), variant="public")
     instructor = preview.render_hf_html(_minimal_doc(), variant="instructor")
     assert "Variant: <code>public</code>" in public
     assert "Variant: <code>instructor</code>" in instructor
-    # Variant differences are localised to the footer + file-tree
-    # heading; the rest of the output is identical.
-    public_no_variant = public.replace("public", "VARIANT")
-    instructor_no_variant = instructor.replace("instructor", "VARIANT")
-    assert public_no_variant == instructor_no_variant
+    # Variant differences are localised to the footer; the rest of
+    # the output is identical between variants.  Replace via the
+    # full ``Variant: <code>X</code>`` marker (not the bare word)
+    # so this assertion does not match "public" inside "publication"
+    # in the footer note (regression caught + folded back during
+    # self-review pass 3 reframing).
+    public_normalised = public.replace("Variant: <code>public</code>", "Variant: <code>X</code>")
+    instructor_normalised = instructor.replace(
+        "Variant: <code>instructor</code>", "Variant: <code>X</code>"
+    )
+    assert public_normalised == instructor_normalised
 
 
 def test_render_handles_no_configs_gracefully() -> None:
@@ -369,7 +390,6 @@ def test_run_preview_writes_html_and_copies_cover(tmp_path: Path) -> None:
     outcome = preview.run_preview(_make_config(fake_release, out_dir))  # type: ignore[arg-type]
     assert outcome.html_path == out_dir / "index.html"
     assert outcome.html_path.is_file()
-    assert outcome.cover_path is not None
     assert outcome.cover_path.is_file()
     assert not outcome.cover_path.is_symlink()
 
diff --git a/tests/scripts/test_preview_kaggle_page.py b/tests/scripts/test_preview_kaggle_page.py
index b654a5c..1750408 100644
--- a/tests/scripts/test_preview_kaggle_page.py
+++ b/tests/scripts/test_preview_kaggle_page.py
@@ -116,13 +116,21 @@ def test_render_includes_title_subtitle_id_and_license() -> None:
     assert "testorg/testset-lead-scoring" in html
     assert "License: MIT" in html
     assert "Updates: never" in html
-    assert "Visibility: Private" in html
 
 
-def test_render_includes_visibility_public_when_not_private() -> None:
-    metadata = {**_minimal_metadata(), "isPrivate": False}
-    html = preview.render_kaggle_html(metadata, "dataset-cover-image.png")
-    assert "Visibility: Public" in html
+def test_render_does_not_include_visibility_pill() -> None:
+    """Kaggle's public page does NOT display ``isPrivate``; rendering
+    a ``Visibility:`` pill in the preview would misrepresent what
+    public viewers see (folded back in self-review pass 3)."""
+
+    private_html = preview.render_kaggle_html(_minimal_metadata(), "dataset-cover-image.png")
+    public_html = preview.render_kaggle_html(
+        {**_minimal_metadata(), "isPrivate": False},
+        "dataset-cover-image.png",
+    )
+    for html in (private_html, public_html):
+        assert "Visibility:" not in html
+        assert "pill--visibility" not in html
 
 
 def test_render_file_tree_lists_every_resource_path() -> None:
@@ -371,7 +379,6 @@ def test_run_preview_writes_html_and_copies_cover(tmp_path: Path) -> None:
     )
     assert outcome.html_path == out_dir / "index.html"
     assert outcome.html_path.is_file()
-    assert outcome.cover_path is not None
     assert outcome.cover_path.is_file()
     assert not outcome.cover_path.is_symlink()
     # The HTML references the cover image by sibling-relative name.
@@ -414,3 +421,45 @@ def test_tier_of_extracts_leading_path_segment() -> None:
     assert preview._tier_of("intro/lead_scoring.csv") == "intro"
     assert preview._tier_of("intermediate/tasks/converted/train.parquet") == "intermediate"
     assert preview._tier_of("toplevel.json") == ""
+
+
+# ---------------------------------------------------------------------------
+# Server smoke test — covers _preview_common.make_server / serve glue
+# (folded back from self-review pass 3 — _serve was previously untested)
+# ---------------------------------------------------------------------------
+
+
+def test_make_server_binds_and_serves_index(tmp_path: Path) -> None:
+    """Stand the server up on port 0 (kernel-picked), GET ``/``,
+    assert 200 + body shape, shut down cleanly.
+
+    Covers every path inside ``_preview_common.make_server`` and
+    ``_make_handler_factory`` (handler subclass with ``directory=``,
+    ``ThreadingHTTPServer`` instantiation, address-reuse posture,
+    static-file serving).  ``serve`` itself is the blocking caller
+    that wraps this and is exercised manually.
+    """
+
+    import threading
+    import urllib.request
+
+    import _preview_common  # noqa: PLC0415 — local import for the smoke test
+
+    (tmp_path / "index.html").write_text(
+        "<html><body><h1>preview-smoke-token</h1></body></html>", encoding="utf-8"
+    )
+    httpd = _preview_common.make_server(tmp_path, port=0)
+    bound_port = httpd.server_address[1]
+    assert bound_port > 0
+    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
+    thread.start()
+    try:
+        with urllib.request.urlopen(f"http://localhost:{bound_port}/", timeout=5) as resp:  # noqa: S310 — localhost smoke
+            assert resp.status == 200
+            body = resp.read().decode("utf-8")
+        assert "preview-smoke-token" in body
+    finally:
+        httpd.shutdown()
+        httpd.server_close()
+        thread.join(timeout=5)
+    assert not thread.is_alive()

From b50d53db86ac55f4d1e4cff137618c4dc7f39dd9 Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Sun, 10 May 2026 12:00:13 +0300
Subject: [PATCH 6/6] PR 7.2: address Copilot review findings (3 threads)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

COPILOT-1 (resolve, accept) — _render_header / _render_footer /
_render_cover indexed required metadata keys directly; a malformed
dataset-metadata.json or HF README would have raised KeyError
mid-render and bypassed main()'s rc=2 translation. Added
_validate_required_metadata (Kaggle: title / subtitle / id /
expectedUpdateFrequency / image / licenses[0].name) and
_validate_required_frontmatter (HF: pretty_name / license, with
empty-string-counts-as-missing semantics) to run_preview before
the renderer is touched. Both report every missing key in one
ValueError, not just the first.
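
A minimal sketch of the report-every-missing-key pattern described
above, for illustration only. REQUIRED_KEYS, find_missing_keys and
validate are hypothetical names and the key list is shortened; the
real validators live in the two preview scripts.

```python
from typing import Any

# Hypothetical, shortened key list used only for this illustration.
REQUIRED_KEYS = ("title", "subtitle", "id")


def find_missing_keys(metadata: dict[str, Any], required: tuple[str, ...]) -> list[str]:
    """Return every required key that is absent, sorted for stable messages."""
    return sorted(k for k in required if k not in metadata)


def validate(metadata: dict[str, Any], source: str) -> None:
    """Raise one ValueError naming all missing keys, not just the first."""
    missing = find_missing_keys(metadata, REQUIRED_KEYS)
    if missing:
        raise ValueError(f"{source} is missing required key(s): {', '.join(missing)}")
```

Collecting the full list before raising means a maintainer fixing a
hand-edited file sees every gap in one run instead of one per retry.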

COPILOT-2 (resolve, accept with display) — the HF driver was
copying release/dataset-cover-image.png into the preview tree
without ever rendering it. Lifted _render_cover from the Kaggle
script into _preview_common.render_cover (byte-identical helper)
and wired it into render_hf_html via a new HF_COVER_IMAGE_FILENAME
constant + cover_image_filename kwarg. HF preview now displays the
cover under the header pills, mirroring Kaggle's posture and the
HF live page.
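
A rough sketch of the sibling-relative cover block, for illustration
only. It swaps in html.escape from the standard library for the
project's own escape helper, and render_cover_sketch is a hypothetical
name; the markup mirrors the committed samples in this patch.

```python
import html


def render_cover_sketch(filename: str) -> str:
    """Build a cover block whose src is just the escaped file name."""
    # Assumption: html.escape stands in for the project's escape() helper.
    src = html.escape(filename, quote=True)
    return (
        '<section class="cover">\n'
        f'  <img class="cover__image" src="{src}" alt="Dataset cover image">\n'
        "</section>"
    )


print(render_cover_sketch("dataset-cover-image.png"))
```

Because src carries no directory component, the same HTML resolves the
image whether it is served from the preview tree or opened next to a
copied file.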

COPILOT-3 (resolve, accept) — singular/plural fix. Instructor sample
previously rendered "(1 configs)" / "(1 splits)". Added
_preview_common.plural(n, singular, plural_form=None) and applied to
all count headings: Kaggle "(N files)", "(N columns)",
"(N tabular files)"; HF "(N configs)", "(N splits)". n=1 now reads
naturally; n=0/2/N still pluralises.
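
A short usage sketch of the helper, assuming the body shown in the
_preview_common.py hunk of this patch. The only change here is
Optional[str] in place of the union annotation so the snippet also
runs on older interpreters.

```python
from typing import Optional


def plural(n: int, singular: str, plural_form: Optional[str] = None) -> str:
    """Return "<n> <word>", using the singular word only when n == 1."""
    word = singular if n == 1 else (plural_form or singular + "s")
    return f"{n} {word}"


# Print the three heading counts the message above describes.
for count in (0, 1, 2):
    print(f"({plural(count, 'config')})")
```

This prints "(0 configs)", "(1 config)" and "(2 configs)". Passing
plural_form covers irregular nouns, which no heading needs today.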

Tests: 7 new (4 Kaggle, 3 HF + the plural unit test) covering each
finding's negative + positive paths. Existing
test_run_preview_raises_on_missing_cover_image updated to use a
well-formed metadata payload (the new validator runs first and
would otherwise short-circuit the cover assertion). Existing
"(2 columns across 1 tabular files)" assertion updated to the
singular form "(2 columns across 1 tabular file)".

Net: 1382/1382 tests pass + 5 publish-extra-gated skips; ruff +
ruff format + mypy clean (83 source files); leakage probes 0/3 on
every tier; hash determinism PASS 67/67; validate_release_candidate
--no-rebuild exits 0; BUNDLE_SCHEMA_VERSION unchanged at 5;
validation_report timestamp drift reverted before commit per the
brief. Committed Kaggle sample byte-identical (no n=1 case
triggered); HF public + instructor samples updated for the cover
image + plural fixes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .../huggingface_instructor.html               |  7 +-
 .../huggingface_public.html                   |  5 ++
 scripts/_preview_common.py                    | 33 ++++++++
 scripts/preview_hf_page.py                    | 60 +++++++++++--
 scripts/preview_kaggle_page.py                | 71 +++++++++++-----
 tests/scripts/test_preview_hf_page.py         | 68 +++++++++++++++
 tests/scripts/test_preview_kaggle_page.py     | 84 ++++++++++++++++++-
 7 files changed, 297 insertions(+), 31 deletions(-)

diff --git a/release/_preview_committed/huggingface_instructor.html b/release/_preview_committed/huggingface_instructor.html
index 756ecaa..0d81395 100644
--- a/release/_preview_committed/huggingface_instructor.html
+++ b/release/_preview_committed/huggingface_instructor.html
@@ -11,6 +11,8 @@
 .dataset-header__title { font-size: 1.8em; margin: 0 0 12px 0; }
 .dataset-header__pills { list-style: none; padding: 0; margin: 0; display: flex; flex-wrap: wrap; gap: 8px; }
 .pill { background: var(--pill-bg); border-radius: 12px; padding: 4px 12px; font-size: 0.85em; color: var(--fg); }
+.cover { margin: 0 0 24px 0; border: 1px solid var(--border); border-radius: 4px; overflow: hidden; }
+.cover__image { display: block; max-width: 100%; height: auto; }
 .tags { margin: 0 0 24px 0; }
 .chip { display: inline-block; background: var(--pill-bg); border-radius: 12px; padding: 2px 10px; margin: 2px 4px 2px 0; font-size: 0.85em; color: var(--fg); }
 .section__heading { font-size: 1.3em; border-bottom: 2px solid var(--accent); padding-bottom: 4px; margin-top: 32px; }
@@ -46,11 +48,14 @@ <h1 class="dataset-header__title">LeadForge: Synthetic B2B Lead Scoring (v1) —
     <li class="pill pill--language">Language: en</li>
   </ul>
 </header>
+<section class="cover">
+  <img class="cover__image" src="dataset-cover-image.png" alt="Dataset cover image">
+</section>
 <section class="tags">
   <span class="chip">b2b</span> <span class="chip">crm</span> <span class="chip">datasets</span> <span class="chip">lead-scoring</span> <span class="chip">pandas</span> <span class="chip">synthetic-data</span> <span class="chip">tabular</span>
 </section>
 <section class="configs">
-  <h2 class="section__heading">Configurations / Subsets <span class="section__count">(1 configs)</span></h2>
+  <h2 class="section__heading">Configurations / Subsets <span class="section__count">(1 config)</span></h2>
   <details class="config" open>
     <summary class="config__name"><code>intermediate</code> <span class="badge badge--default">default</span> <span class="config__count">(3 splits)</span></summary>
     <table class="config__table">
diff --git a/release/_preview_committed/huggingface_public.html b/release/_preview_committed/huggingface_public.html
index fd67f54..3f1df70 100644
--- a/release/_preview_committed/huggingface_public.html
+++ b/release/_preview_committed/huggingface_public.html
@@ -11,6 +11,8 @@
 .dataset-header__title { font-size: 1.8em; margin: 0 0 12px 0; }
 .dataset-header__pills { list-style: none; padding: 0; margin: 0; display: flex; flex-wrap: wrap; gap: 8px; }
 .pill { background: var(--pill-bg); border-radius: 12px; padding: 4px 12px; font-size: 0.85em; color: var(--fg); }
+.cover { margin: 0 0 24px 0; border: 1px solid var(--border); border-radius: 4px; overflow: hidden; }
+.cover__image { display: block; max-width: 100%; height: auto; }
 .tags { margin: 0 0 24px 0; }
 .chip { display: inline-block; background: var(--pill-bg); border-radius: 12px; padding: 2px 10px; margin: 2px 4px 2px 0; font-size: 0.85em; color: var(--fg); }
 .section__heading { font-size: 1.3em; border-bottom: 2px solid var(--accent); padding-bottom: 4px; margin-top: 32px; }
@@ -46,6 +48,9 @@ <h1 class="dataset-header__title">LeadForge: Synthetic B2B Lead Scoring (v1)</h1
     <li class="pill pill--language">Language: en</li>
   </ul>
 </header>
+<section class="cover">
+  <img class="cover__image" src="dataset-cover-image.png" alt="Dataset cover image">
+</section>
 <section class="tags">
   <span class="chip">b2b</span> <span class="chip">crm</span> <span class="chip">datasets</span> <span class="chip">lead-scoring</span> <span class="chip">pandas</span> <span class="chip">synthetic-data</span> <span class="chip">tabular</span>
 </section>
diff --git a/scripts/_preview_common.py b/scripts/_preview_common.py
index 06a2e6d..2338a1d 100644
--- a/scripts/_preview_common.py
+++ b/scripts/_preview_common.py
@@ -47,6 +47,39 @@ def escape(value: str) -> str:
     )
 
 
+def plural(n: int, singular: str, plural_form: str | None = None) -> str:
+    """Return ``f"{n} <word>"`` with ``<word>`` pluralised when ``n != 1``.
+
+    Default pluralisation is the trailing-``s`` rule; pass
+    ``plural_form`` for irregular cases (none today).  Used by the
+    preview-page section-heading counts so output reads as "1 config"
+    rather than "1 configs" — the latter was caught in PR review on
+    the instructor sample (Copilot finding COPILOT-3).
+    """
+
+    word = singular if n == 1 else (plural_form or singular + "s")
+    return f"{n} {word}"
+
+
+def render_cover(filename: str) -> str:
+    """Render a sibling-relative cover-image block.
+
+    Used by both preview scripts; the HF preview previously copied
+    the cover into the preview tree without ever rendering it
+    (Copilot finding COPILOT-2 — either drop the copy or display
+    it; we picked display for symmetry with Kaggle and because HF's
+    live page shows the dataset cover too).  Sibling-relative
+    ``src`` so the same HTML works for both the runtime preview
+    tree (where the image was copied in) and the committed sample
+    (which is byte-compared, not served).
+    """
+
+    src = escape(filename)
+    return f"""<section class="cover">
+  <img class="cover__image" src="{src}" alt="Dataset cover image">
+</section>"""
+
+
 def _make_handler_factory(directory: Path) -> type[http.server.SimpleHTTPRequestHandler]:
     """Build a handler subclass that serves from ``directory``.
 
diff --git a/scripts/preview_hf_page.py b/scripts/preview_hf_page.py
index 4da7669..91b5448 100644
--- a/scripts/preview_hf_page.py
+++ b/scripts/preview_hf_page.py
@@ -41,7 +41,12 @@
 # Make ``scripts/`` importable regardless of how this file is loaded.
 sys.path.insert(0, str(Path(__file__).resolve().parent))
 
-from _preview_common import escape, serve  # noqa: E402 — must follow sys.path insert
+from _preview_common import (  # noqa: E402 — must follow sys.path insert
+    escape,
+    plural,
+    render_cover,
+    serve,
+)
 from _release_common import replace_file  # noqa: E402
 
 # ---------------------------------------------------------------------------
@@ -180,7 +185,7 @@ def _render_configs(frontmatter: dict[str, Any]) -> str:
         blocks.append(
             f'  <details class="config" open>\n'
             f'    <summary class="config__name"><code>{config_name}</code>{default_badge} '
-            f'<span class="config__count">({len(data_files)} splits)</span>'
+            f'<span class="config__count">({plural(len(data_files), "split")})</span>'
             f"</summary>\n"
             f'    <table class="config__table">\n'
             f"      <thead><tr><th>Split</th><th>Path</th></tr></thead>\n"
@@ -189,7 +194,7 @@ def _render_configs(frontmatter: dict[str, Any]) -> str:
             f"  </details>"
         )
     return f"""<section class="configs">
-  <h2 class="section__heading">Configurations / Subsets <span class="section__count">({len(configs)} configs)</span></h2>
+  <h2 class="section__heading">Configurations / Subsets <span class="section__count">({plural(len(configs), "config")})</span></h2>
 {chr(10).join(blocks)}
 </section>"""
 
@@ -224,6 +229,8 @@ def _render_footer(frontmatter: dict[str, Any], variant: str) -> str:
 .dataset-header__title { font-size: 1.8em; margin: 0 0 12px 0; }
 .dataset-header__pills { list-style: none; padding: 0; margin: 0; display: flex; flex-wrap: wrap; gap: 8px; }
 .pill { background: var(--pill-bg); border-radius: 12px; padding: 4px 12px; font-size: 0.85em; color: var(--fg); }
+.cover { margin: 0 0 24px 0; border: 1px solid var(--border); border-radius: 4px; overflow: hidden; }
+.cover__image { display: block; max-width: 100%; height: auto; }
 .tags { margin: 0 0 24px 0; }
 .chip { display: inline-block; background: var(--pill-bg); border-radius: 12px; padding: 2px 10px; margin: 2px 4px 2px 0; font-size: 0.85em; color: var(--fg); }
 .section__heading { font-size: 1.3em; border-bottom: 2px solid var(--accent); padding-bottom: 4px; margin-top: 32px; }
@@ -270,15 +277,32 @@ def _wrap_html(*, title: str, body: str) -> str:
 # ---------------------------------------------------------------------------
 
 
-def render_hf_html(doc: HuggingFaceDoc, *, variant: str) -> str:
+#: Cover-image filename in the HF upload tree.  Pinned (not derived
+#: from the YAML — HF's dataset card doesn't reference the cover; the
+#: file lives at the root of the upload directory and is consumed by
+#: HF's UI, not the README body) so the preview's cover render is
+#: deterministic given just the parsed doc.
+HF_COVER_IMAGE_FILENAME: Final[str] = "dataset-cover-image.png"
+
+
+def render_hf_html(
+    doc: HuggingFaceDoc,
+    *,
+    variant: str,
+    cover_image_filename: str = HF_COVER_IMAGE_FILENAME,
+) -> str:
     """Render the full HF preview HTML.
 
-    Pure: same ``(doc, variant)`` → byte-identical HTML.  No I/O,
-    no clock, no random.
+    Pure: same ``(doc, variant, cover_image_filename)`` → byte-identical
+    HTML.  No I/O, no clock, no random.  The cover-image block was
+    added in self-review pass 4 (Copilot finding COPILOT-2 — the
+    driver was copying the cover into the preview tree without ever
+    rendering it).
     """
 
     body_parts = [
         _render_header(doc.frontmatter),
+        render_cover(cover_image_filename),
         _render_tags(doc.frontmatter),
         _render_configs(doc.frontmatter),
         _render_readme_body(doc.body),
@@ -319,6 +343,29 @@ class PreviewOutcome:
     cover_path: Path
 
 
+#: Required frontmatter keys the renderer indexes directly; validated
+#: up-front in ``run_preview`` so a malformed README surfaces as
+#: ``ValueError`` → CLI rc=2 rather than silently rendering empty
+#: pretty_name / license pills (Copilot finding COPILOT-1, applied
+#: symmetrically to the HF script).
+_REQUIRED_FRONTMATTER_KEYS: Final[tuple[str, ...]] = ("pretty_name", "license")
+
+
+def _validate_required_frontmatter(frontmatter: dict[str, Any], path: Path) -> None:
+    """Raise ``ValueError`` if required HF frontmatter keys are missing.
+
+    ``pretty_name`` and ``license`` are the two keys HF requires *and* the
+    two we display prominently; missing or empty values would render
+    a half-blank header that's easy to miss.
+    """
+
+    missing = sorted(
+        k for k in _REQUIRED_FRONTMATTER_KEYS if not str(frontmatter.get(k) or "").strip()
+    )
+    if missing:
+        raise ValueError(f"{path} frontmatter is missing required key(s): {', '.join(missing)}")
+
+
 def _resolve_cover_image(release_dir: Path, variant: str) -> Path:
     """Locate the cover image for the variant.
 
@@ -354,6 +401,7 @@ def run_preview(config: PreviewConfig) -> PreviewOutcome:
             f"regenerate via scripts/package_hf_release.py --variant={config.variant} first"
         )
     doc = parse_hf_readme(readme_path.read_text(encoding="utf-8"))
+    _validate_required_frontmatter(doc.frontmatter, readme_path)
 
     cover_src = _resolve_cover_image(config.release_dir, config.variant)
     if not cover_src.is_file():
diff --git a/scripts/preview_kaggle_page.py b/scripts/preview_kaggle_page.py
index a48c0cb..de5a61b 100644
--- a/scripts/preview_kaggle_page.py
+++ b/scripts/preview_kaggle_page.py
@@ -37,7 +37,12 @@
 # (CLI entrypoint, ``importlib.util.spec_from_file_location`` from tests).
 sys.path.insert(0, str(Path(__file__).resolve().parent))
 
-from _preview_common import escape, serve  # noqa: E402 — must follow sys.path insert
+from _preview_common import (  # noqa: E402 — must follow sys.path insert
+    escape,
+    plural,
+    render_cover,
+    serve,
+)
 from _release_common import replace_file  # noqa: E402
 
 # ---------------------------------------------------------------------------
@@ -117,20 +122,6 @@ def _render_header(metadata: dict[str, Any]) -> str:
 </header>"""
 
 
-def _render_cover(cover_image_filename: str) -> str:
-    """Render the cover-image block.
-
-    Sibling-relative ``src`` so the same HTML works against both the
-    runtime preview tree (where the image was copied in) and the
-    committed sample (which is byte-compared, not served).
-    """
-
-    src = escape(cover_image_filename)
-    return f"""<section class="cover">
-  <img class="cover__image" src="{src}" alt="Dataset cover image">
-</section>"""
-
-
 def _render_description(description_md: str) -> str:
     """Render the inlined README body as HTML."""
 
@@ -158,14 +149,13 @@ def _render_file_tree(resources: list[dict[str, Any]]) -> str:
         blocks.append(
             f'  <details class="tier" open>\n'
             f'    <summary class="tier__name">{tier_label}/ '
-            f'<span class="tier__count">({len(tier_resources)} files)</span>'
+            f'<span class="tier__count">({plural(len(tier_resources), "file")})</span>'
             f"</summary>\n"
             f'    <ul class="tier__files">\n' + "\n".join(items) + "\n    </ul>\n"
             "  </details>"
         )
-    file_count = len(resources)
     return f"""<section class="files">
-  <h2 class="section__heading">Data Files <span class="section__count">({file_count} total)</span></h2>
+  <h2 class="section__heading">Data Files <span class="section__count">({len(resources)} total)</span></h2>
 {chr(10).join(blocks)}
 </section>"""
 
@@ -199,7 +189,7 @@ def _render_schema_tables(resources: list[dict[str, Any]]) -> str:
         blocks.append(
             f'  <details class="schema" open>\n'
             f'    <summary class="schema__path"><code>{path}</code> '
-            f'<span class="schema__count">({len(fields)} columns)</span>'
+            f'<span class="schema__count">({plural(len(fields), "column")})</span>'
             f"</summary>\n"
             f'    <table class="schema__table">\n'
             f"      <thead><tr><th>Column</th><th>Type</th><th>Description</th></tr></thead>\n"
@@ -208,7 +198,7 @@ def _render_schema_tables(resources: list[dict[str, Any]]) -> str:
             "  </details>"
         )
     return f"""<section class="schemas">
-  <h2 class="section__heading">Schema / Columns <span class="section__count">({total_columns} columns across {len(blocks)} tabular files)</span></h2>
+  <h2 class="section__heading">Schema / Columns <span class="section__count">({plural(total_columns, "column")} across {plural(len(blocks), "tabular file")})</span></h2>
 {chr(10).join(blocks)}
 </section>"""
 
@@ -325,7 +315,7 @@ def render_kaggle_html(metadata: dict[str, Any], cover_image_filename: str) -> s
 
     body_parts = [
         _render_header(metadata),
-        _render_cover(cover_image_filename),
+        render_cover(cover_image_filename),
         _render_description(metadata.get("description", "")),
         _render_file_tree(metadata.get("resources", [])),
         _render_schema_tables(metadata.get("resources", [])),
@@ -363,6 +353,42 @@ class PreviewOutcome:
     cover_path: Path
 
 
+#: Required keys the renderer indexes directly (without ``.get``);
+#: validated up-front in ``run_preview`` so a malformed metadata file
+#: surfaces as ``ValueError`` → CLI rc=2 rather than a ``KeyError``
+#: traceback mid-render (Copilot finding COPILOT-1).
+_REQUIRED_METADATA_KEYS: Final[tuple[str, ...]] = (
+    "title",
+    "subtitle",
+    "id",
+    "expectedUpdateFrequency",
+    "image",
+)
+
+
+def _validate_required_metadata(metadata: dict[str, Any], path: Path) -> None:
+    """Raise ``ValueError`` if required Kaggle metadata keys are missing.
+
+    Catches the case where ``dataset-metadata.json`` is hand-edited or
+    produced by a future broken packager; the renderer's
+    ``_render_header`` / ``_render_footer`` index these directly and
+    would otherwise raise ``KeyError`` mid-render, bypassing
+    ``main()``'s rc=2 handling.
+    """
+
+    missing = sorted(k for k in _REQUIRED_METADATA_KEYS if k not in metadata)
+    licenses = metadata.get("licenses")
+    if (
+        not isinstance(licenses, list)
+        or not licenses
+        or not isinstance(licenses[0], dict)
+        or "name" not in licenses[0]
+    ):
+        missing.append("licenses[0].name")
+    if missing:
+        raise ValueError(f"{path} is missing required key(s): {', '.join(missing)}")
+
+
 def _resolve_cover_image(release_dir: Path, image_name: str) -> Path:
     """Locate the cover image referenced by the metadata's ``image``.
 
@@ -395,8 +421,9 @@ def run_preview(config: PreviewConfig) -> PreviewOutcome:
     metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
     if not isinstance(metadata, dict):
         raise ValueError(f"{metadata_path} is not a JSON object")
+    _validate_required_metadata(metadata, metadata_path)
 
-    cover_name = metadata.get("image", "")
+    cover_name = metadata["image"]
     if not cover_name:
         raise ValueError(f"{metadata_path} declares no 'image' (cover image filename)")
     cover_src = _resolve_cover_image(config.release_dir, cover_name)
diff --git a/tests/scripts/test_preview_hf_page.py b/tests/scripts/test_preview_hf_page.py
index 3db2d4b..7369e9b 100644
--- a/tests/scripts/test_preview_hf_page.py
+++ b/tests/scripts/test_preview_hf_page.py
@@ -162,6 +162,45 @@ def test_render_data_files_appear_under_each_config() -> None:
     assert "intermediate/train.parquet" in html
 
 
+def test_render_includes_cover_image_block() -> None:
+    """The HF preview must render the cover image (Copilot finding
+    COPILOT-2 — the driver was copying the cover into the preview
+    tree without ever rendering it; either drop the copy or display
+    it; we picked display for symmetry with Kaggle and because HF's
+    live page shows the dataset cover too)."""
+
+    html = preview.render_hf_html(_minimal_doc(), variant="public")
+    assert 'class="cover"' in html
+    assert 'src="dataset-cover-image.png"' in html
+    assert 'alt="Dataset cover image"' in html
+
+
+def test_render_configs_heading_uses_singular_for_one_config() -> None:
+    """Instructor sample previously rendered "(1 configs)" — plural()
+    helper now uses the singular form when n == 1 (Copilot finding
+    COPILOT-3)."""
+
+    one_config_doc = preview.HuggingFaceDoc(
+        frontmatter={
+            "pretty_name": "T",
+            "license": "mit",
+            "configs": [
+                {
+                    "config_name": "intermediate",
+                    "default": True,
+                    "data_files": [{"split": "train", "path": "intermediate/train.parquet"}],
+                },
+            ],
+        },
+        body="body\n",
+    )
+    html = preview.render_hf_html(one_config_doc, variant="instructor")
+    assert "(1 config)" in html  # heading
+    assert "(1 split)" in html  # per-config splits count
+    assert "(1 configs)" not in html
+    assert "(1 splits)" not in html
+
+
 def test_render_does_not_emit_files_declared_section() -> None:
     """Real HF doesn't surface a "files declared in YAML" section —
     showing one would be an internal-concept leak that omits the bulk
@@ -364,6 +403,35 @@ def test_run_preview_raises_on_malformed_readme(tmp_path: Path) -> None:
         preview.run_preview(config)  # type: ignore[arg-type]
 
 
+def test_run_preview_raises_on_missing_required_frontmatter_keys(tmp_path: Path) -> None:
+    """Pre-flight required-key check (Copilot finding COPILOT-1,
+    applied symmetrically to the HF script).  Missing pretty_name /
+    license would otherwise render a half-blank header."""
+
+    fake_release = tmp_path / "release"
+    (fake_release / "huggingface").mkdir(parents=True)
+    (fake_release / "huggingface" / "README.md").write_text(
+        "---\nlanguage:\n  - en\n---\nbody\n", encoding="utf-8"
+    )
+    config = _make_config(fake_release, tmp_path / "preview")
+    with pytest.raises(ValueError, match="missing required key") as exc_info:
+        preview.run_preview(config)  # type: ignore[arg-type]
+    msg = str(exc_info.value)
+    assert "pretty_name" in msg
+    assert "license" in msg
+
+
+def test_validate_required_frontmatter_treats_empty_string_as_missing(tmp_path: Path) -> None:
+    """Whitespace-only or empty values count as missing — a blank
+    pretty_name renders an empty <h1>, which is what the validator
+    is supposed to prevent."""
+
+    with pytest.raises(ValueError, match="missing required key"):
+        preview._validate_required_frontmatter(
+            {"pretty_name": "   ", "license": ""}, tmp_path / "any.md"
+        )
+
+
 def test_run_preview_raises_on_missing_cover(tmp_path: Path) -> None:
     fake_release = tmp_path / "release"
     (fake_release / "huggingface").mkdir(parents=True)
diff --git a/tests/scripts/test_preview_kaggle_page.py b/tests/scripts/test_preview_kaggle_page.py
index 1750408..e056a7b 100644
--- a/tests/scripts/test_preview_kaggle_page.py
+++ b/tests/scripts/test_preview_kaggle_page.py
@@ -151,7 +151,9 @@ def test_render_schema_table_lists_every_column() -> None:
     assert "Opaque id." in html
     assert "(2 columns)" in html  # per-table column count
     # Resources without a schema (manifest.json) do not appear in the table.
-    assert "(2 columns across 1 tabular files)" in html
+    # Note singular "tabular file" — the plural() helper kicks in only when
+    # n != 1 (Copilot finding COPILOT-3).
+    assert "(2 columns across 1 tabular file)" in html
 
 
 def test_render_keywords_appear_as_chips_in_footer() -> None:
@@ -339,11 +341,71 @@ def test_run_preview_raises_on_malformed_metadata(tmp_path: Path) -> None:
         preview.run_preview(config)
 
 
+def test_run_preview_raises_on_missing_required_metadata_keys(tmp_path: Path) -> None:
+    """Pre-flight required-key check (Copilot finding COPILOT-1).
+
+    The renderer's _render_header / _render_footer / _render_cover
+    index ``title`` / ``subtitle`` / ``id`` / ``image`` /
+    ``licenses[0].name`` / ``expectedUpdateFrequency`` directly; a
+    malformed metadata file would otherwise raise ``KeyError``
+    mid-render and bypass main()'s rc=2 translation.  The validator
+    surfaces every missing key in one message, not just the first.
+    """
+
+    fake_release = tmp_path / "release"
+    (fake_release / "kaggle").mkdir(parents=True)
+    # Drop several required keys at once.
+    (fake_release / "kaggle" / "dataset-metadata.json").write_text(
+        json.dumps(
+            {
+                "subtitle": "only the subtitle survives",
+                "licenses": [{"NOT_NAME": "MIT"}],  # malformed: no 'name' inside [0]
+                "image": "dataset-cover-image.png",
+            }
+        ),
+        encoding="utf-8",
+    )
+    config = preview.PreviewConfig(
+        release_dir=fake_release,
+        out_dir=tmp_path / "preview",
+        port=8765,
+        open_browser=False,
+        serve=False,
+    )
+    with pytest.raises(ValueError, match="missing required key") as exc_info:
+        preview.run_preview(config)
+    msg = str(exc_info.value)
+    # All four missing keys in one error: sorted top-level keys, then licenses[0].name.
+    assert "expectedUpdateFrequency" in msg
+    assert "id" in msg
+    assert "title" in msg
+    assert "licenses[0].name" in msg
+
+
+def test_validate_required_metadata_accepts_well_formed_payload(tmp_path: Path) -> None:
+    """Sanity gate: the validator does not over-fire on the canonical fixture."""
+
+    preview._validate_required_metadata(_minimal_metadata(), tmp_path / "any.json")
+
+
 def test_run_preview_raises_on_missing_cover_image(tmp_path: Path) -> None:
+    """A well-formed metadata payload that points at a missing cover
+    image surfaces FileNotFoundError, not a required-key ValueError.
+
+    The required-key validator (Copilot finding COPILOT-1) runs
+    BEFORE the cover-existence check, so the fixture must include
+    every required key for this assertion to test the cover-path
+    rather than the validator.
+    """
+
     fake_release = tmp_path / "release"
     (fake_release / "kaggle").mkdir(parents=True)
+    well_formed = {
+        **_minimal_metadata(),
+        "image": "missing.png",  # the file does not exist on disk
+    }
     (fake_release / "kaggle" / "dataset-metadata.json").write_text(
-        json.dumps({"image": "missing.png", "resources": []}), encoding="utf-8"
+        json.dumps(well_formed), encoding="utf-8"
     )
     config = preview.PreviewConfig(
         release_dir=fake_release,
@@ -429,6 +491,24 @@ def test_tier_of_extracts_leading_path_segment() -> None:
 # ---------------------------------------------------------------------------
 
 
+def test_plural_helper_handles_singular_zero_and_n() -> None:
+    """``_preview_common.plural`` is the one helper behind every count
+    heading in both preview scripts.  Pin n=1 → singular, n=0/2/N →
+    plural (Copilot finding COPILOT-3 — instructor sample previously
+    rendered "(1 configs)" because the plural was always ``+ 's'``)."""
+
+    import _preview_common  # noqa: PLC0415 — local import for the helper test
+
+    assert _preview_common.plural(1, "config") == "1 config"
+    assert _preview_common.plural(2, "config") == "2 configs"
+    assert _preview_common.plural(0, "config") == "0 configs"  # zero is plural in English
+    assert _preview_common.plural(1, "tabular file") == "1 tabular file"
+    assert _preview_common.plural(5, "tabular file") == "5 tabular files"
+    # Irregular plural form is supported via explicit override (none today).
+    assert _preview_common.plural(1, "child", "children") == "1 child"
+    assert _preview_common.plural(3, "child", "children") == "3 children"
+
+
 def test_make_server_binds_and_serves_index(tmp_path: Path) -> None:
     """Stand the server up on port 0 (kernel-picked), GET ``/``,
     assert 200 + body shape, shut down cleanly.
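
The assertions in `test_plural_helper_handles_singular_zero_and_n` pin the helper's observable behaviour closely enough to sketch it. The body of `_preview_common.plural` is not part of this hunk, so the version below is an assumption that merely satisfies those assertions; the parameter name `plural_form` is hypothetical.

```python
from typing import Optional


def plural(n: int, singular: str, plural_form: Optional[str] = None) -> str:
    """Return "<n> <noun>", using the singular noun only when n == 1.

    Sketch only: the real helper's signature and body may differ.
    """
    if n == 1:
        return f"{n} {singular}"
    # Zero and every n >= 2 take the plural; default to a regular "+s"
    # plural unless the caller supplies an irregular form.
    noun = plural_form if plural_form is not None else singular + "s"
    return f"{n} {noun}"
```

Keeping the count and the noun in one return value means a heading such as `f"({plural(len(configs), 'config')})"` cannot drift back to "(1 configs)".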

| Split | Path |
| --- | --- |
| train | `intermediate/tasks/converted_within_90_days/train.parquet` |
| validation | `intermediate/tasks/converted_within_90_days/valid.parquet` |
| test | `intermediate/tasks/converted_within_90_days/test.parquet` |

| File | Public `intermediate` | Instructor companion |
| --- | --- | --- |
| `tables/leads.parquet` | redacted (label dropped) | full (label retained) |
| `tables/opportunities.parquet` | snapshot-filtered + redacted | full-horizon, full columns |
| `tables/customers.parquet` | omitted (would leak label) | included |
| `tables/subscriptions.parquet` | omitted (would leak label) | included |
| `tables/touches.parquet` etc. | filtered to ≤ snapshot day | full 90-day horizon |
| `metadata/world_spec.json` | absent | included (DGP + recipe) |
| `metadata/graph.{graphml,json}` | absent | included (hidden DAG) |
| `metadata/latent_registry.json` | absent | included (latent traits) |
| `metadata/mechanism_summary.json` | absent | included (per-edge mechanisms) |

| Field | Value |
| --- | --- |
| Generator | leadforge `1.0.0+` |
| Recipe | `b2b_saas_procurement_v1` |
| Canonical seed | 42 |
| Bundle schema version | 5 |
| Format | Parquet (canonical) |
| License | MIT (see LICENSE) |
| Public dataset | link |

| Split | Path |
| --- | --- |
| train | `intro/tasks/converted_within_90_days/train.parquet` |
| validation | `intro/tasks/converted_within_90_days/valid.parquet` |
| test | `intro/tasks/converted_within_90_days/test.parquet` |

| Split | Path |
| --- | --- |
| train | `advanced/tasks/converted_within_90_days/train.parquet` |
| validation | `advanced/tasks/converted_within_90_days/valid.parquet` |
| test | `advanced/tasks/converted_within_90_days/test.parquet` |

|  | Intro | Intermediate | Advanced |
| --- | --- | --- | --- |
| Leads | 5,000 | 5,000 | 5,000 |
| Accounts | 1,500 | 1,500 | 1,500 |
| Contacts | 4,200 | 4,200 | 4,200 |
| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |
| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |
| Conversion rate (acceptance band, gate G7.*) | 24–61% | 12–31% | 4–12% |
| Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |
| Signal strength | 0.90 | 0.70 | 0.50 |
| Noise scale | 0.10 | 0.30 | 0.55 |
| Missing rate | 2% | 8% | 18% |

| Source-of-truth constant | Public bundle treatment |
| --- | --- |
| `BANNED_LEAD_COLUMNS = ("converted_within_90_days", "conversion_timestamp")` | Dropped from `tables/leads.parquet` |
| `BANNED_OPP_COLUMNS = ("close_outcome", "closed_at")` | Dropped from `tables/opportunities.parquet` |
| `BANNED_TABLES = ("customers", "subscriptions")` | Omitted from public bundles |
| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |
| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |
| `total_touches_all` (deliberate trap) | Retained in both modes; flagged `leakage_risk=True` |

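
The first three rows of the table above can be sketched as a pure function over in-memory tables. The three constants are quoted from the table; the function name `redact_public_bundle`, the row-as-dict representation, and the mapping from table name to banned columns are assumptions made for illustration, not the generator's actual code. Snapshot filtering and snapshot redaction are deliberately left out.

```python
from typing import Any, Dict, List

# Constants as quoted in the table above.
BANNED_LEAD_COLUMNS = ("converted_within_90_days", "conversion_timestamp")
BANNED_OPP_COLUMNS = ("close_outcome", "closed_at")
BANNED_TABLES = ("customers", "subscriptions")

Row = Dict[str, Any]


def redact_public_bundle(tables: Dict[str, List[Row]]) -> Dict[str, List[Row]]:
    """Hypothetical helper: drop banned tables and banned columns.

    Returns new row dicts and leaves the input untouched.
    """
    # Assumed mapping from table name to the columns that must not ship.
    banned_by_table = {
        "leads": BANNED_LEAD_COLUMNS,
        "opportunities": BANNED_OPP_COLUMNS,
    }
    public: Dict[str, List[Row]] = {}
    for name, rows in tables.items():
        if name in BANNED_TABLES:
            continue  # whole table omitted from public bundles
        banned = banned_by_table.get(name, ())
        public[name] = [
            {col: val for col, val in row.items() if col not in banned}
            for row in rows
        ]
    return public
```

Note that `total_touches_all` passes through unchanged under this sketch, which matches the table: the trap column is retained and only flagged.
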
| Tier | LR AUC | AP | P@100 | Brier |
| --- | --- | --- | --- | --- |
| intro | 0.879 | 0.761 | 0.80 | 0.130 |
| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |
| advanced | 0.886 | 0.351 | 0.34 | 0.061 |

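
The table does not define its columns, so the sketch below assumes the usual readings: P@100 as the fraction of positives among the 100 highest-scored rows, and Brier as the mean squared difference between predicted probability and the 0/1 label. Both function names are hypothetical, and this is not the code that produced the numbers above.

```python
from typing import Sequence


def precision_at_k(y_true: Sequence[int], scores: Sequence[float], k: int = 100) -> float:
    """Fraction of positive labels among the k highest-scored rows."""
    if k <= 0 or not y_true:
        raise ValueError("k must be positive and y_true non-empty")
    # Rank rows by score, highest first; ties keep their input order.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def brier_score(y_true: Sequence[int], probs: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 labels."""
    if not y_true:
        raise ValueError("y_true must be non-empty")
    return sum((p - y) ** 2 for p, y in zip(probs, y_true)) / len(y_true)
```

Under these definitions a lower Brier score on a tier with a lower conversion rate is expected even for an uninformative model, so the Brier column is best compared within a tier rather than across tiers.
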
| Column | Type | Description |
| --- | --- | --- |
| `split` | string | Task-split membership: one of `train`, `valid`, `test`. Matches the per-row split assignment in `tasks/converted_within_90_days/`. |
| `account_id` | string | Opaque account identifier. |
| `industry` | string | Industry vertical of the buying organization. |
| `region` | string | Geographic region of the account's headquarters. |
| `employee_band` | string | Banded employee headcount of the account. |
| `estimated_revenue_band` | string | Banded estimated annual revenue of the account. |
| `process_maturity_band` | string | Banded internal process maturity score (latent). |
| `contact_id` | string | Opaque contact identifier. |
| `role_function` | string | Functional area of the primary contact (e.g. finance, ops). |
| `seniority` | string | Seniority band of the primary contact. |
| `buyer_role` | string | Buyer role classification (economic_buyer, champion, etc.). |
| `lead_id` | string | Opaque lead identifier. |
| `lead_created_at` | string | ISO-8601 timestamp when the lead was created. |
| `lead_source` | string | Origination source of the lead (e.g. inbound_form, sdr_outbound). |
| `first_touch_channel` | string | Marketing channel responsible for the first recorded touch. |
| `touch_count` | integer | Total number of marketing/sales touches recorded before snapshot. |
| `inbound_touch_count` | integer | Number of inbound touches before snapshot. |
| `outbound_touch_count` | integer | Number of outbound touches before snapshot. |
| `session_count` | integer | Number of web/trial sessions recorded before snapshot. |
| `pricing_page_views` | integer | Cumulative pricing page views across all sessions before snapshot. |
| `demo_page_views` | integer | Cumulative demo page views across all sessions before snapshot. |
| `total_session_duration_seconds` | integer | Sum of session durations (seconds) before snapshot. |
| `touches_week_1` | integer | Number of touches in the first 7 days after lead creation. |
| `touches_last_7_days` | integer | Number of touches in the last 7 days before snapshot cutoff. |
| `days_since_first_touch` | number | Days between first touch and snapshot cutoff (NaN if no touches). |
| `activity_count` | integer | Number of sales activities logged before snapshot. |
| `days_since_last_touch` | number | Days elapsed between most recent touch and snapshot cutoff. |
| `opportunity_created` | boolean | Whether any opportunity was created by snapshot date (open or closed). |
| `has_open_opportunity` | boolean | Whether an open opportunity existed at snapshot date. |
| `opportunity_estimated_acv` | number | Estimated ACV of the most recent open opportunity (NaN if none). |
| `expected_acv` | number | Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available). |
| `total_touches_all` | integer | Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only. |
| `converted_within_90_days` | boolean | Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events. |

| Column | Type | Description |
| --- | --- | --- |
| `account_id` | string |  |
| `company_name` | string |  |
| `industry` | string |  |
| `region` | string |  |
| `employee_band` | string |  |
| `estimated_revenue_band` | string |  |
| `process_maturity_band` | string |  |
| `created_at` | string |  |