From 15413f89c14a57f2a4a7a930f4f57cc71a603cbe Mon Sep 17 00:00:00 2001
From: Shay Palachy
Date: Sat, 9 May 2026 08:40:02 +0300
Subject: [PATCH 1/6] PR 7.2 scaffold: design doc + markdown-it-py [publish] dep + .gitignore for preview output
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- docs/release/preview_pages_design.md — 10-decision table covering
  scripts shape, server (stdlib http.server), templates (f-strings),
  Markdown renderer (markdown-it-py via [publish] extra), output dirs,
  CLI shape, audit-artefact-sync, test posture, link-resolution rule,
  out-of-scope.
- pyproject.toml [publish] gains markdown-it-py>=3.0 alongside
  datasets / kaggle. Same gating posture as PR 5.1 / 5.2 — preview
  scripts raise a clean ImportError pointing at this extra when
  missing.
- .gitignore: release/_preview/ runtime output excluded; the audit-sync
  samples under release/_preview_committed/ are checked in separately.

Co-Authored-By: Claude Opus 4.7
---
 .gitignore                           |  6 +++
 docs/release/preview_pages_design.md | 59 ++++++++++++++++++++++++++++
 pyproject.toml                       |  6 ++-
 3 files changed, 70 insertions(+), 1 deletion(-)
 create mode 100644 docs/release/preview_pages_design.md

diff --git a/.gitignore b/.gitignore
index a991601..e9893bd 100644
--- a/.gitignore
+++ b/.gitignore
@@ -233,3 +233,9 @@ release/huggingface/*
 !release/huggingface/README.md
 release/huggingface-instructor/*
 !release/huggingface-instructor/README.md
+
+# Generated local preview-page output (PR 7.2) — runtime HTML rendered
+# by scripts/preview_{kaggle,hf}_page.py. The committed sample HTML
+# under release/_preview_committed/ is the audit-artefact-sync gate
+# and is checked into git separately.
+release/_preview/
diff --git a/docs/release/preview_pages_design.md b/docs/release/preview_pages_design.md
new file mode 100644
index 0000000..0742f59
--- /dev/null
+++ b/docs/release/preview_pages_design.md
@@ -0,0 +1,59 @@
+# PR 7.2 — Local Kaggle / HF preview-page design notes
+
+Working notes for `scripts/preview_kaggle_page.py`,
+`scripts/preview_hf_page.py`, their tests, and the committed
+sample-rendered HTML used as the audit-artefact-sync gate. Captured
+before implementation; kept short on purpose.
+
+The PR's pedagogical role is the *staging gate* before PR 7.3: the
+maintainer renders both platforms locally from the same artefacts the
+publish PR will upload, clicks through them in a browser, and catches
+styling / link / YAML-rendering issues before they hit cached
+previews on the live page.
+
+## Decisions
+
+| # | Decision | Why |
+|---|---|---|
+| 1 | Two scripts, one per platform. Not a unified renderer. | Kaggle and HF have different inputs (`dataset-metadata.json` vs YAML-frontmatter README) and different page structures (schema/columns table vs configs dropdown). One file per platform keeps each renderer locally complete and the diff readable. |
+| 2 | Server: stdlib `http.server.ThreadingHTTPServer` + `webbrowser.open()`. No Flask. | The pages are static HTML over a fixed file tree. A web framework would be a new dep with no benefit; the brief explicitly suggests stdlib. |
+| 3 | Templates: f-string helpers, not Jinja2. | The layout is stable; two pages don't justify a templating engine. f-string helpers keep the renderer in one file and free of a new dep. |
+| 4 | Markdown→HTML via `markdown-it-py` (added to `[publish]` extra alongside `datasets` / `kaggle`). | Faithfulness is the goal — Kaggle and HF both render the README body as Markdown; hand-rolling a renderer for tables / fenced code / footnotes is brittle. `markdown-it-py` is MIT, pure-Python, CommonMark+GFM. The `[publish]` extra is the right home: this is a publish-pipeline tool and mirrors the PR 5.1 / 5.2 gating posture. Missing dep raises a clean `ImportError` that points at `pip install -e ".[publish]"`. |
+| 5 | Output dir: `release/_preview//` (gitignored). | Mirrors `release/_release_quality/` convention. The committed audit-sync samples live at `release/_preview_committed/{kaggle,huggingface_public,huggingface_instructor}.html` so they don't collide with runtime output. |
+| 6 | Cover image served from the preview tree (copied in, not referenced). | Both platforms inline-display the cover image; serving it under the preview root means the rendered HTML's `` works without absolute paths. The committed sample HTML uses the same relative reference — no path drift between the sample and what the local server emits. |
+| 7 | HF `--variant=public|instructor` reads either `release/huggingface/README.md` or `release/huggingface-instructor/README.md`. Different YAML, different file tree, different name. Kaggle has no instructor variant (Kaggle ships public only). | Matches the publish reality (HF gets a separate instructor companion repo per PR 5.2; Kaggle does not). |
+| 8 | CLI mirrors `validate_release_candidate.py` / `run_llm_critique.py`: free-function `parse_args`, frozen `Config`, `run_preview(config) -> Outcome`, `main(argv) -> int`. Exit codes 0 success / 2 pre-flight error. Flags: `--release-dir`, `--port` (8765 Kaggle / 8766 HF), `--out-dir`, `--variant` (HF only), `--open-browser`, `--no-serve`. | Maintainer muscle memory + small surface. `--no-serve` is the CI / inspection mode (build HTML, exit 0). `--open-browser` pops a tab on startup. |
+| 9 | Audit-artefact-sync. The renderer is pure: `(metadata.json | README + YAML, cover image filename) -> HTML`. No `now()`, no random. Committed HTML at `release/_preview_committed/*.html` must equal a fresh regeneration byte-for-byte. Same pattern as PR 4.1 / 5.1 / 5.2 / 7.1. | Determinism is the gate against silent drift. The committed HTML doubles as a human-inspectable sample for reviewers who don't want to run the script. |
+| 10 | Test posture: in-process. No live HTTP. Each test renders the page once via `render_kaggle_html()` / `render_hf_html()` and asserts against the rendered string with substring + regex. No BeautifulSoup dep (avoidable for the assertion bar we need). The four roadmap-mandated checks: required field labels appear; every Markdown link in the source resolves to a non-404 URL pattern; every config block (HF) round-trips; the Kaggle schema table lists every CSV / parquet column from `resources[].schema.fields`. | Per the brief — no live HTTP, no new test deps unless necessary. Substring assertions on deterministic rendered HTML give the same coverage with less surface. |
+
+## Link-resolution rule (test pin)
+
+Every Markdown link `](URL)` in the README body the renderer ingests
+must satisfy ONE of:
+
+1. Absolute `https://github.com/leadforge-dev/leadforge/...` URL (the
+   rewrite output of `_release_common.py::rewrite_release_links()`).
+2. External absolute URL on a known-OK domain (`https://huggingface.co`,
+   `https://github.com/leadforge-dev/leadforge`, footnote anchors).
+3. Relative path that resolves to a file under the upload tree
+   (e.g. `LICENSE` → `release//LICENSE`).
+
+A `](../foo)` link or a `](validation/...)` link in the rendered
+HTML is a regression — those are exactly what the platform packagers'
+rewrite is supposed to canonicalise away. The test fires loud the
+moment the rewrite stops doing its job for the upstream artefact the
+preview renders.
+
+## What this PR does not touch
+
+- `BUNDLE_SCHEMA_VERSION` stays at 5.
+- `release/validation/validation_report.{json,md}` does not regenerate
+  (revert any timestamp drift before commit).
+- PR 7.3 (publish + tag) is a separate PR; the runbook there will cite
+  the two preview commands as a required pre-flight step.
+- No change to the platform packagers (`scripts/package_{kaggle,hf}_release.py`)
+  or `_release_common.py`. The preview reads what the packagers wrote.
+- Live Kaggle / HF API calls — pure local rendering only.
+- Pixel-perfect cloning of the live pages. The bar is "a maintainer
+  clicking through it would notice the same broken link, malformed
+  YAML, or missing config that they'd notice on the live page".
diff --git a/pyproject.toml b/pyproject.toml
index 79c250c..cd31b4f 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -49,10 +49,14 @@ scripts = [
 # this extra (``pip install -e ".[publish]"``) enables the gated
 # ``load_dataset()`` / Kaggle-CLI smoke tests that verify G11.3 (Kaggle
 # package) and G12.3 / G12.4 (HF load_dataset round-trip) without
-# pulling the heavy SDKs into the default dev install.
+# pulling the heavy SDKs into the default dev install. PR 7.2 adds
+# ``markdown-it-py`` for the local Kaggle / HF preview pages
+# (``scripts/preview_{kaggle,hf}_page.py``) — same publish-extra
+# posture, missing import raises a clean error pointing at this extra.
 publish = [
     "datasets>=2.14",
     "kaggle>=1.6",
+    "markdown-it-py>=3.0",
 ]
 # Optional dependencies for executing the public release notebooks.
 # Installing this extra (``pip install -e ".[notebooks]"``) enables the

From 81b49ad0afcefa82c2215057341bec4402e6ea95 Mon Sep 17 00:00:00 2001
From: Shay Palachy
Date: Sat, 9 May 2026 09:53:10 +0300
Subject: [PATCH 2/6] PR 7.2: local Kaggle / HF preview pages + tests + committed sample HTML
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two scripts + tests + their first rendered-page output committed:

- scripts/preview_kaggle_page.py — reads
  release/kaggle/dataset-metadata.json + cover image, renders an
  offline HTML page mocking the public Kaggle view (header / cover /
  description / file tree / schema tables / sources / footer). Serves
  on localhost:8765 via stdlib ThreadingTCPServer; --no-serve /
  --open-browser flags.
- scripts/preview_hf_page.py — reads
  release/huggingface[-instructor]/README.md (YAML frontmatter + body),
  renders the analogous HF view (header pills / tag chips / configs
  dropdown / file tree / README body / footer). Serves on
  localhost:8766; --variant=public|instructor reads the matching
  companion README and writes to a variant-flavoured out_dir.

Both renderers are pure: same input → byte-identical HTML (verified
two-pass against the real release artefacts). Output landing at
release/_preview// is gitignored; committed sample HTML at
release/_preview_committed/{kaggle,huggingface_public,huggingface_instructor}.html
is the audit-artefact-sync gate.

Markdown rendering via markdown-it-py (gfm-like preset, linkify
disabled to avoid the linkify-it-py transitive dep). Missing dep
raises a clean ImportError pointing at pip install -e '.[publish]'.

pyproject.toml ruff per-file-ignores adds E501 for the two scripts —
inlined CSS strings in f-string templates are the product, not source
code that benefits from a 100c wrap.

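For reviewers who want to see the shape of the byte-identical claim above without opening the scripts, here is a minimal sketch of a two-pass determinism check plus an audit-sync comparison. It is an illustration under stated assumptions, not the code in this PR: `render_page`, `digest`, and `matches_committed` are hypothetical names, and the toy renderer stands in for `render_kaggle_html()` / `render_hf_html()`.

```python
import hashlib
import tempfile
from pathlib import Path


def render_page(title: str, body_html: str) -> str:
    """Hypothetical pure renderer: no clock, no randomness, no globals."""
    return (
        "<!doctype html>\n"
        f"<html><head><title>{title}</title></head>\n"
        f"<body>{body_html}</body></html>\n"
    )


def digest(html: str) -> str:
    """SHA-256 of the UTF-8 bytes, handy for readable failure messages."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def matches_committed(fresh_html: str, committed_path: Path) -> bool:
    """Audit-sync check: the committed file must equal a fresh render byte for byte."""
    return committed_path.read_bytes() == fresh_html.encode("utf-8")


# Two-pass determinism: render twice from the same inputs and compare.
first = render_page("Kaggle preview", "<p>hello</p>")
second = render_page("Kaggle preview", "<p>hello</p>")
two_pass_ok = digest(first) == digest(second)

# Audit-sync against a stand-in for the committed sample (a temp file here).
with tempfile.TemporaryDirectory() as tmp:
    committed = Path(tmp) / "kaggle.html"
    committed.write_bytes(first.encode("utf-8"))
    audit_sync_ok = matches_committed(second, committed)
    # Any extra byte in the fresh render counts as drift.
    drift_detected = not matches_committed(second + " ", committed)
```

Comparing raw bytes rather than decoded text keeps newline translation and encoding differences from hiding drift.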
48 new tests (no live HTTP, no network):

- required field labels (title / subtitle / licence / file count /
  schema column count for Kaggle; pretty_name / licence / configs /
  tags for HF)
- every Markdown link in the source resolves to a non-404 URL pattern
  (no ](../, no ](validation/, only allow-listed external prefixes +
  sibling-relative LICENSE + in-document anchors)
- every configs[] block in the HF YAML round-trips into the rendered
  dropdown
- every CSV / parquet column declared in the Kaggle metadata appears
  in the schema table
- byte-deterministic renderer + audit-sync against the committed
  sample
- pre-flight error paths (missing artefact, malformed JSON / YAML,
  unknown variant) return rc=2

Co-Authored-By: Claude Opus 4.7
---
 pyproject.toml                                |    5 +
 .../huggingface_instructor.html               |  284 ++++
 .../huggingface_public.html                   |  480 ++++++
 release/_preview_committed/kaggle.html        | 1303 +++++++++++++++++
 scripts/preview_hf_page.py                    |  572 ++++++++
 scripts/preview_kaggle_page.py                |  608 ++++++++
 tests/scripts/test_preview_hf_page.py         |  444 ++++++
 tests/scripts/test_preview_kaggle_page.py     |  416 ++++++
 8 files changed, 4112 insertions(+)
 create mode 100644 release/_preview_committed/huggingface_instructor.html
 create mode 100644 release/_preview_committed/huggingface_public.html
 create mode 100644 release/_preview_committed/kaggle.html
 create mode 100644 scripts/preview_hf_page.py
 create mode 100644 scripts/preview_kaggle_page.py
 create mode 100644 tests/scripts/test_preview_hf_page.py
 create mode 100644 tests/scripts/test_preview_kaggle_page.py

diff --git a/pyproject.toml b/pyproject.toml
index cd31b4f..90a742b 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -107,6 +107,11 @@ select = ["E", "F", "I", "N", "W", "UP", "B", "C4", "PT", "S"]
 # Line length is a property of the rendered cell, not the .py source,
 # so 100c is the wrong yardstick here.
 "scripts/build_release_notebook_*.py" = ["E501"]
+# Preview-page scripts (PR 7.2) carry inlined CSS + multi-attribute
+# HTML strings inside f-string templates; the rendered HTML is the
+# product, so wrapping the source CSS at 100c is line noise.
+"scripts/preview_kaggle_page.py" = ["E501"]
+"scripts/preview_hf_page.py" = ["E501"]

 [tool.mypy]
 python_version = "3.11"
diff --git a/release/_preview_committed/huggingface_instructor.html b/release/_preview_committed/huggingface_instructor.html
new file mode 100644
index 0000000..6296a4b
--- /dev/null
+++ b/release/_preview_committed/huggingface_instructor.html
@@ -0,0 +1,284 @@
+
+
+
+
+ HF preview — LeadForge: Synthetic B2B Lead Scoring (v1) — Instructor companion
+
+
+
+
+
+
huggingface.co/datasets
+

LeadForge: Synthetic B2B Lead Scoring (v1) — Instructor companion

+
    +
  • License: mit
  • +
  • Task: tabular-classification
  • +
  • Size: 1K<n<10K
  • +
  • Language: en
  • +
+
+
+ b2b crm datasets lead-scoring pandas synthetic-data tabular +
+
+

Configurations / Subsets (1 configs)

+
+ intermediate default (3 splits) + + + + + + + +
| Split | Path |
|---|---|
| train | intermediate/tasks/converted_within_90_days/train.parquet |
| validation | intermediate/tasks/converted_within_90_days/valid.parquet |
| test | intermediate/tasks/converted_within_90_days/test.parquet |
+
+
+
+

Files declared in YAML (3 files / variant: instructor)

+
    +
  • [intermediate] intermediate/tasks/converted_within_90_days/train.parquet
  • +
  • [intermediate] intermediate/tasks/converted_within_90_days/valid.parquet
  • +
  • [intermediate] intermediate/tasks/converted_within_90_days/test.parquet
  • +
+
+
+

LeadForge: Synthetic B2B Lead Scoring (v1) — Instructor companion

+

This is the research / instructor companion to the public +leadforge/leadforge-lead-scoring-v1 +dataset. It exposes the full-horizon view of a single difficulty +tier (intermediate) plus the hidden causal structure that the +public dataset deliberately redacts: the world graph (DAG), latent +trait registry, mechanism summary, and full-horizon relational tables +including customers and subscriptions.

+

It exists for instructors who want to walk students through how the +public dataset was generated, and for researchers who want to verify +that the public redactions actually remove the leakage paths the +dataset advertises. It is not a replacement for the public dataset +in any teaching or modelling context — students should still train +on the public bundle.

+

What this companion contains

+
.
+├── intermediate/                     # research_instructor companion: full-horizon
+│   ├── manifest.json                 # provenance + file hashes
+│   ├── dataset_card.md               # auto-rendered per-bundle card
+│   ├── feature_dictionary.csv        # authoritative column spec
+│   ├── tables/*.parquet              # full-horizon tables (incl. customers, subscriptions)
+│   ├── tasks/converted_within_90_days/{train,valid,test}.parquet
+│   └── metadata/                     # world_spec, graph.{graphml,json}, latent_registry, etc.
+├── README.md                         # this file (HF dataset card)
+├── dataset-cover-image.png           # dataset thumbnail
+└── LICENSE
+
+

The single intermediate config exposes the same train/valid/test +parquet splits as the public dataset's intermediate config — same +seeds, same row counts (3,500 / 750 / 750), same target. The +difference lives in the relational tables and metadata:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| File | Public intermediate | Instructor companion |
|---|---|---|
| tables/leads.parquet | redacted (label dropped) | full (label retained) |
| tables/opportunities.parquet | snapshot-filtered + redacted | full-horizon, full columns |
| tables/customers.parquet | omitted (would leak label) | included |
| tables/subscriptions.parquet | omitted (would leak label) | included |
| tables/touches.parquet etc. | filtered to ≤ snapshot day | full 90-day horizon |
| metadata/world_spec.json | absent | included (DGP + recipe) |
| metadata/graph.{graphml,json} | absent | included (hidden DAG) |
| metadata/latent_registry.json | absent | included (latent traits) |
| metadata/mechanism_summary.json | absent | included (per-edge mechanisms) |
+

The redaction contract is single-sourced in +leadforge/validation/leakage_probes.py +and re-applied by +leadforge/render/relational_snapshot_safe.py +when the public bundle is built; this companion is the unfiltered +source view, so the two are always consistent by construction.

+

Quick start

+
from datasets import load_dataset
+
+# Loads the same train/valid/test splits as the public 'intermediate'
+# config; differs only in what `tables/` and `metadata/` provide.
+ds = load_dataset(
+    "leadforge/leadforge-lead-scoring-v1-instructor",
+    name="intermediate",
+)
+train = ds["train"].to_pandas()
+
+# Full-horizon relational tables — includes customers and subscriptions
+# (omitted from the public dataset because their existence reconstructs
+# the conversion label).
+import pandas as pd
+customers = pd.read_parquet(
+    "hf://datasets/leadforge/leadforge-lead-scoring-v1-instructor/intermediate/tables/customers.parquet"
+)
+
+

Intended uses

+
    +
  • Teaching the public-vs-instructor split itself: load both +datasets side-by-side, show students which columns and tables were +redacted, and walk through why each was a leakage path.
  • +
  • Verifying the redaction contract: train a model on the +full-horizon tables, train another on the snapshot-safe public +tables, compare AUC. The gap is the redaction's effect.
  • +
  • Teaching causal structure and DGP transparency using +metadata/world_spec.json + metadata/graph.json.
  • +
  • Reproducing the public dataset from the instructor view via +leadforge source code.
  • +
+

Out-of-scope uses

+
    +
  • Production lead scoring. Same as the public dataset; the +company, product, and customers are fictional.
  • +
  • Modelling with the unredacted view as a baseline. Models +trained against the full-horizon tables look strong because they're +directly seeing post-conversion events. That number is not a +baseline; it's the ceiling.
  • +
  • Demographic / fairness research. v1 does not model protected +attributes.
  • +
+

Composition

+
    +
  • Entities. 9 relational tables (accounts, contacts, leads, +touches, sessions, sales_activities, opportunities, customers, +subscriptions); per-row counts in manifest.json.
  • +
  • Splits. Identical to the public intermediate config: 70/15/15 +train/valid/test, deterministic given seed 42, recorded in +tasks/converted_within_90_days/task_manifest.json.
  • +
  • Provenance. Recipe b2b_saas_procurement_v1, seed 42, package +version stamped in manifest.json along with SHA-256 hashes for +every parquet file.
  • +
  • Bundle schema version. 5 (matches the public dataset).
  • +
+

Maintenance, license

+

We want the dataset to be broken. See the +public dataset card +for the adversarial-framing pointers, the issue templates, and the +break-me guide. File issues at +leadforge-dev/leadforge; +PRs welcome.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Value |
|---|---|
| Generator | leadforge 1.0.0+ |
| Recipe | b2b_saas_procurement_v1 |
| Canonical seed | 42 |
| Bundle schema version | 5 |
| Format | Parquet (canonical) |
| License | MIT — see LICENSE |
| Public dataset | link |
+

Verify integrity with leadforge validate <bundle_dir>; every file is +hashed in manifest.json.
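As a rough illustration of what a per-file hash check involves, the sketch below streams a file through SHA-256 and compares the result with a recorded digest. It is not how `leadforge validate` is implemented: the helper name `sha256_of` is hypothetical, the sample file is a stand-in, and the layout of the hash entries inside `manifest.json` is not shown here, so the recorded digest is computed inline.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large parquet files are not loaded whole."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Stand-in payload; a real check would point at a file inside the bundle
# and read the expected digest from manifest.json (layout assumed, not shown).
payload = b"not a real parquet file"
recorded_digest = hashlib.sha256(payload).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "train.parquet"
    sample.write_bytes(payload)
    file_ok = sha256_of(sample) == recorded_digest
```

For an actual bundle, prefer the `leadforge validate` command named above; this sketch only shows the underlying idea.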

+
+
+ + + +
+
+ + diff --git a/release/_preview_committed/huggingface_public.html b/release/_preview_committed/huggingface_public.html new file mode 100644 index 0000000..3f9006a --- /dev/null +++ b/release/_preview_committed/huggingface_public.html @@ -0,0 +1,480 @@ + + + + + HF preview — LeadForge: Synthetic B2B Lead Scoring (v1) + + + +
+
+
huggingface.co/datasets
+

LeadForge: Synthetic B2B Lead Scoring (v1)

+
    +
  • License: mit
  • +
  • Task: tabular-classification
  • +
  • Size: 1K<n<10K
  • +
  • Language: en
  • +
+
+
+ b2b crm datasets lead-scoring pandas synthetic-data tabular +
+
+

Configurations / Subsets (3 configs)

+
+ intro (3 splits) + + + + + + + +
| Split | Path |
|---|---|
| train | intro/tasks/converted_within_90_days/train.parquet |
| validation | intro/tasks/converted_within_90_days/valid.parquet |
| test | intro/tasks/converted_within_90_days/test.parquet |
+
+
+ intermediate default (3 splits) + + + + + + + +
| Split | Path |
|---|---|
| train | intermediate/tasks/converted_within_90_days/train.parquet |
| validation | intermediate/tasks/converted_within_90_days/valid.parquet |
| test | intermediate/tasks/converted_within_90_days/test.parquet |
+
+
+ advanced (3 splits) + + + + + + + +
| Split | Path |
|---|---|
| train | advanced/tasks/converted_within_90_days/train.parquet |
| validation | advanced/tasks/converted_within_90_days/valid.parquet |
| test | advanced/tasks/converted_within_90_days/test.parquet |
+
+
+
+

Files declared in YAML (9 files / variant: public)

+
    +
  • [intro] intro/tasks/converted_within_90_days/train.parquet
  • +
  • [intro] intro/tasks/converted_within_90_days/valid.parquet
  • +
  • [intro] intro/tasks/converted_within_90_days/test.parquet
  • +
  • [intermediate] intermediate/tasks/converted_within_90_days/train.parquet
  • +
  • [intermediate] intermediate/tasks/converted_within_90_days/valid.parquet
  • +
  • [intermediate] intermediate/tasks/converted_within_90_days/test.parquet
  • +
  • [advanced] advanced/tasks/converted_within_90_days/train.parquet
  • +
  • [advanced] advanced/tasks/converted_within_90_days/valid.parquet
  • +
  • [advanced] advanced/tasks/converted_within_90_days/test.parquet
  • +
+
+
+

LeadForge: Synthetic B2B Lead Scoring Dataset (leadforge-lead-scoring-v1)

+

A relational, reproducible, three-tier synthetic CRM dataset family for +teaching lead scoring at scale. Generated by +leadforge, an +open-source Python framework for synthetic CRM/funnel data. The +framework version is decoupled from the dataset version: the package +stays at 1.x; the dataset is published under the explicit …-v1 +tag.

+

Why lead scoring matters in 2024–2026

+

Mid-market SaaS vendors entered 2024–2026 with growth slowing and +customer-acquisition costs rising[^macro], so predicting which leads +convert within a fixed window has moved from a marketing nicety to a +survival skill. This dataset teaches that skill on a relational +substrate, with the realistic confusions (snapshot-window discipline, +leakage traps, channel signal weaker than vendor blogs imply) that +students will hit when they finally get hands on real CRM data.

+

[^macro]: Macroeconomic framing summarised in +docs/external_review/summaries/gemini_v2_summary.md +(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio +rose materially in 2024).

+

What's inside

+
.
+├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier
+│   ├── manifest.json                 # provenance + file hashes
+│   ├── dataset_card.md               # auto-rendered per-bundle card
+│   ├── feature_dictionary.csv        # authoritative column spec
+│   ├── lead_scoring.csv              # flat convenience CSV (all splits)
+│   ├── tables/*.parquet              # 7 snapshot-safe relational tables
+│   └── tasks/converted_within_90_days/{train,valid,test}.parquet
+├── README.md                         # this file (HF dataset card)
+├── dataset-cover-image.png           # dataset thumbnail
+└── LICENSE
+
+

student_public bundles ship the snapshot-safe relational view; +research_instructor companions ship the full-horizon view plus the +hidden causal structure (DAG, latent registry, mechanism summary) +under metadata/. The full layout is documented in each bundle's +manifest.json.

+

Quick start

+
import pandas as pd
+
+# Flat CSV
+df = pd.read_csv("intermediate/lead_scoring.csv")
+
+# Parquet task splits (recommended)
+train = pd.read_parquet("intermediate/tasks/converted_within_90_days/train.parquet")
+test  = pd.read_parquet("intermediate/tasks/converted_within_90_days/test.parquet")
+
+# Relational tables (feature engineering — example)
+leads   = pd.read_parquet("intermediate/tables/leads.parquet")
+touches = pd.read_parquet("intermediate/tables/touches.parquet")
+my_touch_count = (
+    touches.groupby("lead_id").size().rename("my_touch_count").reset_index()
+)
+features = leads.merge(my_touch_count, on="lead_id", how="left")
+
+# Reproduce from source
+# pip install leadforge
+# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \
+#                    --mode student_public --difficulty intermediate --out my_bundle
+
+

The label converted_within_90_days resolves over a 90-day window; +engagement features (touch_count, session_count, etc.) are +computed strictly over events on days [0, 30]. The deliberate +exception is total_touches_all, the leakage trap — flagged +leakage_risk=True in feature_dictionary.csv. Drop it from your +feature set unless you're demonstrating leakage detection.
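One way to act on the advice above is to read the leakage flags from the feature dictionary and filter the feature list programmatically. The sketch below is a stdlib-only illustration on a toy dictionary: the `leakage_risk` flag and the three feature names come from this README, while the `name` column header and the helper `safe_feature_names` are assumptions made for the example.

```python
import csv
import io

# Toy stand-in for feature_dictionary.csv. The `name` header is an
# assumption; check the real file for the actual column header.
FEATURE_DICTIONARY = """name,leakage_risk
touch_count,False
session_count,False
total_touches_all,True
"""


def safe_feature_names(dictionary_csv: str) -> list[str]:
    """Return feature names whose leakage_risk flag is not true."""
    rows = csv.DictReader(io.StringIO(dictionary_csv))
    return [
        row["name"]
        for row in rows
        if row["leakage_risk"].strip().lower() != "true"
    ]


features = safe_feature_names(FEATURE_DICTIONARY)
```

With the toy dictionary, `features` keeps `touch_count` and `session_count` and drops `total_touches_all`.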

+

Dataset summary

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| | Intro | Intermediate | Advanced |
|---|---|---|---|
| Leads | 5,000 | 5,000 | 5,000 |
| Accounts | 1,500 | 1,500 | 1,500 |
| Contacts | 4,200 | 4,200 | 4,200 |
| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |
| Target | converted_within_90_days | converted_within_90_days | converted_within_90_days |
| Conversion rate (acceptance band, gate G7.*) | 24–61% | 12–31% | 4–12% |
| Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |
| Signal strength | 0.90 | 0.70 | 0.50 |
| Noise scale | 0.10 | 0.30 | 0.55 |
| Missing rate | 2% | 8% | 18% |
+

* student_public / research_instructor. Difficulty is modulated +by the simulation engine — signal strength on latent-trait weights, +Gaussian noise on float features, MCAR missingness, outlier rate — +not post-hoc label flipping. The acceptance band is the recipe +gate's tolerance window (v1_acceptance_gates_bands.yaml G7.*), +not the achievable range — observed five-seed spreads sit +comfortably inside the band.

+

The scenario

+

Veridian Technologies is a fictional Series B startup (Austin, US) +selling Veridian Procure, a procurement / AP automation SaaS, to +mid-market firms (200–2,000 employees) in the US and UK. The funnel +runs through inbound marketing (45%), SDR outbound (35%), and +partner referrals (20%); four personas drive deals (VP Finance, AP +Manager, IT Director, Procurement Manager). Task: predict whether +a lead converts (closed_won) within 90 days. ACV bands are +$18k–$120k. See +docs/release/generation_method.md +for the full DGP, and the deeper "what's modelled / approximate / not +modelled" breakdown that this README only summarises.

+

Public vs instructor: what's redacted

+

Filtering happens during rendering, not during simulation. The +redaction contract is single-sourced in +leadforge/validation/leakage_probes.py; +the snapshot-safe writer and the validator import the same constants, +so they cannot drift apart.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Source-of-truth constant | Public bundle treatment |
|---|---|
| BANNED_LEAD_COLUMNS = ("converted_within_90_days", "conversion_timestamp") | Dropped from tables/leads.parquet |
| BANNED_OPP_COLUMNS = ("close_outcome", "closed_at") | Dropped from tables/opportunities.parquet |
| BANNED_TABLES = ("customers", "subscriptions") | Omitted from public bundles |
| SNAPSHOT_FILTERED_TABLES (touches, sessions, sales_activities, opportunities) | Filtered per-lead by lead_created_at + snapshot_day |
| Snapshot redaction (current_stage, is_sql) | Stripped from tasks/ splits and tables/leads.parquet |
| total_touches_all (deliberate trap) | Retained in both modes; flagged leakage_risk=True |
+

Each bundle's manifest.json records relational_snapshot_safe, +redacted_columns, and snapshot_day, so the bundle is +self-describing.
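A short sketch of reading those three fields back out of a manifest follows. Only the key names (`relational_snapshot_safe`, `redacted_columns`, `snapshot_day`) come from this README; the toy values, their types, the flat top-level placement of the keys, and the helper `redaction_summary` are assumptions for illustration.

```python
import json

# Toy manifest: key names are from the README, values are illustrative,
# and the real manifest.json may nest these fields differently.
MANIFEST_TEXT = json.dumps(
    {
        "relational_snapshot_safe": True,
        "redacted_columns": ["current_stage", "is_sql"],
        "snapshot_day": 30,
    }
)

REQUIRED_KEYS = ("relational_snapshot_safe", "redacted_columns", "snapshot_day")


def redaction_summary(manifest: dict) -> dict:
    """Pull out the self-describing redaction fields, failing loudly if any is absent."""
    missing = [key for key in REQUIRED_KEYS if key not in manifest]
    if missing:
        raise KeyError(f"manifest is missing: {', '.join(missing)}")
    return {key: manifest[key] for key in REQUIRED_KEYS}


summary = redaction_summary(json.loads(MANIFEST_TEXT))
```

Failing on a missing key, rather than defaulting it, keeps a public bundle from being mistaken for an instructor one.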

+

Calibration

+

Every realism / calibration / difficulty claim in this README is +backed by +validation/validation_report.md, +regenerated by +scripts/validate_release_candidate.py +with bands declared in +docs/release/v1_acceptance_gates_bands.yaml. +Headline cross-seed medians (seeds 42–46):

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Tier | LR AUC | AP | P@100 | Brier |
|---|---|---|---|---|
| intro | 0.879 | 0.761 | 0.80 | 0.130 |
| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |
| advanced | 0.886 | 0.351 | 0.34 | 0.061 |
+

AP, P@100, conversion-rate, and lift orderings hold across the +intended difficulty axis (intro > intermediate > advanced).
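For readers unfamiliar with P@100, it is precision among the 100 highest-scored leads. The sketch below shows the general definition on toy data. It is not the code in `scripts/validate_release_candidate.py`; the function name `precision_at_k` and the toy scores are assumptions, and tie-breaking here simply follows input order.

```python
def precision_at_k(scores: list[float], labels: list[int], k: int) -> float:
    """Fraction of positives among the k highest-scored items."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and the same length")
    # Rank by score, highest first; sorted() is stable, so ties keep input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Toy example: the three highest scores carry labels 1, 0, 1.
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1]
toy_labels = [1, 0, 1, 0, 1]
p_at_3 = precision_at_k(toy_scores, toy_labels, k=3)
```

On the toy data, two of the top three leads converted, so `p_at_3` is 2/3.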

+

Intended uses

+
    +
  • Teaching baseline lead-scoring on a flat snapshot.
  • +
  • Teaching relational feature engineering against snapshot-safe tables.
  • +
  • Teaching leakage detection (the total_touches_all trap is +designed to be discoverable).
  • +
  • Teaching calibration, lift, P@K, value-aware ranking +(expected_acv × P(convert)), and cohort-shift evaluation.
  • +
  • Comparing model families under a controlled DGP.
  • +
+

Out-of-scope uses

+
    +
  • Production lead scoring. The company, product, and customers are +fictional.
  • +
  • Vendor benchmarking / paper baselines. Difficulty tiers are +calibrated for pedagogy, not cross-paper comparability.
  • +
  • Causal-inference research that requires recovery of the true DGP. +The instructor companion exposes the hidden graph for teaching, not +designed counterfactuals.
  • +
  • Demographic / fairness research. v1 does not model protected +attributes.
  • +
+

Known limitations

+
    +
  • Difficulty signal on raw AUC is flat. LR AUC is ~0.88 across +every tier. Difficulty is visible in AP, P@K, Brier, and value +capture. Treat AUC as a sanity check, not a difficulty signal.
  • +
  • GBM does not consistently beat LR (gate G7.4.4). GBM−LR AUC delta +is slightly negative in every tier (intro −0.0045, intermediate +−0.0072, advanced −0.0133); v1's snapshot is dominated by linear +features. v2 will inject non-linear interactions in the simulator.
  • +
  • Channel signal is weak. Per +docs/release/channel_signal_audit.md, +out-of-sample univariate AUC of lead_source is ≈0.50–0.52 across +all tiers and the per-channel rate spread is ≤0.05. The simulator +does not encode channel-conditional probabilities; channel-conditional +encoding is post-v1 work.
  • +
  • Cohort-shift degradation is small. v1 has no time-of-year drift +baked in; the cohort-shift gate (G6.4) is informational and will +bite in v2.
  • +
+

Composition

+
    +
  • Entities. Accounts, contacts, leads, touches, sessions, +sales_activities, opportunities (public); plus customers and +subscriptions (instructor only). Per-row counts per bundle live in +manifest.json.
  • +
  • Features. 32 public columns grouped by analytical role in +docs/release/feature_dictionary.md; +the per-bundle feature_dictionary.csv is the authoritative +machine-readable spec.
  • +
  • Label. converted_within_90_days (boolean), event-derived from +the simulator. Never sampled directly.
  • +
  • Splits. 70/15/15 train/valid/test, deterministic given seed; +recorded in tasks/converted_within_90_days/task_manifest.json. +Group-leakage warning: the splitter is keyed on lead_id only, +not on account_id or contact_id. On the as-shipped intermediate +bundle, 518 of 557 test accounts (≈93 %) also appear in train; +the contact-level overlap is similar in magnitude. A flat baseline +trained on the random split rides account-level signal across the +split boundary. For a generalisation-faithful number, retrain with +GroupKFold(account_id) (or contact_id) and report both — see +break_me_guide.md §5 for the +detection recipe.
  • +
  • Provenance. Recipe b2b_saas_procurement_v1, seed 42, package +version stamped in manifest.json.
  • +
+
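The group-leakage warning above can be reproduced with a simple set intersection. This is a minimal sketch, not the recipe from `break_me_guide.md` §5: `group_overlap` is a hypothetical helper, and the toy account ids are made up for illustration.

```python
def group_overlap(train_groups: list[str], test_groups: list[str]) -> float:
    """Share of distinct test groups that also appear in train."""
    test_set = set(test_groups)
    if not test_set:
        raise ValueError("test_groups must not be empty")
    return len(test_set & set(train_groups)) / len(test_set)


# Toy account ids: one of the two distinct test accounts is also in train.
toy_train_accounts = ["a1", "a2", "a3", "a2"]
toy_test_accounts = ["a2", "a4"]
toy_overlap = group_overlap(toy_train_accounts, toy_test_accounts)

# The figure quoted above: 518 of 557 test accounts also appear in train.
reported_overlap = 518 / 557
```

On a real bundle you would pass the `account_id` columns of the train and test splits. For the retraining step, scikit-learn's `GroupKFold` keeps every group on one side of each fold.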

Maintenance, adversarial framing, license

+

We want the dataset to be broken. The +break-me guide catalogues +nine adversarial patterns to look for (leakage, split +contamination, ranking inversions, calibration drift) with +worked-example pointers back into the notebooks. Issue +templates ship under .github/ISSUE_TEMPLATE/: a +breakage report +form for findings on the bundle itself, and a +realism feedback +form for distributional critiques. Accepted findings are +logged in +docs/release/v2_decision_log.md. +File issues at +leadforge-dev/leadforge; +PRs welcome.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Value |
|---|---|
| Generator | leadforge 1.0.0+ |
| Recipe | b2b_saas_procurement_v1 |
| Canonical seed | 42 (cross-seed sweep: 42–46) |
| Bundle schema version | 5 |
| Format | Parquet (canonical) + CSV (convenience) |
| License | MIT — see LICENSE |
+

Verify integrity with leadforge validate <bundle_dir>; every file +is hashed in manifest.json.

+
+
+ + + +
+
+
+
diff --git a/release/_preview_committed/kaggle.html b/release/_preview_committed/kaggle.html
new file mode 100644
index 0000000..4c61444
--- /dev/null
+++ b/release/_preview_committed/kaggle.html
@@ -0,0 +1,1303 @@
+
+
+
+
+ Kaggle preview — LeadForge: Synthetic B2B Lead Scoring (v1)
+
+
+
+
+
leadforge/leadforge-lead-scoring-v1
+

LeadForge: Synthetic B2B Lead Scoring (v1)

+

Three-tier synthetic CRM funnel for leakage-aware lead scoring

+
    +
  • License: MIT
  • +
  • Updates: never
  • +
  • Visibility: Private
  • +
+
+
+ Dataset cover image +
+
+

LeadForge: Synthetic B2B Lead Scoring Dataset (leadforge-lead-scoring-v1)

+

A relational, reproducible, three-tier synthetic CRM dataset family for +teaching lead scoring at scale. Generated by +leadforge, an +open-source Python framework for synthetic CRM/funnel data. The +framework version is decoupled from the dataset version: the package +stays at 1.x; the dataset is published under the explicit …-v1 +tag.

+

Why lead scoring matters in 2024–2026

+

Mid-market SaaS vendors entered 2024–2026 with growth slowing and +customer-acquisition costs rising[^macro], so predicting which leads +convert within a fixed window has moved from a marketing nicety to a +survival skill. This dataset teaches that skill on a relational +substrate, with the realistic confusions (snapshot-window discipline, +leakage traps, channel signal weaker than vendor blogs imply) that +students will hit when they finally get hands on real CRM data.

+

[^macro]: Macroeconomic framing summarised in +docs/external_review/summaries/gemini_v2_summary.md +(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio +rose materially in 2024).

+

What's inside

+
.
+├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier
+│   ├── manifest.json                 # provenance + file hashes
+│   ├── dataset_card.md               # auto-rendered per-bundle card
+│   ├── feature_dictionary.csv        # authoritative column spec
+│   ├── lead_scoring.csv              # flat convenience CSV (all splits)
+│   ├── tables/*.parquet              # 7 snapshot-safe relational tables
+│   └── tasks/converted_within_90_days/{train,valid,test}.parquet
+├── dataset-metadata.json             # Kaggle dataset metadata
+├── dataset-cover-image.png           # Kaggle cover image
+├── README.md                         # Kaggle package README
+└── LICENSE
+
+

student_public bundles ship the snapshot-safe relational view; +research_instructor companions ship the full-horizon view plus the +hidden causal structure (DAG, latent registry, mechanism summary) +under metadata/. The full layout is documented in each bundle's +manifest.json.

+

Quick start

+
import pandas as pd
+
+# Flat CSV
+df = pd.read_csv("intermediate/lead_scoring.csv")
+
+# Parquet task splits (recommended)
+train = pd.read_parquet("intermediate/tasks/converted_within_90_days/train.parquet")
+test  = pd.read_parquet("intermediate/tasks/converted_within_90_days/test.parquet")
+
+# Relational tables (feature engineering — example)
+leads   = pd.read_parquet("intermediate/tables/leads.parquet")
+touches = pd.read_parquet("intermediate/tables/touches.parquet")
+my_touch_count = (
+    touches.groupby("lead_id").size().rename("my_touch_count").reset_index()
+)
+features = leads.merge(my_touch_count, on="lead_id", how="left")
+
+# Reproduce from source
+# pip install leadforge
+# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \
+#                    --mode student_public --difficulty intermediate --out my_bundle
+
+

The label converted_within_90_days resolves over a 90-day window; +engagement features (touch_count, session_count, etc.) are +computed strictly over events on days [0, 30]. The deliberate +exception is total_touches_all, the leakage trap — flagged +leakage_risk=True in feature_dictionary.csv. Drop it from your +feature set unless you're demonstrating leakage detection.
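As a minimal sketch of that advice, the snippet below reads a feature dictionary and lists the columns to exclude. Only the `leakage_risk` flag and `total_touches_all` come from the text above; the `name` column header, the inline sample rows, and the helper `leakage_columns` are assumptions made for illustration, so check the header of your bundle's `feature_dictionary.csv` first.

```python
import csv
import io

# Hypothetical excerpt of feature_dictionary.csv. "name" is an assumed
# header; "leakage_risk" is the flag named in the README.
SAMPLE_DICTIONARY = """name,leakage_risk
touch_count,False
session_count,False
total_touches_all,True
"""


def leakage_columns(dictionary_file, name_col="name", flag_col="leakage_risk"):
    """Return the feature names whose leakage flag is truthy."""
    reader = csv.DictReader(dictionary_file)
    return [
        row[name_col]
        for row in reader
        if row[flag_col].strip().lower() in {"true", "1", "yes"}
    ]


to_drop = leakage_columns(io.StringIO(SAMPLE_DICTIONARY))
feature_names = ["touch_count", "session_count", "total_touches_all"]
safe_features = [c for c in feature_names if c not in to_drop]
print(to_drop)
print(safe_features)
```

With the Quick start frames loaded, the equivalent step would be `train.drop(columns=to_drop)`.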

+

Dataset summary

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| | Intro | Intermediate | Advanced |
|---|---|---|---|
| Leads | 5,000 | 5,000 | 5,000 |
| Accounts | 1,500 | 1,500 | 1,500 |
| Contacts | 4,200 | 4,200 | 4,200 |
| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |
| Target | converted_within_90_days | converted_within_90_days | converted_within_90_days |
| Conversion rate (acceptance band, gate G7.*) | 24–61% | 12–31% | 4–12% |
| Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |
| Signal strength | 0.90 | 0.70 | 0.50 |
| Noise scale | 0.10 | 0.30 | 0.55 |
| Missing rate | 2% | 8% | 18% |
+

* student_public / research_instructor. Difficulty is modulated +by the simulation engine — signal strength on latent-trait weights, +Gaussian noise on float features, MCAR missingness, outlier rate — +not post-hoc label flipping. The acceptance band is the recipe +gate's tolerance window (v1_acceptance_gates_bands.yaml G7.*), +not the achievable range — observed five-seed spreads sit +comfortably inside the band.

+

The scenario

+

Veridian Technologies is a fictional Series B startup (Austin, US) +selling Veridian Procure, a procurement / AP automation SaaS, to +mid-market firms (200–2,000 employees) in the US and UK. The funnel +runs through inbound marketing (45%), SDR outbound (35%), and +partner referrals (20%); four personas drive deals (VP Finance, AP +Manager, IT Director, Procurement Manager). Task: predict whether +a lead converts (closed_won) within 90 days. ACV bands are +$18k–$120k. See +docs/release/generation_method.md +for the full DGP, and the deeper "what's modelled / approximate / not +modelled" breakdown that this README only summarises.

+

Public vs instructor: what's redacted

+

Filtering happens during rendering, not during simulation. The +redaction contract is single-sourced in +leadforge/validation/leakage_probes.py; +the snapshot-safe writer and the validator import the same constants, +so they cannot drift apart.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Source-of-truth constant | Public bundle treatment |
|---|---|
| BANNED_LEAD_COLUMNS = ("converted_within_90_days", "conversion_timestamp") | Dropped from tables/leads.parquet |
| BANNED_OPP_COLUMNS = ("close_outcome", "closed_at") | Dropped from tables/opportunities.parquet |
| BANNED_TABLES = ("customers", "subscriptions") | Omitted from public bundles |
| SNAPSHOT_FILTERED_TABLES (touches, sessions, sales_activities, opportunities) | Filtered per-lead by lead_created_at + snapshot_day |
| Snapshot redaction (current_stage, is_sql) | Stripped from tasks/ splits and tables/leads.parquet |
| total_touches_all (deliberate trap) | Retained in both modes; flagged leakage_risk=True |
+

Each bundle's manifest.json records relational_snapshot_safe, +redacted_columns, and snapshot_day, so the bundle is +self-describing.
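A reader can spot-check the redaction contract by comparing a table's column list against the banned names. The sketch below copies the two constants from the table above rather than importing them from `leadforge`, and `banned_columns_present` is a hypothetical helper; the column lists are the ones shown for the public tables on this page.

```python
# Copied from the redaction table for illustration; the source of truth
# lives in leadforge/validation/leakage_probes.py.
BANNED_LEAD_COLUMNS = ("converted_within_90_days", "conversion_timestamp")
BANNED_OPP_COLUMNS = ("close_outcome", "closed_at")


def banned_columns_present(columns, banned):
    """Return the banned column names that appear in a table, sorted."""
    return sorted(set(columns) & set(banned))


public_leads_columns = [
    "lead_id", "contact_id", "account_id", "lead_created_at",
    "lead_source", "first_touch_channel", "owner_rep_id",
]
public_opp_columns = ["opportunity_id", "lead_id", "created_at", "stage", "estimated_acv"]

print(banned_columns_present(public_leads_columns, BANNED_LEAD_COLUMNS))
print(banned_columns_present(public_opp_columns, BANNED_OPP_COLUMNS))
# A table that still carried a banned column would be reported:
print(banned_columns_present(public_opp_columns + ["closed_at"], BANNED_OPP_COLUMNS))
```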

+

Calibration

+

Every realism / calibration / difficulty claim in this README is +backed by +validation/validation_report.md, +regenerated by +scripts/validate_release_candidate.py +with bands declared in +docs/release/v1_acceptance_gates_bands.yaml. +Headline cross-seed medians (seeds 42–46):

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Tier | LR AUC | AP | P@100 | Brier |
|---|---|---|---|---|
| intro | 0.879 | 0.761 | 0.80 | 0.130 |
| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |
| advanced | 0.886 | 0.351 | 0.34 | 0.061 |
+

AP, P@100, conversion-rate, and lift orderings hold across the +intended difficulty axis (intro > intermediate > advanced).
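For readers unfamiliar with two of the headline metrics, here is a minimal sketch using the textbook definitions of the Brier score and precision at K. That the validation report computes them exactly this way is an assumption; the labels and scores are toy values.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def precision_at_k(y_true, y_prob, k):
    """Fraction of positives among the k highest-scored rows."""
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Toy labels and scores (hypothetical, not taken from the bundle).
y_true = [1, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.3, 0.1]

print(round(brier_score(y_true, y_prob), 3))
print(precision_at_k(y_true, y_prob, k=2))
```

A lower Brier score at a lower base rate is expected, which is one reason to read it alongside the per-tier conversion rate.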

+

Intended uses

+
    +
  • Teaching baseline lead-scoring on a flat snapshot.
  • +
  • Teaching relational feature engineering against snapshot-safe tables.
  • +
  • Teaching leakage detection (the total_touches_all trap is +designed to be discoverable).
  • +
  • Teaching calibration, lift, P@K, value-aware ranking +(expected_acv × P(convert)), and cohort-shift evaluation.
  • +
  • Comparing model families under a controlled DGP.
  • +
+
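The value-aware ranking mentioned in the list above (expected_acv × P(convert)) can be sketched as follows. The dictionary layout, the `p_convert` key, and the numbers are hypothetical; the sketch also assumes `expected_acv` has no missing values, which the schema says is not guaranteed.

```python
def rank_by_expected_value(leads):
    """Order leads by expected_acv * p_convert, highest first."""
    return sorted(
        leads,
        key=lambda lead: lead["expected_acv"] * lead["p_convert"],
        reverse=True,
    )


# Toy leads with model scores (hypothetical values).
leads = [
    {"lead_id": "A", "expected_acv": 20_000, "p_convert": 0.6},
    {"lead_id": "B", "expected_acv": 100_000, "p_convert": 0.2},
    {"lead_id": "C", "expected_acv": 50_000, "p_convert": 0.3},
]

by_value = [lead["lead_id"] for lead in rank_by_expected_value(leads)]
by_probability = [
    lead["lead_id"]
    for lead in sorted(leads, key=lambda lead: lead["p_convert"], reverse=True)
]
print(by_value)
print(by_probability)
```

The two orderings disagree on this toy data, which is the point of ranking by value rather than by probability alone.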

Out-of-scope uses

+
    +
  • Production lead scoring. The company, product, and customers are +fictional.
  • +
  • Vendor benchmarking / paper baselines. Difficulty tiers are +calibrated for pedagogy, not cross-paper comparability.
  • +
  • Causal-inference research that requires recovery of the true DGP. +The instructor companion exposes the hidden graph for teaching, not +designed counterfactuals.
  • +
  • Demographic / fairness research. v1 does not model protected +attributes.
  • +
+

Known limitations

+
    +
  • Difficulty signal on raw AUC is flat. LR AUC is ~0.88 across +every tier. Difficulty is visible in AP, P@K, Brier, and value +capture. Treat AUC as a sanity check, not a difficulty signal.
  • +
  • GBM does not consistently beat LR (gate G7.4.4). GBM−LR AUC delta +is slightly negative in every tier (intro −0.0045, intermediate +−0.0072, advanced −0.0133); v1's snapshot is dominated by linear +features. v2 will inject non-linear interactions in the simulator.
  • +
  • Channel signal is weak. Per +docs/release/channel_signal_audit.md, +out-of-sample univariate AUC of lead_source is ≈0.50–0.52 across +all tiers and the per-channel rate spread is ≤0.05. The simulator +does not encode channel-conditional probabilities; channel-conditional +encoding is post-v1 work.
  • +
  • Cohort-shift degradation is small. v1 has no time-of-year drift +baked in; the cohort-shift gate (G6.4) is informational and will +bite in v2.
  • +
+

Composition

+
    +
  • Entities. Accounts, contacts, leads, touches, sessions, +sales_activities, opportunities (public); plus customers and +subscriptions (instructor only). Per-row counts per bundle live in +manifest.json.
  • +
  • Features. 32 public columns grouped by analytical role in +docs/release/feature_dictionary.md; +the per-bundle feature_dictionary.csv is the authoritative +machine-readable spec.
  • +
  • Label. converted_within_90_days (boolean), event-derived from +the simulator. Never sampled directly.
  • +
  • Splits. 70/15/15 train/valid/test, deterministic given seed; +recorded in tasks/converted_within_90_days/task_manifest.json. +Group-leakage warning: the splitter is keyed on lead_id only, +not on account_id or contact_id. On the as-shipped intermediate +bundle, 518 of 557 test accounts (≈93 %) also appear in train; +the contact-level overlap is similar in magnitude. A flat baseline +trained on the random split rides account-level signal across the +split boundary. For a generalisation-faithful number, retrain with +GroupKFold(account_id) (or contact_id) and report both — see +break_me_guide.md §5 for the +detection recipe.
  • +
  • Provenance. Recipe b2b_saas_procurement_v1, seed 42, package +version stamped in manifest.json.
  • +
+
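To illustrate the group-aware alternative recommended in the Splits item, the sketch below holds out whole accounts so that no `account_id` straddles the split boundary. It is a plain-Python stand-in, not scikit-learn's `GroupKFold`; `group_holdout_split` and the toy rows are hypothetical, and `test_fraction` applies to groups rather than rows.

```python
import random


def group_holdout_split(rows, group_key, test_fraction=0.15, seed=42):
    """Assign every row of a group to the same side of the split."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test


# Toy leads: 20 leads spread over 10 accounts (hypothetical data).
rows = [{"lead_id": f"l{i}", "account_id": f"a{i % 10}"} for i in range(20)]
train_rows, test_rows = group_holdout_split(rows, "account_id")

train_accounts = {row["account_id"] for row in train_rows}
test_accounts = {row["account_id"] for row in test_rows}
print(len(train_rows), len(test_rows), train_accounts & test_accounts)
```

Reporting the metric on both the shipped random split and a group-aware split shows how much of the score rides on account-level signal.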

Maintenance, adversarial framing, license

+

We want the dataset to be broken. The +break-me guide catalogues +nine adversarial patterns to look for (leakage, split +contamination, ranking inversions, calibration drift) with +worked-example pointers back into the notebooks. Issue +templates ship under .github/ISSUE_TEMPLATE/: a +breakage report +form for findings on the bundle itself, and a +realism feedback +form for distributional critiques. Accepted findings are +logged in +docs/release/v2_decision_log.md. +File issues at +leadforge-dev/leadforge; +PRs welcome.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Field | Value |
|---|---|
| Generator | leadforge 1.0.0+ |
| Recipe | b2b_saas_procurement_v1 |
| Canonical seed | 42 (cross-seed sweep: 42–46) |
| Bundle schema version | 5 |
| Format | Parquet (canonical) + CSV (convenience) |
| License | MIT — see LICENSE |
+

Verify integrity with leadforge validate <bundle_dir>; every file +is hashed in manifest.json.
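For readers who want to see what such a check involves, the sketch below recomputes file hashes and compares them with recorded values. It is not the `leadforge validate` implementation: the use of SHA-256, the flat path-to-hash mapping, and both helper names are assumptions, and the real layout of `manifest.json` may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large Parquet files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(bundle_dir, expected_hashes):
    """Return relative paths whose current hash differs from the recorded one."""
    bundle_dir = Path(bundle_dir)
    return [
        rel
        for rel, expected in expected_hashes.items()
        if sha256_of(bundle_dir / rel) != expected
    ]


# Self-contained demo on a temporary directory (hypothetical file content).
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    target = bundle / "lead_scoring.csv"
    target.write_bytes(b"lead_id,converted_within_90_days\n")
    expected = {"lead_scoring.csv": sha256_of(target)}
    clean = mismatched_files(bundle, expected)
    target.write_bytes(b"tampered\n")
    dirty = mismatched_files(bundle, expected)

print(clean, dirty)
```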

+
+
+

Data Files (42 total)

+
+ intro/ (14 files) +
    +
  • intro/lead_scoring.csv: Intro tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.
  • +
  • intro/feature_dictionary.csv: Intro tier feature dictionary (canonical column spec).
  • +
  • intro/tasks/converted_within_90_days/train.parquet: Intro tier train split for `converted_within_90_days` (3,500 rows).
  • +
  • intro/tasks/converted_within_90_days/valid.parquet: Intro tier valid split for `converted_within_90_days` (750 rows).
  • +
  • intro/tasks/converted_within_90_days/test.parquet: Intro tier test split for `converted_within_90_days` (750 rows).
  • +
  • intro/tables/accounts.parquet: Intro tier `accounts` relational table (1,500 rows) — snapshot-safe.
  • +
  • intro/tables/contacts.parquet: Intro tier `contacts` relational table (4,200 rows) — snapshot-safe.
  • +
  • intro/tables/leads.parquet: Intro tier `leads` relational table (5,000 rows) — snapshot-safe.
  • +
  • intro/tables/touches.parquet: Intro tier `touches` relational table (38,561 rows) — snapshot-safe.
  • +
  • intro/tables/sessions.parquet: Intro tier `sessions` relational table (10,171 rows) — snapshot-safe.
  • +
  • intro/tables/sales_activities.parquet: Intro tier `sales_activities` relational table (21,358 rows) — snapshot-safe.
  • +
  • intro/tables/opportunities.parquet: Intro tier `opportunities` relational table (4,426 rows) — snapshot-safe.
  • +
  • intro/dataset_card.md: Intro tier auto-rendered dataset card.
  • +
  • intro/manifest.json: Intro tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).
  • +
+
+
+ intermediate/ (14 files) +
    +
  • intermediate/lead_scoring.csv: Intermediate tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.
  • +
  • intermediate/feature_dictionary.csv: Intermediate tier feature dictionary (canonical column spec).
  • +
  • intermediate/tasks/converted_within_90_days/train.parquet: Intermediate tier train split for `converted_within_90_days` (3,500 rows).
  • +
  • intermediate/tasks/converted_within_90_days/valid.parquet: Intermediate tier valid split for `converted_within_90_days` (750 rows).
  • +
  • intermediate/tasks/converted_within_90_days/test.parquet: Intermediate tier test split for `converted_within_90_days` (750 rows).
  • +
  • intermediate/tables/accounts.parquet: Intermediate tier `accounts` relational table (1,500 rows) — snapshot-safe.
  • +
  • intermediate/tables/contacts.parquet: Intermediate tier `contacts` relational table (4,200 rows) — snapshot-safe.
  • +
  • intermediate/tables/leads.parquet: Intermediate tier `leads` relational table (5,000 rows) — snapshot-safe.
  • +
  • intermediate/tables/touches.parquet: Intermediate tier `touches` relational table (38,724 rows) — snapshot-safe.
  • +
  • intermediate/tables/sessions.parquet: Intermediate tier `sessions` relational table (10,012 rows) — snapshot-safe.
  • +
  • intermediate/tables/sales_activities.parquet: Intermediate tier `sales_activities` relational table (20,679 rows) — snapshot-safe.
  • +
  • intermediate/tables/opportunities.parquet: Intermediate tier `opportunities` relational table (4,255 rows) — snapshot-safe.
  • +
  • intermediate/dataset_card.md: Intermediate tier auto-rendered dataset card.
  • +
  • intermediate/manifest.json: Intermediate tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).
  • +
+
+
+ advanced/ (14 files) +
    +
  • advanced/lead_scoring.csv: Advanced tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.
  • +
  • advanced/feature_dictionary.csv: Advanced tier feature dictionary (canonical column spec).
  • +
  • advanced/tasks/converted_within_90_days/train.parquet: Advanced tier train split for `converted_within_90_days` (3,500 rows).
  • +
  • advanced/tasks/converted_within_90_days/valid.parquet: Advanced tier valid split for `converted_within_90_days` (750 rows).
  • +
  • advanced/tasks/converted_within_90_days/test.parquet: Advanced tier test split for `converted_within_90_days` (750 rows).
  • +
  • advanced/tables/accounts.parquet: Advanced tier `accounts` relational table (1,500 rows) — snapshot-safe.
  • +
  • advanced/tables/contacts.parquet: Advanced tier `contacts` relational table (4,200 rows) — snapshot-safe.
  • +
  • advanced/tables/leads.parquet: Advanced tier `leads` relational table (5,000 rows) — snapshot-safe.
  • +
  • advanced/tables/touches.parquet: Advanced tier `touches` relational table (38,208 rows) — snapshot-safe.
  • +
  • advanced/tables/sessions.parquet: Advanced tier `sessions` relational table (9,942 rows) — snapshot-safe.
  • +
  • advanced/tables/sales_activities.parquet: Advanced tier `sales_activities` relational table (19,995 rows) — snapshot-safe.
  • +
  • advanced/tables/opportunities.parquet: Advanced tier `opportunities` relational table (4,004 rows) — snapshot-safe.
  • +
  • advanced/dataset_card.md: Advanced tier auto-rendered dataset card.
  • +
  • advanced/manifest.json: Advanced tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).
  • +
+
+
+
+

Schema / Columns (534 columns across 33 tabular files)

+
+ intro/lead_scoring.csv (33 columns) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| split | string | Task-split membership: one of `train`, `valid`, `test`. Matches the per-row split assignment in `tasks/converted_within_90_days/`. |
| account_id | string | Opaque account identifier. |
| industry | string | Industry vertical of the buying organization. |
| region | string | Geographic region of the account's headquarters. |
| employee_band | string | Banded employee headcount of the account. |
| estimated_revenue_band | string | Banded estimated annual revenue of the account. |
| process_maturity_band | string | Banded internal process maturity score (latent). |
| contact_id | string | Opaque contact identifier. |
| role_function | string | Functional area of the primary contact (e.g. finance, ops). |
| seniority | string | Seniority band of the primary contact. |
| buyer_role | string | Buyer role classification (economic_buyer, champion, etc.). |
| lead_id | string | Opaque lead identifier. |
| lead_created_at | string | ISO-8601 timestamp when the lead was created. |
| lead_source | string | Origination source of the lead (e.g. inbound_form, sdr_outbound). |
| first_touch_channel | string | Marketing channel responsible for the first recorded touch. |
| touch_count | integer | Total number of marketing/sales touches recorded before snapshot. |
| inbound_touch_count | integer | Number of inbound touches before snapshot. |
| outbound_touch_count | integer | Number of outbound touches before snapshot. |
| session_count | integer | Number of web/trial sessions recorded before snapshot. |
| pricing_page_views | integer | Cumulative pricing page views across all sessions before snapshot. |
| demo_page_views | integer | Cumulative demo page views across all sessions before snapshot. |
| total_session_duration_seconds | integer | Sum of session durations (seconds) before snapshot. |
| touches_week_1 | integer | Number of touches in the first 7 days after lead creation. |
| touches_last_7_days | integer | Number of touches in the last 7 days before snapshot cutoff. |
| days_since_first_touch | number | Days between first touch and snapshot cutoff (NaN if no touches). |
| activity_count | integer | Number of sales activities logged before snapshot. |
| days_since_last_touch | number | Days elapsed between most recent touch and snapshot cutoff. |
| opportunity_created | boolean | Whether any opportunity was created by snapshot date (open or closed). |
| has_open_opportunity | boolean | Whether an open opportunity existed at snapshot date. |
| opportunity_estimated_acv | number | Estimated ACV of the most recent open opportunity (NaN if none). |
| expected_acv | number | Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available). |
| total_touches_all | integer | Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only. |
| converted_within_90_days | boolean | Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events. |
+
+
+ intro/tasks/converted_within_90_days/train.parquet (32 columns) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| account_id | string | |
| industry | string | |
| region | string | |
| employee_band | string | |
| estimated_revenue_band | string | |
| process_maturity_band | string | |
| contact_id | string | |
| role_function | string | |
| seniority | string | |
| buyer_role | string | |
| lead_id | string | |
| lead_created_at | string | |
| lead_source | string | |
| first_touch_channel | string | |
| touch_count | number | |
| inbound_touch_count | number | |
| outbound_touch_count | number | |
| session_count | number | |
| pricing_page_views | number | |
| demo_page_views | number | |
| total_session_duration_seconds | number | |
| touches_week_1 | number | |
| touches_last_7_days | number | |
| days_since_first_touch | number | |
| activity_count | number | |
| days_since_last_touch | number | |
| opportunity_created | boolean | |
| has_open_opportunity | boolean | |
| opportunity_estimated_acv | number | |
| expected_acv | number | |
| total_touches_all | number | |
| converted_within_90_days | boolean | |
+
+
+ intro/tasks/converted_within_90_days/valid.parquet (32 columns) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| account_id | string | |
| industry | string | |
| region | string | |
| employee_band | string | |
| estimated_revenue_band | string | |
| process_maturity_band | string | |
| contact_id | string | |
| role_function | string | |
| seniority | string | |
| buyer_role | string | |
| lead_id | string | |
| lead_created_at | string | |
| lead_source | string | |
| first_touch_channel | string | |
| touch_count | number | |
| inbound_touch_count | number | |
| outbound_touch_count | number | |
| session_count | number | |
| pricing_page_views | number | |
| demo_page_views | number | |
| total_session_duration_seconds | number | |
| touches_week_1 | number | |
| touches_last_7_days | number | |
| days_since_first_touch | number | |
| activity_count | number | |
| days_since_last_touch | number | |
| opportunity_created | boolean | |
| has_open_opportunity | boolean | |
| opportunity_estimated_acv | number | |
| expected_acv | number | |
| total_touches_all | number | |
| converted_within_90_days | boolean | |
+
+
+ intro/tasks/converted_within_90_days/test.parquet (32 columns) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| account_id | string | |
| industry | string | |
| region | string | |
| employee_band | string | |
| estimated_revenue_band | string | |
| process_maturity_band | string | |
| contact_id | string | |
| role_function | string | |
| seniority | string | |
| buyer_role | string | |
| lead_id | string | |
| lead_created_at | string | |
| lead_source | string | |
| first_touch_channel | string | |
| touch_count | number | |
| inbound_touch_count | number | |
| outbound_touch_count | number | |
| session_count | number | |
| pricing_page_views | number | |
| demo_page_views | number | |
| total_session_duration_seconds | number | |
| touches_week_1 | number | |
| touches_last_7_days | number | |
| days_since_first_touch | number | |
| activity_count | number | |
| days_since_last_touch | number | |
| opportunity_created | boolean | |
| has_open_opportunity | boolean | |
| opportunity_estimated_acv | number | |
| expected_acv | number | |
| total_touches_all | number | |
| converted_within_90_days | boolean | |
+
+
+ intro/tables/accounts.parquet (8 columns) + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| account_id | string | |
| company_name | string | |
| industry | string | |
| region | string | |
| employee_band | string | |
| estimated_revenue_band | string | |
| process_maturity_band | string | |
| created_at | string | |
+
+
+ intro/tables/contacts.parquet (8 columns) + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| contact_id | string | |
| account_id | string | |
| job_title | string | |
| role_function | string | |
| seniority | string | |
| buyer_role | string | |
| email_domain_type | string | |
| created_at | string | |
+
+
+ intro/tables/leads.parquet (7 columns) + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| lead_id | string | |
| contact_id | string | |
| account_id | string | |
| lead_created_at | string | |
| lead_source | string | |
| first_touch_channel | string | |
| owner_rep_id | string | |
+
+
+ intro/tables/touches.parquet (7 columns) + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| touch_id | string | |
| lead_id | string | |
| touch_timestamp | string | |
| touch_type | string | |
| touch_channel | string | |
| touch_direction | string | |
| campaign_id | string | |
+
+
+ intro/tables/sessions.parquet (8 columns) + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| session_id | string | |
| lead_id | string | |
| session_timestamp | string | |
| session_type | string | |
| page_views | integer | |
| pricing_page_views | integer | |
| demo_page_views | integer | |
| session_duration_seconds | integer | |
+
+
+ intro/tables/sales_activities.parquet (6 columns) + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| activity_id | string | |
| lead_id | string | |
| rep_id | string | |
| activity_timestamp | string | |
| activity_type | string | |
| activity_outcome | string | |
+
+
+ intro/tables/opportunities.parquet (5 columns) + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| opportunity_id | string | |
| lead_id | string | |
| created_at | string | |
| stage | string | |
| estimated_acv | integer | |
+
+
+ intermediate/lead_scoring.csv (33 columns) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| split | string | Task-split membership: one of `train`, `valid`, `test`. Matches the per-row split assignment in `tasks/converted_within_90_days/`. |
| account_id | string | Opaque account identifier. |
| industry | string | Industry vertical of the buying organization. |
| region | string | Geographic region of the account's headquarters. |
| employee_band | string | Banded employee headcount of the account. |
| estimated_revenue_band | string | Banded estimated annual revenue of the account. |
| process_maturity_band | string | Banded internal process maturity score (latent). |
| contact_id | string | Opaque contact identifier. |
| role_function | string | Functional area of the primary contact (e.g. finance, ops). |
| seniority | string | Seniority band of the primary contact. |
| buyer_role | string | Buyer role classification (economic_buyer, champion, etc.). |
| lead_id | string | Opaque lead identifier. |
| lead_created_at | string | ISO-8601 timestamp when the lead was created. |
| lead_source | string | Origination source of the lead (e.g. inbound_form, sdr_outbound). |
| first_touch_channel | string | Marketing channel responsible for the first recorded touch. |
| touch_count | integer | Total number of marketing/sales touches recorded before snapshot. |
| inbound_touch_count | integer | Number of inbound touches before snapshot. |
| outbound_touch_count | integer | Number of outbound touches before snapshot. |
| session_count | integer | Number of web/trial sessions recorded before snapshot. |
| pricing_page_views | integer | Cumulative pricing page views across all sessions before snapshot. |
| demo_page_views | integer | Cumulative demo page views across all sessions before snapshot. |
| total_session_duration_seconds | integer | Sum of session durations (seconds) before snapshot. |
| touches_week_1 | integer | Number of touches in the first 7 days after lead creation. |
| touches_last_7_days | integer | Number of touches in the last 7 days before snapshot cutoff. |
| days_since_first_touch | number | Days between first touch and snapshot cutoff (NaN if no touches). |
| activity_count | integer | Number of sales activities logged before snapshot. |
| days_since_last_touch | number | Days elapsed between most recent touch and snapshot cutoff. |
| opportunity_created | boolean | Whether any opportunity was created by snapshot date (open or closed). |
| has_open_opportunity | boolean | Whether an open opportunity existed at snapshot date. |
| opportunity_estimated_acv | number | Estimated ACV of the most recent open opportunity (NaN if none). |
| expected_acv | number | Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available). |
| total_touches_all | integer | Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only. |
| converted_within_90_days | boolean | Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events. |
+
+
+ intermediate/tasks/converted_within_90_days/train.parquet (32 columns) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| account_id | string | |
| industry | string | |
| region | string | |
| employee_band | string | |
| estimated_revenue_band | string | |
| process_maturity_band | string | |
| contact_id | string | |
| role_function | string | |
| seniority | string | |
| buyer_role | string | |
| lead_id | string | |
| lead_created_at | string | |
| lead_source | string | |
| first_touch_channel | string | |
| touch_count | number | |
| inbound_touch_count | number | |
| outbound_touch_count | number | |
| session_count | number | |
| pricing_page_views | number | |
| demo_page_views | number | |
| total_session_duration_seconds | number | |
| touches_week_1 | number | |
| touches_last_7_days | number | |
| days_since_first_touch | number | |
| activity_count | number | |
| days_since_last_touch | number | |
| opportunity_created | boolean | |
| has_open_opportunity | boolean | |
| opportunity_estimated_acv | number | |
| expected_acv | number | |
| total_touches_all | number | |
| converted_within_90_days | boolean | |
+
+
+ intermediate/tasks/converted_within_90_days/valid.parquet (32 columns) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| account_id | string | |
| industry | string | |
| region | string | |
| employee_band | string | |
| estimated_revenue_band | string | |
| process_maturity_band | string | |
| contact_id | string | |
| role_function | string | |
| seniority | string | |
| buyer_role | string | |
| lead_id | string | |
| lead_created_at | string | |
| lead_source | string | |
| first_touch_channel | string | |
| touch_count | number | |
| inbound_touch_count | number | |
| outbound_touch_count | number | |
| session_count | number | |
| pricing_page_views | number | |
| demo_page_views | number | |
| total_session_duration_seconds | number | |
| touches_week_1 | number | |
| touches_last_7_days | number | |
| days_since_first_touch | number | |
| activity_count | number | |
| days_since_last_touch | number | |
| opportunity_created | boolean | |
| has_open_opportunity | boolean | |
| opportunity_estimated_acv | number | |
| expected_acv | number | |
| total_touches_all | number | |
| converted_within_90_days | boolean | |
+
+
+ intermediate/tasks/converted_within_90_days/test.parquet (32 columns) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| account_id | string | |
| industry | string | |
| region | string | |
| employee_band | string | |
| estimated_revenue_band | string | |
| process_maturity_band | string | |
| contact_id | string | |
| role_function | string | |
| seniority | string | |
| buyer_role | string | |
| lead_id | string | |
| lead_created_at | string | |
| lead_source | string | |
| first_touch_channel | string | |
| touch_count | number | |
| inbound_touch_count | number | |
| outbound_touch_count | number | |
| session_count | number | |
| pricing_page_views | number | |
| demo_page_views | number | |
| total_session_duration_seconds | number | |
| touches_week_1 | number | |
| touches_last_7_days | number | |
| days_since_first_touch | number | |
| activity_count | number | |
| days_since_last_touch | number | |
| opportunity_created | boolean | |
| has_open_opportunity | boolean | |
| opportunity_estimated_acv | number | |
| expected_acv | number | |
| total_touches_all | number | |
| converted_within_90_days | boolean | |
+
+
+ intermediate/tables/accounts.parquet (8 columns) + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| account_id | string | |
| company_name | string | |
| industry | string | |
| region | string | |
| employee_band | string | |
| estimated_revenue_band | string | |
| process_maturity_band | string | |
| created_at | string | |
+
+
+ intermediate/tables/contacts.parquet (8 columns) + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| contact_id | string | |
| account_id | string | |
| job_title | string | |
| role_function | string | |
| seniority | string | |
| buyer_role | string | |
| email_domain_type | string | |
| created_at | string | |
+
+
+ intermediate/tables/leads.parquet (7 columns) + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| lead_id | string | |
| contact_id | string | |
| account_id | string | |
| lead_created_at | string | |
| lead_source | string | |
| first_touch_channel | string | |
| owner_rep_id | string | |
+
+
+ intermediate/tables/touches.parquet (7 columns) + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| touch_id | string | |
| lead_id | string | |
| touch_timestamp | string | |
| touch_type | string | |
| touch_channel | string | |
| touch_direction | string | |
| campaign_id | string | |
+
+
+ intermediate/tables/sessions.parquet (8 columns) + + + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| session_id | string | |
| lead_id | string | |
| session_timestamp | string | |
| session_type | string | |
| page_views | integer | |
| pricing_page_views | integer | |
| demo_page_views | integer | |
| session_duration_seconds | integer | |
+
+
+ intermediate/tables/sales_activities.parquet (6 columns) + + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| activity_id | string | |
| lead_id | string | |
| rep_id | string | |
| activity_timestamp | string | |
| activity_type | string | |
| activity_outcome | string | |
+
+
+ intermediate/tables/opportunities.parquet (5 columns) + + + + + + + + + +
| Column | Type | Description |
|---|---|---|
| opportunity_id | string | |
| lead_id | string | |
| created_at | string | |
| stage | string | |
| estimated_acv | integer | |
+
+
+ advanced/lead_scoring.csv (33 columns)
+| Column | Type | Description |
+|---|---|---|
+| split | string | Task-split membership: one of `train`, `valid`, `test`. Matches the per-row split assignment in `tasks/converted_within_90_days/`. |
+| account_id | string | Opaque account identifier. |
+| industry | string | Industry vertical of the buying organization. |
+| region | string | Geographic region of the account's headquarters. |
+| employee_band | string | Banded employee headcount of the account. |
+| estimated_revenue_band | string | Banded estimated annual revenue of the account. |
+| process_maturity_band | string | Banded internal process maturity score (latent). |
+| contact_id | string | Opaque contact identifier. |
+| role_function | string | Functional area of the primary contact (e.g. finance, ops). |
+| seniority | string | Seniority band of the primary contact. |
+| buyer_role | string | Buyer role classification (economic_buyer, champion, etc.). |
+| lead_id | string | Opaque lead identifier. |
+| lead_created_at | string | ISO-8601 timestamp when the lead was created. |
+| lead_source | string | Origination source of the lead (e.g. inbound_form, sdr_outbound). |
+| first_touch_channel | string | Marketing channel responsible for the first recorded touch. |
+| touch_count | integer | Total number of marketing/sales touches recorded before snapshot. |
+| inbound_touch_count | integer | Number of inbound touches before snapshot. |
+| outbound_touch_count | integer | Number of outbound touches before snapshot. |
+| session_count | integer | Number of web/trial sessions recorded before snapshot. |
+| pricing_page_views | integer | Cumulative pricing page views across all sessions before snapshot. |
+| demo_page_views | integer | Cumulative demo page views across all sessions before snapshot. |
+| total_session_duration_seconds | integer | Sum of session durations (seconds) before snapshot. |
+| touches_week_1 | integer | Number of touches in the first 7 days after lead creation. |
+| touches_last_7_days | integer | Number of touches in the last 7 days before snapshot cutoff. |
+| days_since_first_touch | number | Days between first touch and snapshot cutoff (NaN if no touches). |
+| activity_count | integer | Number of sales activities logged before snapshot. |
+| days_since_last_touch | number | Days elapsed between most recent touch and snapshot cutoff. |
+| opportunity_created | boolean | Whether any opportunity was created by snapshot date (open or closed). |
+| has_open_opportunity | boolean | Whether an open opportunity existed at snapshot date. |
+| opportunity_estimated_acv | number | Estimated ACV of the most recent open opportunity (NaN if none). |
+| expected_acv | number | Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available). |
+| total_touches_all | integer | Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only. |
+| converted_within_90_days | boolean | Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events. |
+
+
+ advanced/tasks/converted_within_90_days/train.parquet (32 columns)
+| Column | Type | Description |
+|---|---|---|
+| account_id | string | |
+| industry | string | |
+| region | string | |
+| employee_band | string | |
+| estimated_revenue_band | string | |
+| process_maturity_band | string | |
+| contact_id | string | |
+| role_function | string | |
+| seniority | string | |
+| buyer_role | string | |
+| lead_id | string | |
+| lead_created_at | string | |
+| lead_source | string | |
+| first_touch_channel | string | |
+| touch_count | number | |
+| inbound_touch_count | number | |
+| outbound_touch_count | number | |
+| session_count | number | |
+| pricing_page_views | number | |
+| demo_page_views | number | |
+| total_session_duration_seconds | number | |
+| touches_week_1 | number | |
+| touches_last_7_days | number | |
+| days_since_first_touch | number | |
+| activity_count | number | |
+| days_since_last_touch | number | |
+| opportunity_created | boolean | |
+| has_open_opportunity | boolean | |
+| opportunity_estimated_acv | number | |
+| expected_acv | number | |
+| total_touches_all | number | |
+| converted_within_90_days | boolean | |
+
+
+ advanced/tasks/converted_within_90_days/valid.parquet (32 columns)
+| Column | Type | Description |
+|---|---|---|
+| account_id | string | |
+| industry | string | |
+| region | string | |
+| employee_band | string | |
+| estimated_revenue_band | string | |
+| process_maturity_band | string | |
+| contact_id | string | |
+| role_function | string | |
+| seniority | string | |
+| buyer_role | string | |
+| lead_id | string | |
+| lead_created_at | string | |
+| lead_source | string | |
+| first_touch_channel | string | |
+| touch_count | number | |
+| inbound_touch_count | number | |
+| outbound_touch_count | number | |
+| session_count | number | |
+| pricing_page_views | number | |
+| demo_page_views | number | |
+| total_session_duration_seconds | number | |
+| touches_week_1 | number | |
+| touches_last_7_days | number | |
+| days_since_first_touch | number | |
+| activity_count | number | |
+| days_since_last_touch | number | |
+| opportunity_created | boolean | |
+| has_open_opportunity | boolean | |
+| opportunity_estimated_acv | number | |
+| expected_acv | number | |
+| total_touches_all | number | |
+| converted_within_90_days | boolean | |
+
+
+ advanced/tasks/converted_within_90_days/test.parquet (32 columns)
+| Column | Type | Description |
+|---|---|---|
+| account_id | string | |
+| industry | string | |
+| region | string | |
+| employee_band | string | |
+| estimated_revenue_band | string | |
+| process_maturity_band | string | |
+| contact_id | string | |
+| role_function | string | |
+| seniority | string | |
+| buyer_role | string | |
+| lead_id | string | |
+| lead_created_at | string | |
+| lead_source | string | |
+| first_touch_channel | string | |
+| touch_count | number | |
+| inbound_touch_count | number | |
+| outbound_touch_count | number | |
+| session_count | number | |
+| pricing_page_views | number | |
+| demo_page_views | number | |
+| total_session_duration_seconds | number | |
+| touches_week_1 | number | |
+| touches_last_7_days | number | |
+| days_since_first_touch | number | |
+| activity_count | number | |
+| days_since_last_touch | number | |
+| opportunity_created | boolean | |
+| has_open_opportunity | boolean | |
+| opportunity_estimated_acv | number | |
+| expected_acv | number | |
+| total_touches_all | number | |
+| converted_within_90_days | boolean | |
+
+
+ advanced/tables/accounts.parquet (8 columns)
+| Column | Type | Description |
+|---|---|---|
+| account_id | string | |
+| company_name | string | |
+| industry | string | |
+| region | string | |
+| employee_band | string | |
+| estimated_revenue_band | string | |
+| process_maturity_band | string | |
+| created_at | string | |
+
+
+ advanced/tables/contacts.parquet (8 columns)
+| Column | Type | Description |
+|---|---|---|
+| contact_id | string | |
+| account_id | string | |
+| job_title | string | |
+| role_function | string | |
+| seniority | string | |
+| buyer_role | string | |
+| email_domain_type | string | |
+| created_at | string | |
+
+
+ advanced/tables/leads.parquet (7 columns)
+| Column | Type | Description |
+|---|---|---|
+| lead_id | string | |
+| contact_id | string | |
+| account_id | string | |
+| lead_created_at | string | |
+| lead_source | string | |
+| first_touch_channel | string | |
+| owner_rep_id | string | |
+
+
+ advanced/tables/touches.parquet (7 columns)
+| Column | Type | Description |
+|---|---|---|
+| touch_id | string | |
+| lead_id | string | |
+| touch_timestamp | string | |
+| touch_type | string | |
+| touch_channel | string | |
+| touch_direction | string | |
+| campaign_id | string | |
+
+
+ advanced/tables/sessions.parquet (8 columns)
+| Column | Type | Description |
+|---|---|---|
+| session_id | string | |
+| lead_id | string | |
+| session_timestamp | string | |
+| session_type | string | |
+| page_views | integer | |
+| pricing_page_views | integer | |
+| demo_page_views | integer | |
+| session_duration_seconds | integer | |
+
+
+ advanced/tables/sales_activities.parquet (6 columns)
+| Column | Type | Description |
+|---|---|---|
+| activity_id | string | |
+| lead_id | string | |
+| rep_id | string | |
+| activity_timestamp | string | |
+| activity_type | string | |
+| activity_outcome | string | |
+
+
+ advanced/tables/opportunities.parquet (5 columns)
+| Column | Type | Description |
+|---|---|---|
+| opportunity_id | string | |
+| lead_id | string | |
+| created_at | string | |
+| stage | string | |
+| estimated_acv | integer | |
+
+
+
+

Sources

+ +
+
+ + + +
+
+
+
diff --git a/scripts/preview_hf_page.py b/scripts/preview_hf_page.py
new file mode 100644
index 0000000..51dcebd
--- /dev/null
+++ b/scripts/preview_hf_page.py
@@ -0,0 +1,572 @@
+#!/usr/bin/env python3
+"""Render an offline mock of the Hugging Face dataset page.
+
+PR 7.2 — middle PR in Phase 7 (LLM critique + publish). Reads the
+artefact the publish PR will upload (``release/huggingface/README.md``
+or ``release/huggingface-instructor/README.md``) and renders an HTML
+page that mimics the public HF dataset view: header (pretty_name +
+licence + size pill), tag chips, configs dropdown, file tree, the
+README body, and a footer with sources.
+
+Same rationale as ``preview_kaggle_page.py`` — cached previews on
+the live HF page are expensive to roll back, so the publish runbook
+in PR 7.3 cites this script as a required pre-flight.
+
+The rendered HTML is a deterministic function of the input README
+(no ``now()``, no random) — same input → byte-identical HTML. The
+committed samples at
+``release/_preview_committed/huggingface_{public,instructor}.html``
+are the audit-artefact-sync gate.
+
+Usage::
+
+    # Public variant on http://localhost:8766.
+    python scripts/preview_hf_page.py --open-browser
+
+    # Instructor companion variant (separate input README).
+    python scripts/preview_hf_page.py --variant=instructor
+
+    # Just build the HTML (CI / inspection).
+    python scripts/preview_hf_page.py --no-serve
+
+Exit codes: 0 success / 2 pre-flight error (missing README,
+malformed YAML frontmatter, missing cover image).
+"""
+
+from __future__ import annotations
+
+import argparse
+import http.server
+import re
+import socketserver
+import sys
+import webbrowser
+from collections.abc import Sequence
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Final
+
+import yaml
+
+# Make ``scripts/`` importable regardless of how this file is loaded.
+sys.path.insert(0, str(Path(__file__).resolve().parent))
+
+from _release_common import replace_file  # noqa: E402 — must follow sys.path insert
+
+# ---------------------------------------------------------------------------
+# Defaults
+# ---------------------------------------------------------------------------
+
+DEFAULT_RELEASE_DIR: Final[Path] = Path("release")
+DEFAULT_OUT_DIR_PUBLIC: Final[Path] = Path("release/_preview/huggingface")
+DEFAULT_OUT_DIR_INSTRUCTOR: Final[Path] = Path("release/_preview/huggingface-instructor")
+DEFAULT_PORT: Final[int] = 8766
+
+#: Per-variant relative paths to the README (under ``release_dir``)
+#: and the committed sample HTML (under ``release/_preview_committed/``).
+_VARIANT_README_REL: Final[dict[str, Path]] = {
+    "public": Path("huggingface/README.md"),
+    "instructor": Path("huggingface-instructor/README.md"),
+}
+_VARIANT_SAMPLE_PATH: Final[dict[str, Path]] = {
+    "public": Path("release/_preview_committed/huggingface_public.html"),
+    "instructor": Path("release/_preview_committed/huggingface_instructor.html"),
+}
+VALID_VARIANTS: Final[tuple[str, ...]] = ("public", "instructor")
+
+
+# ---------------------------------------------------------------------------
+# Markdown rendering (gated behind the [publish] extra)
+# ---------------------------------------------------------------------------
+
+
+def _render_markdown(text: str) -> str:
+    """Render ``text`` to HTML using markdown-it-py in GFM-like mode.
+
+    Same posture + dep gating as the Kaggle preview (markdown-it-py
+    via the ``[publish]`` extra; ``linkify`` disabled so the
+    transitive ``linkify-it-py`` dep is not required). See
+    ``preview_kaggle_page.py`` for the rationale.
+    """
+
+    try:
+        from markdown_it import MarkdownIt
+    except ImportError as exc:  # pragma: no cover — gated by extra
+        raise ImportError(
+            "markdown-it-py is required for the Hugging Face preview page. "
+            "Install the publish extra: pip install -e '.[publish]'"
+        ) from exc
+    md = MarkdownIt("gfm-like").disable("linkify")
+    return md.render(text)
+
+
+# ---------------------------------------------------------------------------
+# Frontmatter parsing
+# ---------------------------------------------------------------------------
+
+#: HF dataset cards open with a ``---`` block of YAML, then the body.
+#: This regex pulls them apart in one shot; ``re.DOTALL`` is essential
+#: because the YAML spans multiple lines.
+_FRONTMATTER_RE: Final[re.Pattern[str]] = re.compile(
+    r"\A---\n(?P<yaml>.*?)\n---\n(?P<body>.*)\Z",
+    re.DOTALL,
+)
+
+
+@dataclass(frozen=True)
+class HuggingFaceDoc:
+    """Parsed HF README — frontmatter dict + body markdown."""
+
+    frontmatter: dict[str, Any]
+    body: str
+
+
+def parse_hf_readme(text: str) -> HuggingFaceDoc:
+    """Split an HF README into YAML frontmatter + Markdown body.
+
+    Raises ``ValueError`` if the document does not open with a
+    ``---``-delimited frontmatter block (every HF dataset card MUST
+    have one — the renderer cannot mock the page without it).
+    """
+
+    match = _FRONTMATTER_RE.match(text)
+    if not match:
+        raise ValueError(
+            "HF README is missing a YAML frontmatter block (expected '---\\n\\n---\\n')"
+        )
+    parsed = yaml.safe_load(match.group("yaml")) or {}
+    if not isinstance(parsed, dict):
+        raise ValueError(
+            f"HF README frontmatter is not a YAML mapping (got {type(parsed).__name__})"
+        )
+    return HuggingFaceDoc(frontmatter=parsed, body=match.group("body"))
+
+
+# ---------------------------------------------------------------------------
+# Section renderers — pure, deterministic
+# ---------------------------------------------------------------------------
+
+
+def _escape(value: str) -> str:
+    """HTML-escape a single attribute / text value."""
+
+    return (
+        str(value)
+        .replace("&", "&amp;")
+        .replace("<", "&lt;")
+        .replace(">", "&gt;")
+        .replace('"', "&quot;")
+        .replace("'", "&#x27;")
+    )
+
+
+def _render_header(frontmatter: dict[str, Any]) -> str:
+    """Render the page header — pretty_name, licence pill, sizes."""
+
+    pretty_name = _escape(str(frontmatter.get("pretty_name", "")))
+    license_id = _escape(str(frontmatter.get("license", "")))
+    languages = ", ".join(_escape(str(x)) for x in frontmatter.get("language", []) or [])
+    sizes = ", ".join(_escape(str(x)) for x in frontmatter.get("size_categories", []) or [])
+    tasks = ", ".join(_escape(str(x)) for x in frontmatter.get("task_categories", []) or [])
+    return f"""
+
huggingface.co/datasets
+

{pretty_name}

+
    +
  • License: {license_id}
  • +
  • Task: {tasks}
  • +
  • Size: {sizes}
  • +
  • Language: {languages}
  • +
+
""" + + +def _render_tags(frontmatter: dict[str, Any]) -> str: + """Render the tag chip row (mimics HF tag pills under the header).""" + + tags = frontmatter.get("tags", []) or [] + if not tags: + return "" + chips = " ".join(f'{_escape(str(t))}' for t in tags) + return f'
\n {chips}\n
' + + +def _render_configs(frontmatter: dict[str, Any]) -> str: + """Render the configs dropdown — one entry per ``configs[]`` block. + + Mirrors HF's "Subset" selector at the top of the dataset viewer. + Each config lists its data_files (split → path) so the test can + assert every config block from the YAML round-trips through to + the rendered page. The default config is flagged. + """ + + configs = frontmatter.get("configs", []) or [] + if not configs: + return '

No configs declared.

' + blocks: list[str] = [] + for config in configs: + config_name = _escape(str(config.get("config_name", ""))) + is_default = bool(config.get("default")) + default_badge = ' default' if is_default else "" + data_files = config.get("data_files", []) or [] + rows = "\n".join( + f" {_escape(str(df.get('split', '')))}" + f"{_escape(str(df.get('path', '')))}" + for df in data_files + ) + blocks.append( + f'
\n' + f' {config_name}{default_badge} ' + f'({len(data_files)} splits)' + f"\n" + f' \n' + f" \n" + f" \n{rows}\n \n" + f"
SplitPath
\n" + f"
" + ) + return f"""
+

Configurations / Subsets ({len(configs)} configs)

+{chr(10).join(blocks)} +
""" + + +def _render_file_tree(frontmatter: dict[str, Any], variant: str) -> str: + """Render the file tree. + + HF doesn't ship a structured file inventory in the dataset card + YAML the way Kaggle does — ``data_files`` are the only paths + declared in the frontmatter. We list each declared path under + its config heading. The tree is therefore narrower than the + real dataset (which also has ``manifest.json``, ``tables/``, etc.) + but matches what the YAML knows about, which is what the publish + runbook is trying to verify. + """ + + configs = frontmatter.get("configs", []) or [] + paths: list[tuple[str, str]] = [] + for config in configs: + config_name = str(config.get("config_name", "")) + for df in config.get("data_files", []) or []: + paths.append((config_name, str(df.get("path", "")))) + if not paths: + return "" + items = "\n".join( + f'
  • [{_escape(c)}] ' + f'{_escape(p)}
  • ' + for c, p in paths + ) + return f"""
    +

    Files declared in YAML ({len(paths)} files / variant: {_escape(variant)})

    +
      +{items} +
    +
    """ + + +def _render_readme_body(body_md: str) -> str: + """Render the README body (everything after the YAML).""" + + return f'
    \n{_render_markdown(body_md)}
    ' + + +def _render_footer(frontmatter: dict[str, Any], variant: str) -> str: + """Render the licence + variant note footer.""" + + license_id = _escape(str(frontmatter.get("license", ""))) + return f"""
    + + + +
    """ + + +# --------------------------------------------------------------------------- +# HTML wrapper + minimal HF-ish CSS +# --------------------------------------------------------------------------- + +#: Inlined for the same reasons as the Kaggle preview — single +#: self-contained file, simple byte-comparison in the audit-sync test, +#: works without the server. +_PAGE_CSS: Final[str] = """\ +:root { --bg:#fff; --fg:#1f2937; --muted:#6b7280; --accent:#ff9d00; --border:#e5e7eb; --pill-bg:#f3f4f6; --code-bg:#f9fafb; } +body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, sans-serif; color: var(--fg); background: var(--bg); margin: 0; padding: 0; line-height: 1.6; } +.container { max-width: 1100px; margin: 0 auto; padding: 24px 32px; } +.dataset-header { border-bottom: 1px solid var(--border); padding-bottom: 16px; margin-bottom: 24px; } +.dataset-header__namespace { color: var(--muted); font-size: 0.85em; font-family: monospace; margin-bottom: 4px; } +.dataset-header__title { font-size: 1.8em; margin: 0 0 12px 0; } +.dataset-header__pills { list-style: none; padding: 0; margin: 0; display: flex; flex-wrap: wrap; gap: 8px; } +.pill { background: var(--pill-bg); border-radius: 12px; padding: 4px 12px; font-size: 0.85em; color: var(--fg); } +.tags { margin: 0 0 24px 0; } +.chip { display: inline-block; background: var(--pill-bg); border-radius: 12px; padding: 2px 10px; margin: 2px 4px 2px 0; font-size: 0.85em; color: var(--fg); } +.section__heading { font-size: 1.3em; border-bottom: 2px solid var(--accent); padding-bottom: 4px; margin-top: 32px; } +.section__count { color: var(--muted); font-size: 0.7em; font-weight: normal; } +.config, .file-tree { border: 1px solid var(--border); border-radius: 4px; padding: 8px 12px; margin: 8px 0; } +.config__name { cursor: pointer; font-weight: 600; } +.config__count { color: var(--muted); font-weight: normal; font-size: 0.85em; } +.badge { display: inline-block; padding: 1px 8px; 
border-radius: 4px; font-size: 0.75em; font-weight: 600; vertical-align: middle; margin-left: 4px; } +.badge--default { background: var(--accent); color: white; } +.config__table { width: 100%; border-collapse: collapse; margin-top: 8px; font-size: 0.9em; } +.config__table th, .config__table td { text-align: left; padding: 6px 8px; border-bottom: 1px solid var(--border); } +.config__table th { background: var(--pill-bg); font-weight: 600; } +.files__list { list-style: none; padding-left: 0; margin: 0; } +.file { padding: 4px 0; border-bottom: 1px dotted var(--border); } +.file:last-child { border-bottom: none; } +.file__config { color: var(--muted); font-size: 0.85em; margin-right: 8px; } +.file__path { color: var(--accent); } +.readme { margin: 24px 0; } +.readme code { background: var(--code-bg); padding: 1px 4px; border-radius: 2px; font-size: 0.9em; } +.readme pre { background: var(--code-bg); padding: 12px; border-radius: 4px; overflow-x: auto; } +.readme pre code { background: none; padding: 0; } +.readme table { border-collapse: collapse; margin: 12px 0; } +.readme th, .readme td { border: 1px solid var(--border); padding: 6px 10px; text-align: left; } +.readme blockquote { border-left: 3px solid var(--accent); padding-left: 12px; color: var(--muted); margin: 12px 0; } +.dataset-footer { margin-top: 48px; padding-top: 16px; border-top: 1px solid var(--border); color: var(--muted); font-size: 0.9em; } +.dataset-footer__note { font-style: italic; margin-top: 8px; } +""" + + +def _wrap_html(*, title: str, body: str) -> str: + """Wrap the rendered sections in page chrome. + + Order: header → tags → configs → files → readme body → footer. + Configs sit above the README because that's the primary affordance + on the live HF dataset page (the user picks a subset before + reading the body). + """ + + return f""" + + + + HF preview — {_escape(title)} + + + +
    +{body} +
+
+
+"""
+
+
+# ---------------------------------------------------------------------------
+# Top-level renderer
+# ---------------------------------------------------------------------------
+
+
+def render_hf_html(doc: HuggingFaceDoc, *, variant: str) -> str:
+    """Render the full HF preview HTML.
+
+    Pure function: same ``(doc, variant)`` → byte-identical HTML.
+    No I/O, no clock, no random. Tests rely on this for the
+    audit-artefact-sync gate.
+    """
+
+    body_parts = [
+        _render_header(doc.frontmatter),
+        _render_tags(doc.frontmatter),
+        _render_configs(doc.frontmatter),
+        _render_file_tree(doc.frontmatter, variant=variant),
+        _render_readme_body(doc.body),
+        _render_footer(doc.frontmatter, variant=variant),
+    ]
+    return _wrap_html(
+        title=str(doc.frontmatter.get("pretty_name", "")),
+        body="\n".join(p for p in body_parts if p),
+    )
+
+
+# ---------------------------------------------------------------------------
+# Driver — reads inputs, writes HTML, optionally serves
+# ---------------------------------------------------------------------------
+
+
+@dataclass(frozen=True)
+class PreviewConfig:
+    """Frozen driver config — built from CLI args or test input."""
+
+    release_dir: Path
+    out_dir: Path
+    port: int
+    variant: str
+    open_browser: bool
+    serve: bool
+
+
+@dataclass(frozen=True)
+class PreviewOutcome:
+    """Return value from :func:`run_preview` — used by tests + CLI."""
+
+    html_path: Path
+    cover_path: Path | None
+
+
+def _resolve_cover_image(release_dir: Path, variant: str) -> Path:
+    """Locate the cover image for the variant.
+
+    The HF packager (PR 5.2) copies the cover image into both
+    ``release/huggingface/`` and ``release/huggingface-instructor/``
+    next to each README. Prefer the variant-tree copy (closest to
+    the artefact the publish PR will upload); fall back to
+    ``release_dir`` for the case where the assembler hasn't been
+    run yet.
+    """
+
+    variant_dir = "huggingface" if variant == "public" else "huggingface-instructor"
+    candidates = [
+        release_dir / variant_dir / "dataset-cover-image.png",
+        release_dir / "dataset-cover-image.png",
+    ]
+    for candidate in candidates:
+        if candidate.is_file():
+            return candidate
+    return candidates[0]
+
+
+def run_preview(config: PreviewConfig) -> PreviewOutcome:
+    """Render the preview HTML, optionally serve it.
+
+    Pre-flight failures (missing README, malformed YAML, missing
+    cover image, unknown variant) raise — the CLI converts to rc=2.
+    """
+
+    if config.variant not in VALID_VARIANTS:
+        raise ValueError(f"unknown --variant {config.variant!r}; expected one of {VALID_VARIANTS}")
+
+    readme_path = config.release_dir / _VARIANT_README_REL[config.variant]
+    if not readme_path.is_file():
+        raise FileNotFoundError(
+            f"HF README not found at {readme_path}; "
+            f"regenerate via scripts/package_hf_release.py "
+            f"--variant={config.variant} first"
+        )
+    doc = parse_hf_readme(readme_path.read_text(encoding="utf-8"))
+
+    cover_src = _resolve_cover_image(config.release_dir, config.variant)
+    if not cover_src.is_file():
+        raise FileNotFoundError(
+            f"cover image not found at {cover_src} (looked in "
+            f"{config.release_dir}/huggingface{'-instructor' if config.variant == 'instructor' else ''}/ "
+            f"and {config.release_dir}/)"
+        )
+
+    config.out_dir.mkdir(parents=True, exist_ok=True)
+    html_path = config.out_dir / "index.html"
+    html_path.write_text(render_hf_html(doc, variant=config.variant), encoding="utf-8")
+
+    cover_dst = config.out_dir / "dataset-cover-image.png"
+    replace_file(cover_src, cover_dst)
+
+    if config.serve:
+        _serve(config.out_dir, config.port, open_browser=config.open_browser)
+
+    return PreviewOutcome(html_path=html_path, cover_path=cover_dst)
+
+
+def _serve(directory: Path, port: int, *, open_browser: bool) -> None:
+    """Start a stdlib HTTP server rooted at ``directory`` and block.
+
+    Same posture as the Kaggle preview — see that module for rationale.
+    """
+
+    handler_factory = _make_handler_factory(directory)
+    url = f"http://localhost:{port}/"
+    print(f"serving {directory} at {url} — Ctrl-C to stop", file=sys.stderr)
+    if open_browser:
+        webbrowser.open(url)
+    with socketserver.ThreadingTCPServer(("", port), handler_factory) as httpd:
+        httpd.serve_forever()
+
+
+def _make_handler_factory(directory: Path) -> type[http.server.SimpleHTTPRequestHandler]:
+    resolved = str(directory.resolve())
+
+    class _Handler(http.server.SimpleHTTPRequestHandler):
+        def __init__(self, *args: Any, **kwargs: Any) -> None:
+            super().__init__(*args, directory=resolved, **kwargs)
+
+    return _Handler
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+
+def parse_args(argv: Sequence[str] | None = None) -> argparse.Namespace:
+    """Parse the CLI. Free function so tests can build a Namespace."""
+
+    parser = argparse.ArgumentParser(
+        prog="preview_hf_page",
+        description=__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    parser.add_argument(
+        "--release-dir",
+        type=Path,
+        default=DEFAULT_RELEASE_DIR,
+        help="release tree containing huggingface[-instructor]/README.md (default: %(default)s)",
+    )
+    parser.add_argument(
+        "--out-dir",
+        type=Path,
+        default=None,
+        help=(
+            "where to write the rendered preview "
+            "(default: release/_preview/huggingface for variant=public, "
+            "release/_preview/huggingface-instructor for variant=instructor)"
+        ),
+    )
+    parser.add_argument(
+        "--port",
+        type=int,
+        default=DEFAULT_PORT,
+        help="port for the local HTTP server (default: %(default)s)",
+    )
+    parser.add_argument(
+        "--variant",
+        choices=VALID_VARIANTS,
+        default="public",
+        help="public (3-tier) or instructor (companion repo); default: %(default)s",
+    )
+    parser.add_argument(
+        "--open-browser",
+        action="store_true",
+        help="pop a browser tab on the served URL after the page renders",
+    )
+    parser.add_argument(
+        "--no-serve",
+        action="store_true",
+        help="render the HTML and exit; don't start the server (CI / inspection mode)",
+    )
+    return parser.parse_args(argv)
+
+
+def main(argv: Sequence[str] | None = None) -> int:
+    args = parse_args(argv)
+    out_dir: Path = args.out_dir or (
+        DEFAULT_OUT_DIR_PUBLIC if args.variant == "public" else DEFAULT_OUT_DIR_INSTRUCTOR
+    )
+    config = PreviewConfig(
+        release_dir=args.release_dir,
+        out_dir=out_dir,
+        port=args.port,
+        variant=args.variant,
+        open_browser=args.open_browser,
+        serve=not args.no_serve,
+    )
+    try:
+        outcome = run_preview(config)
+    except FileNotFoundError as exc:
+        print(f"error: {exc}", file=sys.stderr)
+        return 2
+    except ValueError as exc:
+        print(f"error: {exc}", file=sys.stderr)
+        return 2
+    print(f"wrote {outcome.html_path}", file=sys.stderr)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/preview_kaggle_page.py b/scripts/preview_kaggle_page.py
new file mode 100644
index 0000000..ddc4b73
--- /dev/null
+++ b/scripts/preview_kaggle_page.py
@@ -0,0 +1,608 @@
+#!/usr/bin/env python3
+"""Render an offline mock of the Kaggle dataset page.
+
+PR 7.2 — middle PR in Phase 7 (LLM critique + publish). Reads the
+artefacts the publish PR will upload (``release/kaggle/dataset-metadata.json``
++ ``release/dataset-cover-image.png``) and renders an HTML page that
+mimics the public Kaggle dataset view: header (title / subtitle /
+licence / id pill / update-frequency pill), cover image, rendered
+description (the inlined README body), file tree of declared
+resources, schema/columns tables for every tabular resource, and a
+licence + sources footer.
+
+The page exists for human click-through review BEFORE the maintainer
+runs the real ``kaggle datasets create`` upload (PR 7.3). Cached
+previews on the live page are expensive to roll back, so the
+publish runbook in PR 7.3 cites this script as a required pre-flight.
+
+The rendered HTML is a deterministic function of the input artefacts
+(no ``now()``, no random) — same metadata + cover-image filename →
+byte-identical HTML. The committed sample at
+``release/_preview_committed/kaggle.html`` is the audit-artefact-sync
+gate (mirrors PR 4.1 / 5.1 / 5.2 / 7.1).
+
+Usage::
+
+    # Render + serve on http://localhost:8765, pop a browser tab.
+    python scripts/preview_kaggle_page.py --open-browser
+
+    # Just build the HTML (CI / inspection); no server.
+    python scripts/preview_kaggle_page.py --no-serve
+
+Exit codes: 0 success / 2 pre-flight error (missing metadata,
+missing cover image, malformed JSON).
+"""
+
+from __future__ import annotations
+
+import argparse
+import http.server
+import json
+import re
+import socketserver
+import sys
+import webbrowser
+from collections.abc import Sequence
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Final
+
+# Make ``scripts/`` importable regardless of how this file is loaded
+# (CLI entrypoint, ``importlib.util.spec_from_file_location`` from
+# tests). Mirrors the pattern in ``package_kaggle_release.py``.
+sys.path.insert(0, str(Path(__file__).resolve().parent))
+
+from _release_common import replace_file  # noqa: E402 — must follow sys.path insert
+
+# ---------------------------------------------------------------------------
+# Defaults
+# ---------------------------------------------------------------------------
+
+DEFAULT_RELEASE_DIR: Final[Path] = Path("release")
+DEFAULT_OUT_DIR: Final[Path] = Path("release/_preview/kaggle")
+DEFAULT_PORT: Final[int] = 8765
+
+#: The committed sample HTML used by the audit-artefact-sync test.
+#: Located outside ``release/_preview/`` so it is not gitignored.
+COMMITTED_SAMPLE_PATH: Final[Path] = Path("release/_preview_committed/kaggle.html")
+
+
+# ---------------------------------------------------------------------------
+# Markdown rendering (gated behind the [publish] extra)
+# ---------------------------------------------------------------------------
+
+
+def _render_markdown(text: str) -> str:
+    """Render ``text`` (the inlined README body) to HTML.
+
+    Uses ``markdown-it-py`` in GFM-like mode (tables, fenced code,
+    autolink, strikethrough) — closest match to how Kaggle renders
+    its description block. The ``[publish]`` extra (alongside
+    ``datasets`` / ``kaggle``) is the install path; absent dep
+    raises a clear instruction rather than a cryptic ``ImportError``.
+    Footnotes (``[^foo]``) render as literal text, which is faithful
+    enough — Kaggle does not invest in footnote rendering either.
+    """
+
+    try:
+        from markdown_it import MarkdownIt
+    except ImportError as exc:  # pragma: no cover — gated by extra
+        raise ImportError(
+            "markdown-it-py is required for the Kaggle preview page. "
+            "Install the publish extra: pip install -e '.[publish]'"
+        ) from exc
+    # ``gfm-like`` enables linkify by default, which requires the
+    # separate ``linkify-it-py`` package; we explicitly turn it off so
+    # the preview does not pull a transitive dep beyond markdown-it-py.
+    # Tables / fenced code / strikethrough remain on (the bits that
+    # actually matter for faithful Kaggle/HF rendering).
+    md = MarkdownIt("gfm-like").disable("linkify")
+    return md.render(text)
+
+
+# ---------------------------------------------------------------------------
+# Tier inference + file tree
+# ---------------------------------------------------------------------------
+
+#: Kaggle's CLI emits resource paths like ``intro/lead_scoring.csv`` —
+#: the leading path segment is the tier name. We group resources by
+#: this segment so the rendered file tree mirrors the bundle layout
+#: the user will see on Kaggle.
+_TIER_PATH_RE: Final[re.Pattern[str]] = re.compile(r"^([^/]+)/")
+
+
+def _tier_of(resource_path: str) -> str:
+    """Return the leading path segment of ``resource_path``, or ``""``.
+
+    Used to bucket resources by tier in the file tree. An empty
+    string indicates a top-level resource (none of these are emitted
+    by the Kaggle packager today, but we tolerate them for forward
+    compatibility).
+    """
+
+    match = _TIER_PATH_RE.match(resource_path)
+    return match.group(1) if match else ""
+
+
+# ---------------------------------------------------------------------------
+# Section renderers — pure, deterministic
+# ---------------------------------------------------------------------------
+
+
+def _render_header(metadata: dict[str, Any]) -> str:
+    """Render the page header — title, subtitle, id pill, licence pill."""
+
+    title = _escape(metadata["title"])
+    subtitle = _escape(metadata["subtitle"])
+    dataset_id = _escape(metadata["id"])
+    license_name = _escape(metadata["licenses"][0]["name"]) if metadata.get("licenses") else ""
+    update_freq = _escape(metadata.get("expectedUpdateFrequency", ""))
+    visibility = "Private" if metadata.get("isPrivate") else "Public"
+
+    return f"""
    +
    {dataset_id}
    +

    {title}

    +

    {subtitle}

    +
      +
    • License: {license_name}
    • +
    • Updates: {update_freq}
    • +
    • Visibility: {visibility}
    • +
    +
    """ + + +def _render_cover(cover_image_filename: str) -> str: + """Render the cover-image block. + + The ``src`` is a sibling-relative path so the same HTML works + against both the runtime preview tree (where the image was copied + in) and the committed sample (used for byte-equality only — the + sample is not served). + """ + + src = _escape(cover_image_filename) + return f"""
    + Dataset cover image +
    """ + + +def _render_description(description_md: str) -> str: + """Render the inlined README body as HTML.""" + + body = _render_markdown(description_md) + return f'
    \n{body}
    ' + + +def _render_file_tree(resources: list[dict[str, Any]]) -> str: + """Render the file tree, grouped by tier (leading path segment). + + Inside each tier, files appear in declaration order — matches the + order Kaggle renders the resources column. Each entry is a + monospace path + the resource description. + """ + + by_tier: dict[str, list[dict[str, Any]]] = {} + for resource in resources: + tier = _tier_of(resource["path"]) + by_tier.setdefault(tier, []).append(resource) + + blocks: list[str] = [] + for tier, tier_resources in by_tier.items(): + tier_label = _escape(tier) if tier else "(top-level)" + items: list[str] = [] + for resource in tier_resources: + path = _escape(resource["path"]) + description = _escape(resource.get("description", "")) + items.append( + f'
  • {path}' + f'{description}
  • ' + ) + blocks.append( + f'
    \n' + f' {tier_label}/ ' + f'({len(tier_resources)} files)' + f"\n" + f'
      \n' + "\n".join(items) + "\n
    \n" + "
    " + ) + file_count = len(resources) + return f"""
    +

    Data Files ({file_count} total)

    +{chr(10).join(blocks)} +
    """ + + +def _render_schema_tables(resources: list[dict[str, Any]]) -> str: + """Render one schema/columns table per tabular resource. + + Mimics Kaggle's "Data Card" expandable per-file column listing. + Resources without a ``schema`` (markdown / JSON) are skipped — + same posture as Kaggle. Column count appears in the heading so + the test can assert the table is exhaustive without parsing the + DOM. + """ + + blocks: list[str] = [] + total_columns = 0 + for resource in resources: + schema = resource.get("schema") + if not schema: + continue + fields = schema.get("fields", []) + if not fields: + continue + total_columns += len(fields) + path = _escape(resource["path"]) + rows: list[str] = [] + for fd in fields: + name = _escape(fd.get("name", "")) + ftype = _escape(fd.get("type", "")) + description = _escape(fd.get("description", "")) + rows.append( + f" " + f'{name}' + f'{ftype}' + f'{description}' + f"" + ) + blocks.append( + f'
    \n' + f' {path} ' + f'({len(fields)} columns)' + f"\n" + f' \n' + f" \n" + f" \n" + "\n".join(rows) + "\n \n" + "
    ColumnTypeDescription
    \n" + "
    " + ) + return f"""
    +

    Schema / Columns ({total_columns} columns across {len(blocks)} tabular files)

    +{chr(10).join(blocks)} +
    """ + + +def _render_sources(metadata: dict[str, Any]) -> str: + """Render the user-specified sources block.""" + + sources = metadata.get("userSpecifiedSources", []) or [] + if not sources: + return "" + items = "\n".join( + f'
  • ' + f"{_escape(s['title'])}
  • " + for s in sources + ) + return f"""
    +

    Sources

    +
      +{items} +
    +
    """ + + +def _render_footer(metadata: dict[str, Any]) -> str: + """Render the licence + keywords footer.""" + + keywords = metadata.get("keywords", []) or [] + keyword_chips = " ".join(f'{_escape(k)}' for k in keywords) + license_name = _escape(metadata["licenses"][0]["name"]) if metadata.get("licenses") else "" + return f"""
    + + + +
    """ + + +# --------------------------------------------------------------------------- +# HTML wrapper + minimal Kaggle-ish CSS +# --------------------------------------------------------------------------- + +#: Kept inline rather than served as a separate ``style.css`` so the +#: rendered HTML is a single self-contained file — easier to inspect, +#: easier to byte-compare in the audit-artefact-sync test, and works +#: without a server (open the committed sample directly in a browser). +_PAGE_CSS: Final[str] = """\ +:root { --bg:#fff; --fg:#202124; --muted:#5f6368; --accent:#20beff; --border:#e0e0e0; --pill-bg:#f1f3f4; } +body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif; color: var(--fg); background: var(--bg); margin: 0; padding: 0; line-height: 1.5; } +.container { max-width: 1100px; margin: 0 auto; padding: 24px 32px; } +.dataset-header { border-bottom: 1px solid var(--border); padding-bottom: 16px; margin-bottom: 24px; } +.dataset-header__id { color: var(--muted); font-size: 0.85em; font-family: monospace; margin-bottom: 4px; } +.dataset-header__title { font-size: 1.8em; margin: 0 0 4px 0; } +.dataset-header__subtitle { color: var(--muted); margin: 0 0 12px 0; } +.dataset-header__pills { list-style: none; padding: 0; margin: 0; display: flex; flex-wrap: wrap; gap: 8px; } +.pill { background: var(--pill-bg); border-radius: 12px; padding: 4px 12px; font-size: 0.85em; color: var(--fg); } +.cover { margin: 0 0 24px 0; border: 1px solid var(--border); border-radius: 4px; overflow: hidden; } +.cover__image { display: block; max-width: 100%; height: auto; } +.section__heading { font-size: 1.3em; border-bottom: 2px solid var(--accent); padding-bottom: 4px; margin-top: 32px; } +.section__count { color: var(--muted); font-size: 0.7em; font-weight: normal; } +.tier, .schema { border: 1px solid var(--border); border-radius: 4px; padding: 8px 12px; margin: 8px 0; } +.tier__name, .schema__path { cursor: pointer; font-weight: 600; } 
+.tier__count, .schema__count { color: var(--muted); font-weight: normal; font-size: 0.85em; } +.tier__files { list-style: none; padding: 8px 0 0 0; margin: 0; } +.file { display: flex; gap: 12px; padding: 4px 0; border-bottom: 1px dotted var(--border); } +.file:last-child { border-bottom: none; } +.file__path { color: var(--accent); flex-shrink: 0; } +.file__desc { color: var(--muted); font-size: 0.9em; } +.schema__table { width: 100%; border-collapse: collapse; margin-top: 8px; font-size: 0.9em; } +.schema__table th, .schema__table td { text-align: left; padding: 6px 8px; border-bottom: 1px solid var(--border); vertical-align: top; } +.schema__table th { background: var(--pill-bg); font-weight: 600; } +.col__name code { background: none; } +.col__type { color: var(--muted); font-family: monospace; } +.description { margin: 24px 0; } +.description code { background: var(--pill-bg); padding: 1px 4px; border-radius: 2px; font-size: 0.9em; } +.description pre { background: var(--pill-bg); padding: 12px; border-radius: 4px; overflow-x: auto; } +.description pre code { background: none; padding: 0; } +.description table { border-collapse: collapse; margin: 12px 0; } +.description th, .description td { border: 1px solid var(--border); padding: 6px 10px; text-align: left; } +.description blockquote { border-left: 3px solid var(--accent); padding-left: 12px; color: var(--muted); margin: 12px 0; } +.sources__list { padding-left: 20px; } +.dataset-footer { margin-top: 48px; padding-top: 16px; border-top: 1px solid var(--border); color: var(--muted); font-size: 0.9em; } +.dataset-footer__keywords { margin-bottom: 8px; } +.chip { display: inline-block; background: var(--pill-bg); border-radius: 12px; padding: 2px 10px; margin: 2px; font-size: 0.85em; } +.dataset-footer__note { font-style: italic; margin-top: 8px; } +""" + + +def _wrap_html(*, title: str, body: str) -> str: + """Wrap rendered sections in the page chrome. 
+ + Order: header → cover → description → files → schemas → sources → + footer. Description sits above files because Kaggle leads with + the dataset card on the public page. + """ + + return f""" + + + + Kaggle preview — {_escape(title)} + + + +
    +{body} +
    + + +""" + + +def _escape(value: str) -> str: + """HTML-escape a single attribute / text value. + + Inlined rather than importing ``html.escape`` so the renderer's + surface stays small and the (well-tested) substitution is local + and obvious. + """ + + return ( + str(value) + .replace("&", "&amp;") + .replace("<", "&lt;") + .replace(">", "&gt;") + .replace('"', "&quot;") + .replace("'", "&#x27;") + ) + + +# --------------------------------------------------------------------------- +# Top-level renderer +# --------------------------------------------------------------------------- + + +def render_kaggle_html(metadata: dict[str, Any], cover_image_filename: str) -> str: + """Render the full Kaggle preview HTML. + + Pure function: same ``(metadata, cover_image_filename)`` → + byte-identical HTML. No I/O, no clock, no random. Tests rely + on this for the audit-artefact-sync gate. + """ + + body_parts = [ + _render_header(metadata), + _render_cover(cover_image_filename), + _render_description(metadata.get("description", "")), + _render_file_tree(metadata.get("resources", [])), + _render_schema_tables(metadata.get("resources", [])), + _render_sources(metadata), + _render_footer(metadata), + ] + return _wrap_html(title=metadata.get("title", ""), body="\n".join(p for p in body_parts if p)) + + +# --------------------------------------------------------------------------- +# Driver: reads inputs, writes HTML, optionally serves +# --------------------------------------------------------------------------- + + +@dataclass(frozen=True) +class PreviewConfig: + """Frozen driver config. + + Mirrors the ``DriverConfig`` posture in + ``scripts/run_llm_critique.py``; building this from CLI args + keeps the test surface a Python-level call rather than an exec. 
+ """ + + release_dir: Path + out_dir: Path + port: int + open_browser: bool + serve: bool + + +@dataclass(frozen=True) +class PreviewOutcome: + """Return value from :func:`run_preview` — used by tests + CLI.""" + + html_path: Path + cover_path: Path | None + + +def _resolve_cover_image(release_dir: Path, image_name: str) -> Path: + """Locate the cover image referenced by the metadata's ``image``. + + The Kaggle packager (PR 5.1) copies the cover image into + ``release/kaggle/`` next to ``dataset-metadata.json`` AND leaves + the master copy at ``release/dataset-cover-image.png``. We + prefer the kaggle-tree copy (closer to the artefact the publish + PR will upload) and fall back to ``release_dir`` for the + bare-basename case. Returning the resolved path here mirrors + ``_release_common.resolve_cover_image_path`` so the assembler and + inputs cannot disagree. + """ + + candidates = [ + release_dir / "kaggle" / image_name, + release_dir / image_name, + ] + for candidate in candidates: + if candidate.is_file(): + return candidate + return candidates[0] # surface the missing-file error against the canonical location + + +def run_preview(config: PreviewConfig) -> PreviewOutcome: + """Render the preview HTML, optionally serve it. + + Pre-flight failures (missing metadata, malformed JSON, missing + cover image) raise — the CLI converts to rc=2. Validation + discipline mirrors the Phase 5 packagers: build → validate → write. 
+ """ + + metadata_path = config.release_dir / "kaggle" / "dataset-metadata.json" + if not metadata_path.is_file(): + raise FileNotFoundError( + f"Kaggle dataset metadata not found at {metadata_path}; " + "regenerate via scripts/package_kaggle_release.py first" + ) + metadata = json.loads(metadata_path.read_text(encoding="utf-8")) + if not isinstance(metadata, dict): + raise ValueError(f"{metadata_path} is not a JSON object") + + cover_name = metadata.get("image", "") + if not cover_name: + raise ValueError(f"{metadata_path} declares no 'image' (cover image filename)") + cover_src = _resolve_cover_image(config.release_dir, cover_name) + if not cover_src.is_file(): + raise FileNotFoundError(f"cover image declared as {cover_name!r} not found at {cover_src}") + + config.out_dir.mkdir(parents=True, exist_ok=True) + html_path = config.out_dir / "index.html" + html_path.write_text(render_kaggle_html(metadata, cover_name), encoding="utf-8") + + cover_dst = config.out_dir / cover_name + replace_file(cover_src, cover_dst) + + if config.serve: + _serve(config.out_dir, config.port, open_browser=config.open_browser) + + return PreviewOutcome(html_path=html_path, cover_path=cover_dst) + + +def _serve(directory: Path, port: int, *, open_browser: bool) -> None: + """Start a stdlib HTTP server rooted at ``directory`` and block. + + Uses ``socketserver.ThreadingTCPServer`` so the maintainer's + browser can fetch the cover image alongside the HTML without + serialising requests. Blocks on ``serve_forever()``; + KeyboardInterrupt (Ctrl-C) is the documented exit path. No + coverage here: tests exercise the pure renderer and the + ``--no-serve`` path; serving is glue that requires a live socket. 
+ """ + + handler_factory = _make_handler_factory(directory) + url = f"http://localhost:{port}/" + print(f"serving {directory} at {url} — Ctrl-C to stop", file=sys.stderr) + if open_browser: + webbrowser.open(url) + with socketserver.ThreadingTCPServer(("", port), handler_factory) as httpd: + httpd.serve_forever() + + +def _make_handler_factory(directory: Path) -> type[http.server.SimpleHTTPRequestHandler]: + """Build a handler subclass that serves from ``directory``. + + ``SimpleHTTPRequestHandler`` ships a ``directory=`` kwarg in + Python 3.7+, but threading the path through ``socketserver``'s + ``RequestHandlerClass`` requires either a partial or a subclass. + Subclassing keeps the import surface stdlib-only. + """ + + resolved = str(directory.resolve()) + + class _Handler(http.server.SimpleHTTPRequestHandler): + def __init__(self, *args: Any, **kwargs: Any) -> None: + super().__init__(*args, directory=resolved, **kwargs) + + return _Handler + + +# --------------------------------------------------------------------------- +# CLI +# --------------------------------------------------------------------------- + + +def parse_args(argv: Sequence[str] | None = None) -> argparse.Namespace: + """Parse the CLI. 
Free function so tests can build a Namespace.""" + + parser = argparse.ArgumentParser( + prog="preview_kaggle_page", + description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter, + ) + parser.add_argument( + "--release-dir", + type=Path, + default=DEFAULT_RELEASE_DIR, + help="release tree containing kaggle/dataset-metadata.json (default: %(default)s)", + ) + parser.add_argument( + "--out-dir", + type=Path, + default=DEFAULT_OUT_DIR, + help="where to write the rendered preview (default: %(default)s)", + ) + parser.add_argument( + "--port", + type=int, + default=DEFAULT_PORT, + help="port for the local HTTP server (default: %(default)s)", + ) + parser.add_argument( + "--open-browser", + action="store_true", + help="pop a browser tab on the served URL after the page renders", + ) + parser.add_argument( + "--no-serve", + action="store_true", + help="render the HTML and exit; don't start the server (CI / inspection mode)", + ) + return parser.parse_args(argv) + + +def main(argv: Sequence[str] | None = None) -> int: + args = parse_args(argv) + config = PreviewConfig( + release_dir=args.release_dir, + out_dir=args.out_dir, + port=args.port, + open_browser=args.open_browser, + serve=not args.no_serve, + ) + try: + outcome = run_preview(config) + except FileNotFoundError as exc: + print(f"error: {exc}", file=sys.stderr) + return 2 + except ValueError as exc: + print(f"error: {exc}", file=sys.stderr) + return 2 + print(f"wrote {outcome.html_path}", file=sys.stderr) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/tests/scripts/test_preview_hf_page.py b/tests/scripts/test_preview_hf_page.py new file mode 100644 index 0000000..468ca1c --- /dev/null +++ b/tests/scripts/test_preview_hf_page.py @@ -0,0 +1,444 @@ +"""Tests for ``scripts/preview_hf_page.py`` (PR 7.2). 
+ +Locks the local Hugging Face preview-page contract: + +* required field labels appear in the rendered HTML (pretty_name, + licence, configs, tags) — the four roadmap-mandated HF checks; +* every Markdown link in the README body resolves to a non-404 URL + pattern (no ``](../`` survives, no ``](validation/...)``); +* every ``configs[]`` block in the YAML round-trips through to the + rendered configs dropdown; +* the renderer is byte-deterministic and the committed samples at + ``release/_preview_committed/huggingface_{public,instructor}.html`` + match a fresh regeneration (audit-artefact-sync gate); +* the ``--variant`` flag wires up the right input README, output + dir, and footer label; +* the driver exits with rc=2 on missing artefacts (no live HTTP). + +No network. No live HTTP. +""" + +from __future__ import annotations + +import importlib.util +import re +import sys +from pathlib import Path + +import pytest + +_REPO_ROOT = Path(__file__).resolve().parents[2] +_SCRIPT_PATH = _REPO_ROOT / "scripts" / "preview_hf_page.py" +_spec = importlib.util.spec_from_file_location("preview_hf_page", _SCRIPT_PATH) +assert _spec is not None +assert _spec.loader is not None +preview = importlib.util.module_from_spec(_spec) +sys.modules["preview_hf_page"] = preview +_spec.loader.exec_module(preview) + + +_RELEASE_DIR = _REPO_ROOT / "release" +_PUBLIC_README = _RELEASE_DIR / "huggingface" / "README.md" +_INSTRUCTOR_README = _RELEASE_DIR / "huggingface-instructor" / "README.md" +_PUBLIC_SAMPLE = _REPO_ROOT / "release" / "_preview_committed" / "huggingface_public.html" +_INSTRUCTOR_SAMPLE = _REPO_ROOT / "release" / "_preview_committed" / "huggingface_instructor.html" +_PUBLIC_PRESENT = _PUBLIC_README.exists() +_INSTRUCTOR_PRESENT = _INSTRUCTOR_README.exists() + +# Same allow-list rule as the Kaggle preview tests — see +# ``test_preview_kaggle_page.py`` for rationale. 
+_LINK_OK_PREFIXES = ( + "https://github.com/leadforge-dev/leadforge", + "https://huggingface.co/datasets/leadforge", + "https://example.com", + "LICENSE", + "#", +) + + +# --------------------------------------------------------------------------- +# Frontmatter parsing +# --------------------------------------------------------------------------- + + +def test_parse_hf_readme_extracts_yaml_and_body() -> None: + text = "---\npretty_name: Test\nlicense: mit\n---\n# Body\n\nText.\n" + doc = preview.parse_hf_readme(text) + assert doc.frontmatter == {"pretty_name": "Test", "license": "mit"} + assert doc.body == "# Body\n\nText.\n" + + +def test_parse_hf_readme_rejects_missing_frontmatter() -> None: + with pytest.raises(ValueError, match="missing a YAML frontmatter"): + preview.parse_hf_readme("# No frontmatter here\n") + + +def test_parse_hf_readme_rejects_non_mapping_frontmatter() -> None: + with pytest.raises(ValueError, match="not a YAML mapping"): + preview.parse_hf_readme("---\n- 1\n- 2\n---\nbody\n") + + +# --------------------------------------------------------------------------- +# Pure-renderer fixtures +# --------------------------------------------------------------------------- + + +def _minimal_doc() -> preview.HuggingFaceDoc: + """A minimum-viable HF doc exercising every renderer branch.""" + + return preview.HuggingFaceDoc( + frontmatter={ + "pretty_name": "TestSet: Mock HF Dataset", + "license": "mit", + "language": ["en"], + "task_categories": ["tabular-classification"], + "size_categories": ["1K None: + html = preview.render_hf_html(_minimal_doc(), variant="public") + assert "TestSet: Mock HF Dataset" in html + assert "License: mit" in html + assert "Task: tabular-classification" in html + assert "Size: 1K&lt;n&lt;10K" in html # HTML-escaped + assert "Language: en" in html + + +def test_render_emits_one_chip_per_tag() -> None: + html = preview.render_hf_html(_minimal_doc(), variant="public") + assert 'b2b' in html + assert 'tabular' in html + + +def 
test_render_configs_dropdown_lists_every_config() -> None: + """The roadmap-mandated round-trip: every configs[] block from the + YAML appears in the rendered dropdown.""" + + html = preview.render_hf_html(_minimal_doc(), variant="public") + assert "intro" in html + assert "intermediate" in html + assert "(2 configs)" in html + + +def test_render_configs_flags_the_default() -> None: + html = preview.render_hf_html(_minimal_doc(), variant="public") + # The default badge appears next to the default config. + assert 'default' in html + # Exactly one badge instance — no other config gets it. + assert html.count("badge badge--default") == 1 + + +def test_render_data_files_appear_under_each_config() -> None: + html = preview.render_hf_html(_minimal_doc(), variant="public") + assert "intro/train.parquet" in html + assert "intro/valid.parquet" in html + assert "intro/test.parquet" in html + assert "intermediate/train.parquet" in html + + +def test_render_includes_variant_in_footer() -> None: + public = preview.render_hf_html(_minimal_doc(), variant="public") + instructor = preview.render_hf_html(_minimal_doc(), variant="instructor") + assert "Variant: public" in public + assert "Variant: instructor" in instructor + # Variant differences are localised to the footer + file-tree + # heading; the rest of the output is identical. + public_no_variant = public.replace("public", "VARIANT") + instructor_no_variant = instructor.replace("instructor", "VARIANT") + assert public_no_variant == instructor_no_variant + + +def test_render_handles_no_configs_gracefully() -> None: + """Edge case: a malformed dataset card with no ``configs`` should + still render rather than crash.""" + + doc = preview.HuggingFaceDoc( + frontmatter={"pretty_name": "X", "license": "mit"}, + body="body\n", + ) + html = preview.render_hf_html(doc, variant="public") + assert "No configs declared." 
 in html + + +def test_render_escapes_html_in_field_values() -> None: + """Same XSS-safety guard as the Kaggle preview.""" + + doc = preview.HuggingFaceDoc( + frontmatter={"pretty_name": "<script>x</script>", "license": "mit"}, + body="body\n", + ) + html = preview.render_hf_html(doc, variant="public") + assert "<script>x</script>" not in html + assert "&lt;script&gt;x&lt;/script&gt;" in html + + +# --------------------------------------------------------------------------- +# Markdown link resolution (the leakage / link-rewrite regression guard) +# --------------------------------------------------------------------------- + +_HREF_RE = re.compile(r'href="([^"]+)"') + + +@pytest.mark.skipif(not _PUBLIC_PRESENT, reason="public README not present") +def test_public_readme_has_no_unrewritten_relative_links() -> None: + """Same source-side regression guard as the Kaggle preview.""" + + body = _PUBLIC_README.read_text(encoding="utf-8") + assert "](../" not in body, "unrewritten parent-relative link in public README" + assert "](validation/" not in body, "unrewritten validation-relative link in public README" + + +@pytest.mark.skipif(not _PUBLIC_PRESENT, reason="public README not present") +def test_public_rendered_links_point_at_known_targets() -> None: + """Every rendered href in the public preview points at one of the + allow-listed prefixes; anything else would 404 on the live HF + page.""" + + doc = preview.parse_hf_readme(_PUBLIC_README.read_text(encoding="utf-8")) + html = preview.render_hf_html(doc, variant="public") + bad: list[str] = [] + for href in _HREF_RE.findall(html): + if any(href.startswith(prefix) for prefix in _LINK_OK_PREFIXES): + continue + bad.append(href) + assert not bad, f"non-allowlisted hrefs would 404 on HF: {bad[:5]}" + + +@pytest.mark.skipif(not _INSTRUCTOR_PRESENT, reason="instructor README not present") +def test_instructor_rendered_links_point_at_known_targets() -> None: + doc = preview.parse_hf_readme(_INSTRUCTOR_README.read_text(encoding="utf-8")) + html = 
preview.render_hf_html(doc, variant="instructor") + bad: list[str] = [] + for href in _HREF_RE.findall(html): + if any(href.startswith(prefix) for prefix in _LINK_OK_PREFIXES): + continue + bad.append(href) + assert not bad, f"non-allowlisted hrefs would 404 on HF: {bad[:5]}" + + +@pytest.mark.skipif(not _PUBLIC_PRESENT, reason="public README not present") +def test_public_yaml_configs_round_trip_into_html() -> None: + """Every ``configs[].config_name`` declared in the YAML appears in + the rendered HTML — the round-trip the roadmap mandates.""" + + doc = preview.parse_hf_readme(_PUBLIC_README.read_text(encoding="utf-8")) + html = preview.render_hf_html(doc, variant="public") + for config in doc.frontmatter["configs"]: + name = config["config_name"] + assert f"{name}" in html, ( + f"config {name!r} declared in YAML but missing from rendered HTML" + ) + + +# --------------------------------------------------------------------------- +# Determinism + audit-artefact-sync (against committed samples) +# --------------------------------------------------------------------------- + + +@pytest.mark.skipif(not _PUBLIC_PRESENT, reason="public README not present") +def test_render_is_byte_deterministic() -> None: + doc = preview.parse_hf_readme(_PUBLIC_README.read_text(encoding="utf-8")) + a = preview.render_hf_html(doc, variant="public") + b = preview.render_hf_html(doc, variant="public") + assert a == b + + +@pytest.mark.skipif( + not (_PUBLIC_PRESENT and _PUBLIC_SAMPLE.exists()), + reason="public README or committed sample missing", +) +def test_committed_public_sample_matches_fresh_regeneration() -> None: + """Audit-sync gate for the public variant. 
+ + Regenerate via:: + + python scripts/preview_hf_page.py --no-serve + cp release/_preview/huggingface/index.html \\ + release/_preview_committed/huggingface_public.html + """ + + doc = preview.parse_hf_readme(_PUBLIC_README.read_text(encoding="utf-8")) + fresh = preview.render_hf_html(doc, variant="public") + committed = _PUBLIC_SAMPLE.read_text(encoding="utf-8") + assert fresh == committed + + +@pytest.mark.skipif( + not (_INSTRUCTOR_PRESENT and _INSTRUCTOR_SAMPLE.exists()), + reason="instructor README or committed sample missing", +) +def test_committed_instructor_sample_matches_fresh_regeneration() -> None: + """Audit-sync gate for the instructor variant.""" + + doc = preview.parse_hf_readme(_INSTRUCTOR_README.read_text(encoding="utf-8")) + fresh = preview.render_hf_html(doc, variant="instructor") + committed = _INSTRUCTOR_SAMPLE.read_text(encoding="utf-8") + assert fresh == committed + + +# --------------------------------------------------------------------------- +# Driver — pre-flight error paths (no server start) +# --------------------------------------------------------------------------- + + +def _make_config(release_dir: Path, out_dir: Path, *, variant: str = "public") -> object: + return preview.PreviewConfig( + release_dir=release_dir, + out_dir=out_dir, + port=8766, + variant=variant, + open_browser=False, + serve=False, + ) + + +def test_run_preview_raises_on_unknown_variant(tmp_path: Path) -> None: + fake_release = tmp_path / "release" + fake_release.mkdir() + config = _make_config(fake_release, tmp_path / "preview", variant="bogus") + with pytest.raises(ValueError, match="unknown --variant"): + preview.run_preview(config) # type: ignore[arg-type] + + +def test_run_preview_raises_on_missing_readme(tmp_path: Path) -> None: + fake_release = tmp_path / "release" + fake_release.mkdir() + config = _make_config(fake_release, tmp_path / "preview") + with pytest.raises(FileNotFoundError, match="HF README not found"): + preview.run_preview(config) # type: 
ignore[arg-type] + + +def test_run_preview_raises_on_malformed_readme(tmp_path: Path) -> None: + fake_release = tmp_path / "release" + (fake_release / "huggingface").mkdir(parents=True) + (fake_release / "huggingface" / "README.md").write_text("# No frontmatter\n", encoding="utf-8") + config = _make_config(fake_release, tmp_path / "preview") + with pytest.raises(ValueError, match="missing a YAML frontmatter"): + preview.run_preview(config) # type: ignore[arg-type] + + +def test_run_preview_raises_on_missing_cover(tmp_path: Path) -> None: + fake_release = tmp_path / "release" + (fake_release / "huggingface").mkdir(parents=True) + (fake_release / "huggingface" / "README.md").write_text( + "---\npretty_name: T\nlicense: mit\n---\nbody\n", encoding="utf-8" + ) + config = _make_config(fake_release, tmp_path / "preview") + with pytest.raises(FileNotFoundError, match="cover image"): + preview.run_preview(config) # type: ignore[arg-type] + + +def test_run_preview_writes_html_and_copies_cover(tmp_path: Path) -> None: + """End-to-end no-serve: HTML lands at out_dir/index.html and the + cover image is copied as a real file.""" + + fake_release = tmp_path / "release" + (fake_release / "huggingface").mkdir(parents=True) + (fake_release / "huggingface" / "README.md").write_text( + "---\npretty_name: T\nlicense: mit\n---\nbody\n", encoding="utf-8" + ) + cover = fake_release / "huggingface" / "dataset-cover-image.png" + cover.write_bytes(b"\x89PNG\r\n\x1a\nfake") + out_dir = tmp_path / "preview" + outcome = preview.run_preview(_make_config(fake_release, out_dir)) # type: ignore[arg-type] + assert outcome.html_path == out_dir / "index.html" + assert outcome.html_path.is_file() + assert outcome.cover_path is not None + assert outcome.cover_path.is_file() + assert not outcome.cover_path.is_symlink() + + +def test_run_preview_instructor_variant_uses_companion_paths(tmp_path: Path) -> None: + """``--variant=instructor`` reads the companion README and writes + to the companion-flavoured 
out_dir.""" + + fake_release = tmp_path / "release" + (fake_release / "huggingface-instructor").mkdir(parents=True) + (fake_release / "huggingface-instructor" / "README.md").write_text( + "---\npretty_name: I\nlicense: mit\n---\nbody\n", encoding="utf-8" + ) + cover = fake_release / "huggingface-instructor" / "dataset-cover-image.png" + cover.write_bytes(b"\x89PNG\r\n\x1a\nfake") + out_dir = tmp_path / "preview-instructor" + outcome = preview.run_preview( + _make_config(fake_release, out_dir, variant="instructor") # type: ignore[arg-type] + ) + assert outcome.html_path.is_file() + assert "Variant: instructor" in outcome.html_path.read_text(encoding="utf-8") + + +def test_main_returns_2_on_missing_release( + tmp_path: Path, capsys: pytest.CaptureFixture[str] +) -> None: + rc = preview.main( + [ + "--release-dir", + str(tmp_path / "missing"), + "--out-dir", + str(tmp_path / "preview"), + "--no-serve", + ] + ) + assert rc == 2 + captured = capsys.readouterr() + assert "HF README not found" in captured.err + + +def test_main_default_out_dir_depends_on_variant(tmp_path: Path) -> None: + """``--out-dir`` defaults to the variant-flavoured location.""" + + args_public = preview.parse_args(["--no-serve"]) + args_instructor = preview.parse_args(["--no-serve", "--variant=instructor"]) + assert args_public.out_dir is None # resolved in main() + assert args_instructor.out_dir is None + # Sanity: ``main`` resolves the default per variant. 
+ rc = preview.main( + [ + "--release-dir", + str(tmp_path / "missing"), + "--variant=instructor", + "--no-serve", + ] + ) + assert rc == 2 # missing README; we just want to confirm CLI parsing didn't crash + + +def test_parse_args_defaults() -> None: + args = preview.parse_args(["--no-serve"]) + assert args.release_dir == preview.DEFAULT_RELEASE_DIR + assert args.out_dir is None # variant-resolved in main() + assert args.port == preview.DEFAULT_PORT + assert args.variant == "public" + assert args.open_browser is False + assert args.no_serve is True + + +def test_parse_args_rejects_unknown_variant() -> None: + with pytest.raises(SystemExit): + preview.parse_args(["--variant=bogus"]) diff --git a/tests/scripts/test_preview_kaggle_page.py b/tests/scripts/test_preview_kaggle_page.py new file mode 100644 index 0000000..b654a5c --- /dev/null +++ b/tests/scripts/test_preview_kaggle_page.py @@ -0,0 +1,416 @@ +"""Tests for ``scripts/preview_kaggle_page.py`` (PR 7.2). + +Locks the local Kaggle preview-page contract: + +* required field labels appear in the rendered HTML (title, subtitle, + licence, file count, schema column count) — the four roadmap-mandated + Kaggle checks; +* every Markdown link in the inlined description resolves to a + non-404 URL pattern (no ``](../`` survives the rewrite, no + ``](validation/...)`` lives at a relative path on the upload tree); +* the Kaggle schema table lists every CSV / parquet column declared + in ``dataset-metadata.json::resources[].schema.fields``; +* the renderer is byte-deterministic and the committed sample at + ``release/_preview_committed/kaggle.html`` matches a fresh + regeneration (audit-artefact-sync gate, mirrors PR 5.1 / 5.2 / 7.1); +* the driver exits with rc=2 on missing artefacts (no live HTTP). + +No network. No live HTTP. Everything goes through the pure +``render_kaggle_html()`` or the in-process ``run_preview()`` driver. 
+""" + +from __future__ import annotations + +import importlib.util +import json +import re +import sys +from pathlib import Path + +import pytest + +_REPO_ROOT = Path(__file__).resolve().parents[2] +_SCRIPT_PATH = _REPO_ROOT / "scripts" / "preview_kaggle_page.py" +_spec = importlib.util.spec_from_file_location("preview_kaggle_page", _SCRIPT_PATH) +assert _spec is not None +assert _spec.loader is not None +preview = importlib.util.module_from_spec(_spec) +sys.modules["preview_kaggle_page"] = preview +_spec.loader.exec_module(preview) + + +_RELEASE_DIR = _REPO_ROOT / "release" +_COMMITTED_METADATA = _RELEASE_DIR / "kaggle" / "dataset-metadata.json" +_COMMITTED_COVER = _RELEASE_DIR / "dataset-cover-image.png" +_COMMITTED_SAMPLE = _REPO_ROOT / "release" / "_preview_committed" / "kaggle.html" +_RELEASE_PRESENT = _COMMITTED_METADATA.exists() + +# Allow-listed link patterns the audit-sync test accepts. Anything else +# in the rendered description is a regression — either the source +# README leaked a relative ``../`` link or the GitHub blob rewrite +# stopped firing. The whitelist is intentionally narrow. +_LINK_OK_PREFIXES = ( + "https://github.com/leadforge-dev/leadforge", + "https://huggingface.co/datasets/leadforge", + "https://example.com", # used by unit tests only + "LICENSE", # sibling-relative, resolves under the upload tree + "#", # in-document anchor (footnotes, etc.) 
+) + + +# --------------------------------------------------------------------------- +# Pure-renderer fixtures +# --------------------------------------------------------------------------- + + +def _minimal_metadata() -> dict[str, object]: + """A minimum-viable metadata payload exercising every renderer + branch (header pills, file tree, schema table, sources, footer).""" + + return { + "title": "TestSet: Lead Scoring Mock", + "id": "testorg/testset-lead-scoring", + "subtitle": "A mock metadata payload exercising the renderer.", + "description": ( + "# Mock dataset\n\n" + "This is a [test link](https://github.com/leadforge-dev/leadforge).\n\n" + "| Col | Notes |\n|---|---|\n| a | b |\n" + ), + "isPrivate": True, + "licenses": [{"name": "MIT"}], + "keywords": ["b2b", "tabular"], + "collaborators": [], + "expectedUpdateFrequency": "never", + "userSpecifiedSources": [ + {"title": "source repo", "url": "https://github.com/leadforge-dev/leadforge"}, + ], + "image": "dataset-cover-image.png", + "resources": [ + { + "path": "intro/lead_scoring.csv", + "description": "Intro flat CSV.", + "schema": { + "fields": [ + {"name": "lead_id", "type": "string", "description": "Opaque id."}, + {"name": "label", "type": "boolean", "description": "Outcome."}, + ] + }, + }, + { + "path": "intro/manifest.json", + "description": "Provenance manifest (no schema).", + }, + ], + } + + +# --------------------------------------------------------------------------- +# Required field labels (one of the four roadmap-mandated Kaggle checks) +# --------------------------------------------------------------------------- + + +def test_render_includes_title_subtitle_id_and_license() -> None: + html = preview.render_kaggle_html(_minimal_metadata(), "dataset-cover-image.png") + assert "TestSet: Lead Scoring Mock" in html + assert "A mock metadata payload exercising the renderer." 
in html + assert "testorg/testset-lead-scoring" in html + assert "License: MIT" in html + assert "Updates: never" in html + assert "Visibility: Private" in html + + +def test_render_includes_visibility_public_when_not_private() -> None: + metadata = {**_minimal_metadata(), "isPrivate": False} + html = preview.render_kaggle_html(metadata, "dataset-cover-image.png") + assert "Visibility: Public" in html + + +def test_render_file_tree_lists_every_resource_path() -> None: + """File tree shows every resource path declared in metadata.""" + + html = preview.render_kaggle_html(_minimal_metadata(), "dataset-cover-image.png") + assert "intro/lead_scoring.csv" in html + assert "intro/manifest.json" in html + assert "(2 total)" in html # file count appears in the heading + + +def test_render_schema_table_lists_every_column() -> None: + """The schema table lists every column from every tabular resource.""" + + html = preview.render_kaggle_html(_minimal_metadata(), "dataset-cover-image.png") + assert "lead_id" in html + assert "label" in html + assert "Opaque id." in html + assert "(2 columns)" in html # per-table column count + # Resources without a schema (manifest.json) do not appear in the table. + assert "(2 columns across 1 tabular files)" in html + + +def test_render_keywords_appear_as_chips_in_footer() -> None: + html = preview.render_kaggle_html(_minimal_metadata(), "dataset-cover-image.png") + assert 'b2b' in html + assert 'tabular' in html + + +def test_render_sources_block_renders_when_present() -> None: + html = preview.render_kaggle_html(_minimal_metadata(), "dataset-cover-image.png") + assert "source repo" in html + assert 'href="https://github.com/leadforge-dev/leadforge"' in html + + +def test_render_sources_block_omitted_when_empty() -> None: + metadata = {**_minimal_metadata(), "userSpecifiedSources": []} + html = preview.render_kaggle_html(metadata, "dataset-cover-image.png") + assert 'Sources' not in html + + +def test_render_escapes_html_in_field_values() -> None: + """User-controlled strings are HTML-escaped — guards against XSS + if a recipe ever surfaces ``<script>`` in a field value.""" + + metadata = {**_minimal_metadata(), "title": "<script>alert(1)</script>"} + html = preview.render_kaggle_html(metadata, "dataset-cover-image.png") + assert "<script>alert(1)</script>" not in html + assert "&lt;script&gt;" in html + + +# --------------------------------------------------------------------------- +# Schema-fields exhaustiveness (audit-style, against committed metadata) +# --------------------------------------------------------------------------- + + +@pytest.mark.skipif(not _RELEASE_PRESENT, reason="release bundles not present") +def test_committed_metadata_schema_is_fully_listed() -> None: + """The roadmap-mandated check: the Kaggle schema table lists every + CSV / parquet column declared in dataset-metadata.json.""" + + metadata = json.loads(_COMMITTED_METADATA.read_text(encoding="utf-8")) + html = preview.render_kaggle_html(metadata, metadata["image"]) + for resource in metadata["resources"]: + schema = resource.get("schema") + if not schema: + continue + for field in schema["fields"]: + name = field["name"] + # Every column name appears as a cell in the table. + assert f"{name}" in html, ( + f"schema column {name!r} from {resource['path']!r} not rendered" + ) + + +# --------------------------------------------------------------------------- +# Markdown link resolution (the leakage / link-rewrite regression guard) +# --------------------------------------------------------------------------- + +#: Match ``href="X"`` in the rendered HTML — markdown-it-py emits +#: double-quoted hrefs. Inline ``](X)`` would slip past this and stay +#: as escaped text rather than a real link, so we also assert against +#: those separately. +_HREF_RE = re.compile(r'href="([^"]+)"') + + +@pytest.mark.skipif(not _RELEASE_PRESENT, reason="release bundles not present") +def test_committed_metadata_description_has_no_unrewritten_relative_links() -> None: + """Source-side regression guard. 
+ + The Kaggle packager runs ``rewrite_release_links()`` on the + inlined README; if a future README adds a ``](../foo)`` link or a + ``](validation/...)`` link AND someone updates the rewriter to + miss it, the rendered description would carry a 404-bound href. + Catch it here, before the publish runbook. + """ + + metadata = json.loads(_COMMITTED_METADATA.read_text(encoding="utf-8")) + description = metadata["description"] + # Source-form check: no parent-relative or validation-relative + # markdown links remain in the inlined description. + assert "](../" not in description, ( + "unrewritten parent-relative markdown link in inlined description" + ) + assert "](validation/" not in description, ( + "unrewritten validation-relative markdown link in inlined description" + ) + + +@pytest.mark.skipif(not _RELEASE_PRESENT, reason="release bundles not present") +def test_committed_metadata_rendered_links_point_at_known_targets() -> None: + """Every rendered href in the description body points at one of: + + * a GitHub blob URL (the rewriter's output); + * a known external service (huggingface.co/datasets/leadforge); + * a sibling-relative path that resolves under the upload tree + (LICENSE), or an in-document anchor (#footnote-1 etc.). + + Anything else is a 404 risk on the live page. 
+ """ + + metadata = json.loads(_COMMITTED_METADATA.read_text(encoding="utf-8")) + html = preview.render_kaggle_html(metadata, metadata["image"]) + bad: list[str] = [] + for href in _HREF_RE.findall(html): + if any(href.startswith(prefix) for prefix in _LINK_OK_PREFIXES): + continue + bad.append(href) + assert not bad, ( + f"rendered HTML carries non-allowlisted hrefs that would 404 on Kaggle: {bad[:5]}" + ) + + +# --------------------------------------------------------------------------- +# Determinism + audit-artefact-sync (against committed sample) +# --------------------------------------------------------------------------- + + +@pytest.mark.skipif(not _RELEASE_PRESENT, reason="release bundles not present") +def test_render_is_byte_deterministic() -> None: + """Two back-to-back renders against the same metadata produce + byte-identical HTML — the determinism contract this script relies + on for the sync test below.""" + + metadata = json.loads(_COMMITTED_METADATA.read_text(encoding="utf-8")) + a = preview.render_kaggle_html(metadata, metadata["image"]) + b = preview.render_kaggle_html(metadata, metadata["image"]) + assert a == b + + +@pytest.mark.skipif( + not (_RELEASE_PRESENT and _COMMITTED_SAMPLE.exists()), + reason="release bundles or committed preview sample missing", +) +def test_committed_sample_matches_fresh_regeneration() -> None: + """The audit-artefact-sync gate. + + A fresh render of the committed Kaggle metadata must equal + ``release/_preview_committed/kaggle.html`` byte-for-byte. If + this fails, either the renderer changed or the upstream metadata + drifted without re-running the preview script. 
Regenerate via:: + + python scripts/preview_kaggle_page.py --no-serve + cp release/_preview/kaggle/index.html release/_preview_committed/kaggle.html + """ + + metadata = json.loads(_COMMITTED_METADATA.read_text(encoding="utf-8")) + fresh = preview.render_kaggle_html(metadata, metadata["image"]) + committed = _COMMITTED_SAMPLE.read_text(encoding="utf-8") + assert fresh == committed + + +# --------------------------------------------------------------------------- +# Driver — pre-flight error paths (no server start) +# --------------------------------------------------------------------------- + + +def test_run_preview_raises_on_missing_metadata(tmp_path: Path) -> None: + fake_release = tmp_path / "release" + fake_release.mkdir() + config = preview.PreviewConfig( + release_dir=fake_release, + out_dir=tmp_path / "preview", + port=8765, + open_browser=False, + serve=False, + ) + with pytest.raises(FileNotFoundError, match="dataset metadata not found"): + preview.run_preview(config) + + +def test_run_preview_raises_on_malformed_metadata(tmp_path: Path) -> None: + fake_release = tmp_path / "release" + (fake_release / "kaggle").mkdir(parents=True) + (fake_release / "kaggle" / "dataset-metadata.json").write_text( + '"not-an-object"', encoding="utf-8" + ) + config = preview.PreviewConfig( + release_dir=fake_release, + out_dir=tmp_path / "preview", + port=8765, + open_browser=False, + serve=False, + ) + with pytest.raises(ValueError, match="not a JSON object"): + preview.run_preview(config) + + +def test_run_preview_raises_on_missing_cover_image(tmp_path: Path) -> None: + fake_release = tmp_path / "release" + (fake_release / "kaggle").mkdir(parents=True) + (fake_release / "kaggle" / "dataset-metadata.json").write_text( + json.dumps({"image": "missing.png", "resources": []}), encoding="utf-8" + ) + config = preview.PreviewConfig( + release_dir=fake_release, + out_dir=tmp_path / "preview", + port=8765, + open_browser=False, + serve=False, + ) + with 
pytest.raises(FileNotFoundError, match="cover image"): + preview.run_preview(config) + + +def test_run_preview_writes_html_and_copies_cover(tmp_path: Path) -> None: + """End-to-end no-serve path: HTML lands at ``out_dir/index.html``; + cover image is copied as a real file (not a symlink).""" + + fake_release = tmp_path / "release" + (fake_release / "kaggle").mkdir(parents=True) + cover_src = fake_release / "kaggle" / "dataset-cover-image.png" + cover_src.write_bytes(b"\x89PNG\r\n\x1a\nfake") + (fake_release / "kaggle" / "dataset-metadata.json").write_text( + json.dumps(_minimal_metadata()), encoding="utf-8" + ) + out_dir = tmp_path / "preview" + outcome = preview.run_preview( + preview.PreviewConfig( + release_dir=fake_release, + out_dir=out_dir, + port=8765, + open_browser=False, + serve=False, + ) + ) + assert outcome.html_path == out_dir / "index.html" + assert outcome.html_path.is_file() + assert outcome.cover_path is not None + assert outcome.cover_path.is_file() + assert not outcome.cover_path.is_symlink() + # The HTML references the cover image by sibling-relative name. 
+ assert 'src="dataset-cover-image.png"' in outcome.html_path.read_text(encoding="utf-8") + + +def test_main_returns_2_on_missing_release( + tmp_path: Path, capsys: pytest.CaptureFixture[str] +) -> None: + rc = preview.main( + [ + "--release-dir", + str(tmp_path / "missing"), + "--out-dir", + str(tmp_path / "preview"), + "--no-serve", + ] + ) + assert rc == 2 + captured = capsys.readouterr() + assert "dataset metadata not found" in captured.err + + +def test_parse_args_defaults() -> None: + """``parse_args`` is a free function so tests can exercise the + flag wiring without invoking the full driver.""" + + args = preview.parse_args(["--no-serve"]) + assert args.release_dir == preview.DEFAULT_RELEASE_DIR + assert args.out_dir == preview.DEFAULT_OUT_DIR + assert args.port == preview.DEFAULT_PORT + assert args.open_browser is False + assert args.no_serve is True + + +def test_tier_of_extracts_leading_path_segment() -> None: + """``_tier_of`` is the load-bearing helper that buckets resources + by tier in the file tree — pin its contract.""" + + assert preview._tier_of("intro/lead_scoring.csv") == "intro" + assert preview._tier_of("intermediate/tasks/converted/train.parquet") == "intermediate" + assert preview._tier_of("toplevel.json") == "" From cd36121b6e09c183697b9488281e4518113ce4ce Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Sat, 9 May 2026 10:03:12 +0300 Subject: [PATCH 3/6] PR 7.2 self-review pass 1: socket reuse, dead code, doc accuracy MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three findings folded back from the hostile-reviewer first pass: - BUG: switch from socketserver.ThreadingTCPServer to http.server.ThreadingHTTPServer in both preview scripts. The former defaults to allow_reuse_address=False, so Ctrl-C → re-run within ~60s would raise OSError [Errno 48] Address already in use while the socket sat in TIME_WAIT. ThreadingHTTPServer inherits allow_reuse_address=True from HTTPServer. 
- DEAD CODE: drop COMMITTED_SAMPLE_PATH (Kaggle) and _VARIANT_SAMPLE_PATH (HF) — defined as module-level constants but never read at runtime; tests use their own _REPO_ROOT-rooted paths. Also drop the now-unused socketserver import. - DOC LIE (minor): _resolve_cover_image docstring in the Kaggle script claimed "we prefer the kaggle-tree copy" without acknowledging that release/kaggle/dataset-cover-image.png is gitignored on a fresh checkout. Reworded to call out the lookup order and gitignore reality. Rendered HTML output unchanged; all 48 preview tests still pass; audit-sync samples remain byte-identical. Co-Authored-By: Claude Opus 4.7 --- scripts/preview_hf_page.py | 13 ++++--------- scripts/preview_kaggle_page.py | 31 +++++++++++++++---------------- 2 files changed, 19 insertions(+), 25 deletions(-) diff --git a/scripts/preview_hf_page.py b/scripts/preview_hf_page.py index 51dcebd..e4c6ecc 100644 --- a/scripts/preview_hf_page.py +++ b/scripts/preview_hf_page.py @@ -38,7 +38,6 @@ import argparse import http.server import re -import socketserver import sys import webbrowser from collections.abc import Sequence @@ -62,16 +61,11 @@ DEFAULT_OUT_DIR_INSTRUCTOR: Final[Path] = Path("release/_preview/huggingface-instructor") DEFAULT_PORT: Final[int] = 8766 -#: Per-variant relative paths to the README (under ``release_dir``) -#: and the committed sample HTML (under ``release/_preview_committed/``). +#: Per-variant relative path to the README (under ``release_dir``). 
_VARIANT_README_REL: Final[dict[str, Path]] = { "public": Path("huggingface/README.md"), "instructor": Path("huggingface-instructor/README.md"), } -_VARIANT_SAMPLE_PATH: Final[dict[str, Path]] = { - "public": Path("release/_preview_committed/huggingface_public.html"), - "instructor": Path("release/_preview_committed/huggingface_instructor.html"), -} VALID_VARIANTS: Final[tuple[str, ...]] = ("public", "instructor") @@ -467,7 +461,8 @@ def run_preview(config: PreviewConfig) -> PreviewOutcome: def _serve(directory: Path, port: int, *, open_browser: bool) -> None: """Start a stdlib HTTP server rooted at ``directory`` and block. - Same posture as the Kaggle preview — see that module for rationale. + Same posture as the Kaggle preview — see that module for the + ``allow_reuse_address`` rationale. """ handler_factory = _make_handler_factory(directory) @@ -475,7 +470,7 @@ def _serve(directory: Path, port: int, *, open_browser: bool) -> None: print(f"serving {directory} at {url} — Ctrl-C to stop", file=sys.stderr) if open_browser: webbrowser.open(url) - with socketserver.ThreadingTCPServer(("", port), handler_factory) as httpd: + with http.server.ThreadingHTTPServer(("", port), handler_factory) as httpd: httpd.serve_forever() diff --git a/scripts/preview_kaggle_page.py b/scripts/preview_kaggle_page.py index ddc4b73..8e37011 100644 --- a/scripts/preview_kaggle_page.py +++ b/scripts/preview_kaggle_page.py @@ -39,7 +39,6 @@ import http.server import json import re -import socketserver import sys import webbrowser from collections.abc import Sequence @@ -62,10 +61,6 @@ DEFAULT_OUT_DIR: Final[Path] = Path("release/_preview/kaggle") DEFAULT_PORT: Final[int] = 8765 -#: The committed sample HTML used by the audit-artefact-sync test. -#: Located outside ``release/_preview/`` so it is not gitignored. 
-COMMITTED_SAMPLE_PATH: Final[Path] = Path("release/_preview_committed/kaggle.html") - # --------------------------------------------------------------------------- # Markdown rendering (gated behind the [publish] extra) @@ -443,14 +438,12 @@ class PreviewOutcome: def _resolve_cover_image(release_dir: Path, image_name: str) -> Path: """Locate the cover image referenced by the metadata's ``image``. - The Kaggle packager (PR 5.1) copies the cover image into - ``release/kaggle/`` next to ``dataset-metadata.json`` AND leaves - the master copy at ``release/dataset-cover-image.png``. We - prefer the kaggle-tree copy (closer to the artefact the publish - PR will upload) and fall back to ``release_dir`` for the - bare-basename case. Returning the resolved path here mirrors - ``_release_common.resolve_cover_image_path`` so the assembler and - inputs cannot disagree. + Lookup order: ``release/kaggle/`` (assembled + upload-tree copy, present after the maintainer runs the Kaggle + packager — gitignored, so absent on a fresh checkout) → + ``release/`` (the committed master copy). Returning + the resolved path here mirrors ``_release_common.resolve_cover_image_path`` + so the assembler and inputs cannot disagree. """ candidates = [ @@ -504,9 +497,15 @@ def run_preview(config: PreviewConfig) -> PreviewOutcome: def _serve(directory: Path, port: int, *, open_browser: bool) -> None: """Start a stdlib HTTP server rooted at ``directory`` and block. - Uses ``ThreadingHTTPServer`` so the maintainer's browser can fetch + Uses ``http.server.ThreadingHTTPServer`` so the browser can fetch the cover image alongside the HTML without serialising requests. 
- Block on ``serve_forever()``; KeyboardInterrupt (Ctrl-C) is the + ``ThreadingHTTPServer`` (unlike bare ``socketserver.ThreadingTCPServer``) + inherits ``allow_reuse_address = True`` from ``HTTPServer`` — + matters because Ctrl-C → re-run within ~60s would otherwise + raise ``OSError: [Errno 48] Address already in use`` while the + socket sits in TIME_WAIT. + + Blocks on ``serve_forever()``; KeyboardInterrupt (Ctrl-C) is the documented exit path. No coverage here — tests exercise the pure renderer and ``--no-serve`` path; serving is glue that requires a live socket. @@ -517,7 +516,7 @@ def _serve(directory: Path, port: int, *, open_browser: bool) -> None: print(f"serving {directory} at {url} — Ctrl-C to stop", file=sys.stderr) if open_browser: webbrowser.open(url) - with socketserver.ThreadingTCPServer(("", port), handler_factory) as httpd: + with http.server.ThreadingHTTPServer(("", port), handler_factory) as httpd: httpd.serve_forever() From 6f957dd4ceffb4262e63053eac052036c41e0292 Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Sat, 9 May 2026 10:10:01 +0300 Subject: [PATCH 4/6] PR 7.2: update .agent-plan.md with closed-entry summary Replaces the open PR 7.2 stub with the dense-summary closed entry mirroring PR 7.1's structure: every load-bearing decision, both self-review passes, and the validation-suite numbers inline. Co-Authored-By: Claude Opus 4.7 --- .agent-plan.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.agent-plan.md b/.agent-plan.md index 4f4c290..47c1e6c 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -64,7 +64,7 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family ### Phase 7 — LLM critique + publish (3 PRs) - [x] PR 7.1: LLM critique module + prompt + driver landed. 
`leadforge/validation/llm_critique.py` (new) — single-provider Anthropic critique core via an `LLMCritiqueClient` protocol (no preemptive OpenAI/Gemini stubs); `_AnthropicCritiqueClient` lazy-imports the SDK so the module imports cleanly even on machines without `anthropic` installed (the skip-cleanly path needs to work without the SDK). `has_anthropic_credentials` / `api_key_or_skip` treat unset and empty-after-strip identically as "absent", explicitly to handle the `env -i` / stale `.envrc` case where the shell sets `ANTHROPIC_API_KEY=""` and the SDK would otherwise 401 instead of cleanly skipping. Default model `claude-opus-4-7` with `thinking={"type": "adaptive", "display": "summarized"}` (only mode supported on Opus 4.7 — manual `budget_tokens` 400s) and `output_config={"effort": "high"}` (recommended minimum for intelligence-sensitive work per the `claude-api` skill); two prompt-cache breakpoints (rubric + input bundle) per the design doc's caching strategy so the common adjudication-loop workflow hits cache on both layers; streamed via `messages.stream(...).get_final_message()` to dodge the 10-min idle-connection timeout on long adaptive-thinking responses. `build_input_bundle` is pure (same `release_dir` → byte-identical bytes → identical `sha256`) and assembles eleven blocks: `release/README.md`, per-tier `dataset_card.md`, `docs/release/generation_method.md`, `manifest.json`, `feature_dictionary.csv`, `validation_report.{md,json}`, the first 100 test-split rows rendered as deterministic CSV, the public/instructor diff summary (live-derived from the `BANNED_LEAD_COLUMNS` / `BANNED_OPP_COLUMNS` / `BANNED_TABLES` / `SNAPSHOT_FILTERED_TABLES` constants in `leakage_probes.py` — single source of truth, auto-stays-in-sync, sync-tested), the public-safe mechanism summary (motif family **names** + difficulty knob **names**, never values — same redaction posture as `student_public`), and the break-me guide verbatim ("avoid re-deriving" the existing nine patterns). 
`parse_critique_response` schema-validator pins eleven malformations (missing required field, wrong severity, wrong category, wrong rubric dimension, finding-id collision, findings non-list, top-level non-object, non-JSON, score out of range, defensive code-fence stripping, empty findings list valid) and returns every problem in one error rather than the first one. Output schema is a frozen dataclass (no pydantic dependency) with the nine-value `category` vocabulary lifted **verbatim** from `break_me_guide.md` so findings route to existing issue-template labels without translation; `rubric_dimension: str` is required on every finding (D1-D14) so reviewers can audit clustering. Provenance triple (`model` / `effort` / `thinking_mode`) plus per-source-file `bundle_hashes` and the assembled `input_bundle_sha256` are carried on every result for audit-artifact-sync — re-runs on the same RC produce the same bundle hashes. `docs/release/llm_critique_prompt.md` (new) — the rubric document the driver feeds to Claude, parseable via `` / `` section markers with surrounding prose ignored; fourteen rubric dimensions (D1 documentation truthfulness · D2 leakage discipline · D3 realism vs disclosure · D4 difficulty signal · D5 calibration / value-aware ranking · D6 cohort/time-window discipline · D7 notebook integrity · D8 platform packaging hygiene · D9 adversarial-framing completeness · D10 pedagogy of the documented `total_touches_all` trap · D11 effective semantic diversity per recommendation #12 v1 scope · D12 Datasheets-for-Datasets composition · D13 manifest/provenance integrity · D14 out-of-scope guard). Severity calibration explicitly written to discourage padding the report with low-severity nits and to surface "no high-severity findings" as a positive signal vs "the critique didn't surface any". 
`scripts/run_llm_critique.py` (new) — driver mirroring `validate_release_candidate.py`'s posture (free-function `parse_args`, frozen `DriverConfig`, `run_critique(config) -> DriverResult`, `main(argv)` returning an exit code). Skip-cleanly path triggers BEFORE any I/O — no rubric read, no bundle build, no out-dir creation; tested explicitly with `not (tmp_path / "out").exists()` after the skip. Three modes alongside the live path: `--dry-run` writes the rendered input bundle to `/llm_critique_input_.md` for human inspection (different filename from the real raw JSON, can't be confused); `--no-execute` calls `api_key_or_skip` + `build_anthropic_client()` to prove the SDK is installed and creds are present without burning an API call (CI smoke); `--out-tag` suffixes the raw filename so adjudication re-runs don't shadow the canonical run. Outputs: timestamped `llm_critique_raw_.json` (accumulates per run, no clobber) + canonical `llm_critique_summary.md` (overwritten in place so dataset-card links don't rot). Exit codes mirror `validate_release_candidate.py`: 0 pass (skip-cleanly counts as pass), 1 high-severity surfaced and unresolved, 2 pre-flight error or schema-validation failure (every problem rendered to stderr, not just the first). Adjudication is **maintainer-driven** post-exit — resolve in code OR log to `v2_decision_log.md`, then re-run; the next critique's exit code is the gate. Tests: 61 cases across `tests/validation/test_llm_critique.py` (48) and `tests/scripts/test_run_llm_critique.py` (13), no live API; the protocol is exercised via a small in-process `_CannedClient` fake. Sync tests pin: every `VALID_CATEGORIES` entry appears in `break_me_guide.md` (vocabulary doesn't drift), `VALID_RUBRIC_DIMENSIONS` is exactly D1-D14, the live-derived public/instructor diff names every banned-column/banned-table constant (live reference, not duplicated string). 
Audit-artifact-sync smoke test (`test_real_release_dir_smoke`) builds the input bundle against the actual `release/intermediate/` artefacts and pins determinism on the real input, skipping cleanly when bundles aren't present. `docs/release/llm_critique_design.md` (new) records the nine load-bearing design calls before implementation so a reviewer can audit the choice (provider abstraction, skip-cleanly, model+caching+thinking, output schema, input-bundle composition, determinism via provenance, CLI flags, test posture, first-run adjudication workflow). Live first-run deferred to maintainer (no `ANTHROPIC_API_KEY` available to the agent); the dry-run path was exercised against the real release dir end-to-end, producing a 148KB byte-stable input bundle from the actual artefacts. Hostile self-review pass before requesting review caught and folded back twelve findings against the diff, including two BLOCKERs (`--no-execute` was performing pre-flight I/O before the credentials check, contradicting the design doc; raw-output filename collision at second-precision contradicted the "append-only history" promise — fixed with microsecond precision and a pinning test) and five HIGHs (silent `release_id` default that defeated the audit-artifact-sync gate; design-doc lies about a never-existing `temperature` field and "malformed timestamp" malformation that's driver-generated; dead `if/else` branches in `_safe_difficulty_knobs`; greedy regex for the rubric section markers so the prompt-injection warning paragraph that legitimately references `` doesn't break the parser). Prompt-injection mitigation added to the rubric (treat-input-as-data preamble) since the input bundle inlines user-authored content (dataset_card.md, break_me_guide.md). Schema validator hardened against silent `str()` coercion of finding prose fields (an int "claim" would have landed on disk as the string "5" — now rejected). 
Net: 1321/1321 tests pass + 5 publish-extra-gated skips; ruff + mypy clean (83 source files); leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5; validation_report timestamp drift reverted before commit per the brief. Second senior-dev review pass after PR #76 was opened caught and folded back 9 more issues, several of which were real bugs the first hostile pass missed: (B1) `--out-tag` suffixed only the raw JSON, leaving `llm_critique_summary.md` clobbered on adjudication runs — fix suffixes both files (`summary_output_path` now takes `tag`); (B2) skip-cleanly silently passed a release-readiness gate, contradicting `v1_release_roadmap.md`'s line-35 acceptance criterion that the critique must actually run — added `--require-execute` flag (default off; release-readiness CI sets it) that converts the skip path into `MissingCredentialsError` exit 2, plus a loud `WARNING — release-readiness gate has NOT been evaluated` stderr line on the regular skip path; (A2) two prompt-cache breakpoints cut to one — system content already sits inside the cached prefix on `messages.create` (system → messages render order), so the second breakpoint bought nothing and burned a slot; (M1) design doc cut from 394 lines to 73 — the 9-decision table replaces the multi-paragraph rationale-per-call shape that read as documentation theater; (M2) rubric cut from 420 lines to ~210 — each dimension now one paragraph instead of 3-6, dropped D14 ("out-of-scope guard") which was meta-instruction not a rubric dimension, made it a "What is NOT yours to audit" appendix at the end; rubric is now D1-D13 and `VALID_RUBRIC_DIMENSIONS` updated in lockstep; (M3) test-split sample replaced 100 raw rows of CSV with `df.describe(include="all")` per-column statistics + a 20-row head — distributional conclusions need statistics not raw rows, and the rendered input bundle dropped from 148KB to 128KB; (M5) 
streaming-via-`messages.stream` replaced with `messages.create(timeout=600.0)` — no stream events were processed anyway, the contract is just "don't time out on long adaptive-thinking responses" and an explicit timeout is the right way to spell that; (M6) `render_input_bundle_text` free function moved to `InputBundle.render()` method — leaky abstraction; the audit-artifact-sync framing was misleading (no committed-artefact diff) and was renamed to "smoke test against the real release dir" / "staleness check vs committed result" throughout the module and design doc. Net after the second pass: 1323/1323 tests pass + 5 publish-extra-gated skips; ruff + mypy clean; leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5; validation_report timestamp drift reverted again before this commit. First live critique run executed by the maintainer with a dedicated Anthropic project key (`leadforge-llm-critique-v1-prod`): score 7/10, six findings (1 high, 4 medium, 1 low), exit code 1 as designed for unresolved high-severity findings. Adjudication: F001 high-severity (93 % `account_id` overlap between train/test documented only in break_me_guide §5, missing from README/dataset_card) — **resolved in code** by adding a "Group-leakage warning" paragraph to `release/README.md` "Splits" subsection citing the 518/557 figure and a `GroupKFold(account_id)` recipe; the parallel disclosure on the auto-rendered `dataset_card.md` is logged as `accepted-for-v2` because the renderer change is out of scope for PR 7.1's no-bundle-regen rule. F004 medium (break_me_guide pattern 5 covered `account_id` but not `contact_id`, despite contacts being shared across the lead-keyed split at the same magnitude) — **resolved in code** by extending §5 to enumerate both keys and any reusable foreign-key column as group-leakage axes. 
F006 low (README "Conversion rate (recipe band)" column header didn't make clear it was a recipe-acceptance window not an observed range) — **resolved in code** by renaming to "(acceptance band, gate G7.\*)" and adding a one-sentence note that observed five-seed spreads sit comfortably inside the band. F002 medium (Gaussian noise produces non-physical values: negative ACV, negative day-deltas, day-deltas > snapshot_day=30, undisclosed in dataset card) — `accepted-for-v2`; requires `leadforge/narrative/dataset_card.py` change. F003 medium (`](../foo)` relative links would 404 on Kaggle/HF) — `wont-fix`: already treated by `scripts/_release_common.py::rewrite_release_links()` which both platform packagers (PR 5.1, 5.2) call at packaging time; the LLM didn't have visibility into the platform packagers and made a wrong inference. F005 medium (advanced-tier `calibration_max_bin_error = 0.5234` driven by an n=2 high-probability bin, no minimum-bin-count footnote) — `accepted-for-v2`; not a 1-line change, touches `release_quality.py` metric definition and would require regenerating `validation_report.{json,md}` which PR 7.1's brief explicitly forbids. Three missing-section callouts (Datasheets §Biases, §Privacy, per-bundle group-split warning) and three maintainer questions (noise/windowing interaction, `top_decile_rate` naming, Kaggle/HF docs subtree) all logged to `docs/release/v2_decision_log.md`. README edits cascaded into the platform packager artefacts; `release/kaggle/dataset-metadata.json` and `release/huggingface/README.md` regenerated cleanly via the existing packagers (`scripts/package_{kaggle,hf}_release.py`). Critique run output committed to `release/validation/llm_critique_raw_20260508T204359.124834Z.json` + `release/validation/llm_critique_summary.md`. 
Final net: 1325/1325 tests pass + 5 publish-extra-gated skips; ruff + mypy clean (83 source files); leakage probes 0/3 on every tier; hash determinism PASS 67/67; `validate_release_candidate --no-rebuild` exits 0; `BUNDLE_SCHEMA_VERSION` unchanged at 5. Phase 7 PR 7.1 closed; PR 7.2 (local Kaggle/HF mock-page preview) is next. -- [ ] **PR 7.2** — local Kaggle + HuggingFace mock-page preview tooling (must land before PR 7.3): `scripts/preview_kaggle_page.py` and `scripts/preview_hf_page.py` render offline HTML mocks of the public Kaggle and HF dataset pages from the *exact* upload artefacts (metadata JSON, README, cover image), serve over `localhost`, and let the maintainer click through both pages in a browser before any platform upload — catches styling / link / YAML-rendering issues before they hit cached previews on the live page. Tests cover required-field presence, link resolution, schema column listing, configs-block round-trip. +- [x] PR 7.2: local Kaggle + HuggingFace mock-page preview tooling landed. `scripts/preview_kaggle_page.py` (new) — reads the *exact* artefacts the publish PR will upload (`release/kaggle/dataset-metadata.json` + the inlined README body + the cover image, prefer `release/kaggle/dataset-cover-image.png` then fall back to the gitignore-resilient `release/dataset-cover-image.png` master copy) and renders an offline HTML page mocking the public Kaggle dataset view: header (title / subtitle / id pill / licence / update-frequency / visibility), cover image, rendered description (the inlined README body), file tree of declared resources grouped by tier with per-tier counts, schema/columns table for every tabular resource (`resources[].schema.fields[].name/type/description`) with per-table column counts in the heading, user-specified-sources block (rendered only when present), keywords + licence footer. 
Serves on `http://localhost:8765` via stdlib `http.server.ThreadingHTTPServer` (the threading variant inherits `allow_reuse_address=True` from `HTTPServer`, so Ctrl-C → re-run within ~60s does not raise `OSError [Errno 48] Address already in use` while the socket sits in TIME_WAIT — caught and folded back in self-review pass 1, the initial draft used `socketserver.ThreadingTCPServer` which defaults to `False`). `--no-serve` builds the HTML and exits (CI / inspection mode); `--open-browser` pops a tab on startup; `--port` / `--release-dir` / `--out-dir` round out the surface. `scripts/preview_hf_page.py` (new) — reads `release/huggingface/README.md` (or `release/huggingface-instructor/README.md` per `--variant=public|instructor`) and parses YAML frontmatter + Markdown body via a single anchored regex (`r"\A---\n(?P.*?)\n---\n(?P.*)\Z"` with `re.DOTALL`); renders the analogous HF view: header pills (pretty_name + license + task_categories + size_categories + language), tag chips, configs dropdown (one details-block per `configs[]` entry with the default config flagged via a single `badge--default` instance, data_files split→path table per config), file tree of declared YAML paths bucketed by config, README body, footer carrying the variant for human visual confirmation. `--variant` defaults `--out-dir` to `release/_preview/huggingface/` (public) or `release/_preview/huggingface-instructor/` (instructor); the instructor path also reads its README from a different location (`huggingface-instructor/README.md`) and looks for the cover under the variant directory first. Both scripts share the validation discipline from the Phase 5 packagers: build → validate → write; pre-flight failures (missing metadata, malformed JSON / YAML, unknown variant, missing cover) raise and the CLI converts to rc=2 without touching disk; runtime success exits 0. 
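The frontmatter split described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the group names `front` and `body` and the helper name `split_frontmatter` are assumptions introduced here, and only the anchored-regex-with-`re.DOTALL` shape comes from the entry itself.

```python
import re

# Anchored at both ends so a stray `---` later in the body cannot be mistaken
# for the frontmatter fence. Group names `front` / `body` are hypothetical.
_FRONTMATTER_RE = re.compile(
    r"\A---\n(?P<front>.*?)\n---\n(?P<body>.*)\Z",
    re.DOTALL,
)


def split_frontmatter(text: str) -> tuple[str, str]:
    """Return (yaml_source, markdown_body) or raise ValueError."""
    match = _FRONTMATTER_RE.match(text)
    if match is None:
        # Mirrors the pre-flight posture: fail before anything is written.
        raise ValueError("README does not start with a YAML frontmatter block")
    return match.group("front"), match.group("body")
```

The lazy `.*?` stops at the first closing fence, so horizontal rules inside the Markdown body stay in the body.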
Markdown rendering via `markdown-it-py` in `gfm-like` preset (tables / fenced code / strikethrough on; `linkify` explicitly disabled so the optional `linkify-it-py` transitive dep is not required); the dep is added to the `[publish]` extra alongside `datasets` / `kaggle` (mirrors the PR 5.1 / 5.2 gating posture for publish-pipeline tooling), and absent imports raise a clean `ImportError` pointing at `pip install -e ".[publish]"` instead of a cryptic stdlib `ModuleNotFoundError`. Both renderers are pure: same `(metadata|doc, cover_filename|variant)` → byte-identical HTML (no `now()`, no random, no clock). Output landing at `release/_preview//index.html` is gitignored (`.gitignore` adds `release/_preview/`); the audit-artefact-sync gate lives at `release/_preview_committed/{kaggle,huggingface_public,huggingface_instructor}.html` (committed alongside the scripts, mirrors the PR 4.1 / 5.1 / 5.2 / 7.1 audit-sync pattern). HTML is wrapped in a single self-contained file (CSS inlined, no external stylesheet) so each committed sample is human-inspectable directly from `git show` or a browser without a server. XSS-safety: every user-controlled string passes through a hand-rolled `_escape` (`&`, `<`, `>`, `"`, `'`); kept hand-rolled rather than `html.escape` so the committed samples' `&#39;` (decimal) escapes don't churn against `html.escape`'s `&#x27;` (hex) entity. Tests: 48 cases across `tests/scripts/test_preview_kaggle_page.py` (20) and `tests/scripts/test_preview_hf_page.py` (28); no live HTTP, no network, no socket open. 
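A minimal sketch of the escaping choice described above, assuming a plain chain of `str.replace` calls. The function name `escape_text` is hypothetical (the entry only names a private `_escape`), and the real helper may differ in detail. What the sketch shows is the one behavioural difference from the stdlib: the apostrophe is emitted as a decimal entity, whereas `html.escape` emits the hex form.

```python
import html


def escape_text(value: str) -> str:
    """Escape the five HTML-significant characters (hypothetical helper)."""
    return (
        value.replace("&", "&amp;")  # must run first, or later entities get re-escaped
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
        .replace("'", "&#39;")  # decimal entity, unlike html.escape's hex form
    )


sample = "<b title='x'>R&D</b>"
ours = escape_text(sample)
stdlib = html.escape(sample, quote=True)
# The two outputs differ only in how the apostrophe is spelled.
assert ours.replace("&#39;", "&#x27;") == stdlib
```

Because the committed sample HTML is compared byte-for-byte against fresh output, pinning one apostrophe spelling keeps that comparison stable.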
The four roadmap-mandated checks per script: required field labels appear in rendered HTML (Kaggle: title / subtitle / id / license / file count / schema column count; HF: pretty_name / license / configs / tags); any Markdown link in the source that resolves to a non-allowlisted URL pattern fails the test (allow-list: `https://github.com/leadforge-dev/leadforge`, `https://huggingface.co/datasets/leadforge`, sibling-relative `LICENSE`, in-document `#` anchors; anything else is a 404 risk on the live page); the Kaggle schema table lists every column declared in `resources[].schema.fields` (iterates the committed metadata, asserts each `{name}` appears); every `configs[]` block in the HF YAML round-trips into the rendered dropdown. Determinism is double-tested: `test_render_is_byte_deterministic` runs two passes against the real release artefact and pins equality; `test_committed_*_sample_matches_fresh_regeneration` pins the committed HTML against fresh regeneration byte-for-byte (the audit-sync gate). Pre-flight error paths exercised end-to-end: missing artefact (`FileNotFoundError`), malformed JSON / YAML (`ValueError`), unknown variant, missing cover image; all return rc=2 via `main()` with informative stderr. HTML escape coverage: `test_render_escapes_html_in_field_values` asserts a `