leadforge-dev · shaypal5 · May 6, 2026 · May 6, 2026 · May 6, 2026 · May 6, 2026
diff --git a/.agent-plan.md b/.agent-plan.md
@@ -46,9 +46,17 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
 - [x] PR 4.1: `release/README.md` (substantial rewrite) — release-grade dataset card per Datasheets-for-Datasets / Data Cards Playbook checklist (G10.1). New sections: macro framing paragraph (2024–2026 SaaS context, recommendation #19), simulation simplifications (modelled / approximate / not modelled, per chatgpt v2 §2.6), calibration documentation linking to `release/validation/validation_report.md`, public-vs-instructor redaction policy with concrete column lists citing `BANNED_LEAD_COLUMNS` / `BANNED_OPP_COLUMNS` / `BANNED_TABLES` / `SNAPSHOT_FILTERED_TABLES` from `leadforge/validation/leakage_probes.py`, intended-use vs out-of-scope-use, known limitations (G7.4.4 GBM−LR sign finding, weak channel signal from the Phase 4 audit, flat AUC across tiers, small cohort-shift gap), composition section per Datasheets format, adversarial-framing pointer (placeholder link to `docs/release/break_me_guide.md` that lands in PR 6.3), and a maintenance plan. Every realism / calibration / difficulty claim in the card is anchored to `validation_report.md` per G10.6. `BUNDLE_SCHEMA_VERSION` unchanged at 5 (documentation-only PR); 1167/1167 tests pass; ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0.
 
 ### Phase 5 — Platform packaging
-- [ ] `scripts/package_kaggle_release.py` → `release/kaggle/dataset-metadata.json`
+- [x] PR 5.1: `scripts/package_kaggle_release.py` (new) — Kaggle dry-run packager. Reads the
+  three public release bundles, validates Kaggle metadata constraints (`title`, `subtitle`, slug
+  `id`, single MIT license, `expectedUpdateFrequency=never`, ordered resource schemas,
+  user-specified sources, keywords, and image), writes `release/kaggle/dataset-metadata.json`, and
+  assembles `release/kaggle/` with the public tier files, README, license, and cover image. Cover
+  source decision: reproducible hand-designed funnel diagram generated by the packager, avoiding
+  stock licensing risk while keeping the asset deterministic. Tests cover metadata validation,
+  ordered schema fields, cover dimensions, upload-dir shape, CLI error handling, and byte-identical
+  metadata regeneration against the committed artifact.
 - [ ] `scripts/package_hf_release.py` → `release/huggingface/README.md` with YAML configs/default/pretty_name/tags
-- [ ] `release/dataset-cover-image.png` (≥560×280)
+- [x] `release/dataset-cover-image.png` (≥560×280)
 - [ ] Local `load_dataset()` smoke test; Kaggle dry-run package validation
 
 ### Phase 6 — Notebook sequence + adversarial framing

diff --git a/pyproject.toml b/pyproject.toml
@@ -40,10 +40,12 @@ dev = [
     "types-pyyaml>=6.0",
     "scikit-learn>=1.3",
     "matplotlib>=3.7",
+    "pillow>=10.0",
 ]
 scripts = [
     "scikit-learn>=1.3",
     "matplotlib>=3.7",
+    "pillow>=10.0",
 ]
 
 [project.scripts]

diff --git a/release/dataset-cover-image.png b/release/dataset-cover-image.png
diff --git a/release/kaggle/LICENSE b/release/kaggle/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 leadforge-dev
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/release/kaggle/README.md b/release/kaggle/README.md
@@ -0,0 +1,233 @@
+# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)
+
+A relational, reproducible, three-tier synthetic CRM dataset family for
+teaching lead scoring at scale. Generated by
+[leadforge](https://github.com/leadforge-dev/leadforge), an
+open-source Python framework for synthetic CRM/funnel data. The
+framework version is decoupled from the dataset version: the package
+stays at `1.x`; the dataset is published under the explicit `…-v1`
+tag.
+
+## Why lead scoring matters in 2024–2026
+
+Mid-market SaaS vendors entered 2024–2026 with growth slowing and
+customer-acquisition costs rising[^macro], so predicting *which* leads
+convert within a fixed window has moved from a marketing nicety to a
+survival skill. This dataset teaches that skill on a relational
+substrate, with the realistic confusions (snapshot-window discipline,
+leakage traps, channel signal weaker than vendor blogs imply) that
+students will hit when they finally get hands on real CRM data.
+
+[^macro]: Macroeconomic framing summarised in
+[`docs/external_review/summaries/gemini_v2_summary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md)
+(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio
+rose materially in 2024).
+
+## What's inside
+
+```
+.
+├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier
+│   ├── manifest.json                 # provenance + file hashes
+│   ├── dataset_card.md               # auto-rendered per-bundle card
+│   ├── feature_dictionary.csv        # authoritative column spec
+│   ├── lead_scoring.csv              # flat convenience CSV (all splits)
+│   ├── tables/*.parquet              # 7 snapshot-safe relational tables
+│   └── tasks/converted_within_90_days/{train,valid,test}.parquet
+├── dataset-metadata.json             # Kaggle dataset metadata
+├── dataset-cover-image.png           # Kaggle cover image
+├── README.md                         # Kaggle package README
+└── LICENSE
+```
+
+`student_public` bundles ship the snapshot-safe relational view;
+`research_instructor` companions ship the full-horizon view plus the
+hidden causal structure (DAG, latent registry, mechanism summary)
+under `metadata/`. The full layout is documented in each bundle's
+`manifest.json`.
+
+## Quick start
+
+```python
+# Flat CSV
+df = pd.read_csv("intermediate/lead_scoring.csv")
+
+# Parquet task splits (recommended)
+train = pd.read_parquet("intermediate/tasks/converted_within_90_days/train.parquet")
+test  = pd.read_parquet("intermediate/tasks/converted_within_90_days/test.parquet")
+
+# Relational tables (feature engineering — example)
+leads   = pd.read_parquet("intermediate/tables/leads.parquet")
+touches = pd.read_parquet("intermediate/tables/touches.parquet")
+my_touch_count = (
+    touches.groupby("lead_id").size().rename("my_touch_count").reset_index()
+)
+features = leads.merge(my_touch_count, on="lead_id", how="left")
+
+# Reproduce from source
+# pip install leadforge
+# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \
+#                    --mode student_public --difficulty intermediate --out my_bundle
+```
+
+The label `converted_within_90_days` resolves over a 90-day window;
+engagement features (`touch_count`, `session_count`, etc.) are
+computed strictly over events on days `[0, 30]`. The deliberate
+exception is `total_touches_all`, the leakage trap — flagged
+`leakage_risk=True` in `feature_dictionary.csv`. Drop it from your
+feature set unless you're demonstrating leakage detection.
+
+## Dataset summary
+
+| | Intro | Intermediate | Advanced |
+|---|---|---|---|
+| Leads | 5,000 | 5,000 | 5,000 |
+| Accounts | 1,500 | 1,500 | 1,500 |
+| Contacts | 4,200 | 4,200 | 4,200 |
+| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |
+| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |
+| Conversion rate (recipe band) | 24–61% | 12–31% | 4–12% |
+| Conversion rate (median, seeds 42–46) | 42.67% | 21.60% | 8.40% |
+| Signal strength | 0.90 | 0.70 | 0.50 |
+| Noise scale | 0.10 | 0.30 | 0.55 |
+| Missing rate | 2% | 8% | 18% |
+
+\* `student_public` / `research_instructor`. Difficulty is modulated
+by the simulation engine — signal strength on latent-trait weights,
+Gaussian noise on float features, MCAR missingness, outlier rate —
+not post-hoc label flipping.
+
+## The scenario
+
+**Veridian Technologies** is a fictional Series B startup (Austin, US)
+selling **Veridian Procure**, a procurement / AP automation SaaS, to
+mid-market firms (200–2,000 employees) in the US and UK. The funnel
+runs through inbound marketing (45%), SDR outbound (35%), and
+partner referrals (20%); four personas drive deals (VP Finance, AP
+Manager, IT Director, Procurement Manager). **Task:** predict whether
+a lead converts (`closed_won`) within 90 days. ACV bands are
+$18k–$120k. See
+[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)
+for the full DGP, and the deeper "what's modelled / approximate / not
+modelled" breakdown that this README only summarises.
+
+## Public vs instructor: what's redacted
+
+Filtering happens **during rendering**, not during simulation. The
+redaction contract is single-sourced in
+[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);
+the snapshot-safe writer and the validator import the same constants,
+so they cannot drift apart.
+
+| Source-of-truth constant | Public bundle treatment |
+|---|---|
+| `BANNED_LEAD_COLUMNS = ("converted_within_90_days", "conversion_timestamp")` | Dropped from `tables/leads.parquet` |
+| `BANNED_OPP_COLUMNS = ("close_outcome", "closed_at")` | Dropped from `tables/opportunities.parquet` |
+| `BANNED_TABLES = ("customers", "subscriptions")` | Omitted from public bundles |
+| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |
+| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |
+| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |
+
+Each bundle's `manifest.json` records `relational_snapshot_safe`,
+`redacted_columns`, and `snapshot_day`, so the bundle is
+self-describing.
+
+## Calibration
+
+Every realism / calibration / difficulty claim in this README is
+backed by
+[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),
+regenerated by
+[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)
+with bands declared in
+[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).
+Headline cross-seed medians (seeds 42–46):
+
+| Tier | LR AUC | AP | P@100 | Brier |
+|---|---|---|---|---|
+| intro | 0.879 | 0.761 | 0.80 | 0.130 |
+| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |
+| advanced | 0.886 | 0.351 | 0.34 | 0.061 |
+
+AP, P@100, conversion-rate, and lift orderings hold across the
+intended difficulty axis (intro > intermediate > advanced).
+
+## Intended uses
+
+- Teaching baseline lead-scoring on a flat snapshot.
+- Teaching relational feature engineering against snapshot-safe tables.
+- Teaching leakage detection (the `total_touches_all` trap is
+  designed to be discoverable).
+- Teaching calibration, lift, P@K, value-aware ranking
+  (`expected_acv × P(convert)`), and cohort-shift evaluation.
+- Comparing model families under a controlled DGP.
+
+## Out-of-scope uses
+
+- **Production lead scoring.** The company, product, and customers are
+  fictional.
+- **Vendor benchmarking / paper baselines.** Difficulty tiers are
+  calibrated for pedagogy, not cross-paper comparability.
+- **Causal-inference research that requires recovery of the true DGP.**
+  The instructor companion exposes the hidden graph for teaching, not
+  designed counterfactuals.
+- **Demographic / fairness research.** v1 does not model protected
+  attributes.
+
+## Known limitations
+
+- **Difficulty signal on raw AUC is flat.** LR AUC is ~0.88 across
+  every tier. Difficulty is visible in AP, P@K, Brier, and value
+  capture. Treat AUC as a sanity check, not a difficulty signal.
+- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta
+  is slightly negative in every tier (intro −0.0045, intermediate
+  −0.0072, advanced −0.0133); v1's snapshot is dominated by linear
+  features. v2 will inject non-linear interactions in the simulator.
+- **Channel signal is weak.** Per
+  [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),
+  out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across
+  all tiers and the per-channel rate spread is ≤0.05. The simulator
+  does not encode channel-conditional probabilities; channel-conditional
+  encoding is post-v1 work.
+- **Cohort-shift degradation is small.** v1 has no time-of-year drift
+  baked in; the cohort-shift gate (G6.4) is informational and will
+  bite in v2.
+
+## Composition
+
+- **Entities.** Accounts, contacts, leads, touches, sessions,
+  sales_activities, opportunities (public); plus customers and
+  subscriptions (instructor only). Per-row counts per bundle live in
+  `manifest.json`.
+- **Features.** 32 public columns grouped by analytical role in
+  [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);
+  the per-bundle `feature_dictionary.csv` is the authoritative
+  machine-readable spec.
+- **Label.** `converted_within_90_days` (boolean), event-derived from
+  the simulator. Never sampled directly.
+- **Splits.** 70/15/15 train/valid/test, deterministic given seed;
+  recorded in `tasks/converted_within_90_days/task_manifest.json`.
+- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package
+  version stamped in `manifest.json`.
+
+## Maintenance, adversarial framing, license
+
+We *want* the dataset to be broken. Issue templates ship under
+`.github/ISSUE_TEMPLATE/` (Phase 6); the break-me guide lands as
+`docs/release/break_me_guide.md` (PR 6.3). Once Phase 6 ships,
+`docs/release/v2_decision_log.md` will track every accepted finding
+and the design call that came from it. File issues at
+[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);
+PRs welcome.
+
+| Field | Value |
+|---|---|
+| Generator | leadforge `1.0.0+` |
+| Recipe | `b2b_saas_procurement_v1` |
+| Canonical seed | 42 (cross-seed sweep: 42–46) |
+| Bundle schema version | 5 |
+| Format | Parquet (canonical) + CSV (convenience) |
+| License | MIT — see [LICENSE](LICENSE) |
+
+Verify integrity with `leadforge validate <bundle_dir>`; every file
+is hashed in `manifest.json`.
diff --git a/release/kaggle/advanced/dataset_card.md b/release/kaggle/advanced/dataset_card.md
@@ -0,0 +1,74 @@
+# leadforge dataset card
+
+| Field | Value |
+|---|---|
+| Recipe | `b2b_saas_procurement_v1` |
+| Package version | `1.0.0` |
+| Seed | `42` |
+| Exposure mode | `student_public` |
+| Difficulty | `advanced` |
+| Horizon | 90 days |
+| Label window | 90 days |
+| Feature snapshot window | 30 days (windowed) |
+
+## Narrative summary
+
+**Vendor:** Veridian Technologies (Series B, founded 2017, Austin, US)
+
+**Product:** Veridian Procure — Procurement & AP Automation. Deployment: cloud_saas. Pricing: per_seat_annual. ACV range: $18,000–$120,000.
+
+**Target market:** 200–2000-employee firms in US, UK. Key industries: manufacturing, logistics, professional_services, healthcare_non_clinical. Average deal size: $42,000. Average sales cycle: 45 days.
+
+**GTM motion:** inbound_marketing, sdr_outbound, partner_referral (45% inbound / 35% outbound / 20% partner).
+
+**Buyer personas:**
+
+- **vp_finance** (economic_buyer) — VP Finance, CFO…
+- **ap_manager** (champion) — AP Manager, Accounts Payable Manager…
+- **it_director** (technical_evaluator) — IT Director, CTO…
+- **procurement_manager** (end_user) — Procurement Manager, Director of Procurement…
+
+## Primary task
+
+**Task:** `converted_within_90_days`
+
+**Label definition:** A lead is considered converted if a `closed_won` event is recorded within 90 days of the lead's snapshot anchor date. The label is event-derived — never sampled directly. All features are pre-anchor (leakage-free by construction).
+
+## Table inventory
+
+| Table | Rows |
+|---|---:|
+| accounts | 1,500 |
+| contacts | 4,200 |
+| leads | 5,000 |
+| touches | 38,208 |
+| sessions | 9,942 |
+| sales_activities | 19,995 |
+| opportunities | 4,004 |
+
+## Feature categories
+
+| Category | Count | Examples |
+|---|---:|---|
+| account | 6 | account_id, industry, region |
+| contact | 4 | contact_id, role_function, seniority |
+| lead_meta | 4 | lead_id, lead_created_at, lead_source |
+| engagement | 11 | touch_count, inbound_touch_count, outbound_touch_count |
+| sales | 6 | activity_count, days_since_last_touch, opportunity_created |
+| target | 1 |  |
+
+**Leakage-flagged columns:** `total_touches_all`. See `feature_dictionary.csv` for details.
+
+## Suggested use cases
+
+- Teaching binary classification on realistic CRM data
+- Portfolio projects demonstrating end-to-end ML pipelines
+- Benchmarking lead-scoring models under controlled signal/noise conditions
+- Research on causal structure in funnel conversion data
+
+## Caveats
+
+- This is **synthetic** data. It does not represent any real company, product, or market.
+- The hidden world structure varies by motif family and stochastic rewiring; no two seeds produce the same DGP.
+- The label is evaluated over the full 90-day window from lead creation; event-aggregate features (e.g. `touch_count`, `session_count`, `expected_acv`) observe only the first 30 days of that window. The deliberate exception is `total_touches_all`, which counts touches over the full 90-day horizon as a pedagogical leakage trap.
+- In `student_public` mode, the latent world graph, mechanism summary, and full world spec are withheld.