Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 10 additions & 2 deletions .agent-plan.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,9 +46,17 @@ Goal: ship a best-in-class educational synthetic CRM lead-scoring dataset family
- [x] PR 4.1: `release/README.md` (substantial rewrite) — release-grade dataset card per Datasheets-for-Datasets / Data Cards Playbook checklist (G10.1). New sections: macro framing paragraph (2024–2026 SaaS context, recommendation #19), simulation simplifications (modelled / approximate / not modelled, per chatgpt v2 §2.6), calibration documentation linking to `release/validation/validation_report.md`, public-vs-instructor redaction policy with concrete column lists citing `BANNED_LEAD_COLUMNS` / `BANNED_OPP_COLUMNS` / `BANNED_TABLES` / `SNAPSHOT_FILTERED_TABLES` from `leadforge/validation/leakage_probes.py`, intended-use vs out-of-scope-use, known limitations (G7.4.4 GBM−LR sign finding, weak channel signal from the Phase 4 audit, flat AUC across tiers, small cohort-shift gap), composition section per Datasheets format, adversarial-framing pointer (placeholder link to `docs/release/break_me_guide.md` that lands in PR 6.3), and a maintenance plan. Every realism / calibration / difficulty claim in the card is anchored to `validation_report.md` per G10.6. `BUNDLE_SCHEMA_VERSION` unchanged at 5 (documentation-only PR); 1167/1167 tests pass; ruff + mypy clean; `scripts/probe_relational_leakage.py release/{intro,intermediate,advanced} --max-accuracy 0.65` exits 0 on every public tier; `scripts/verify_hash_determinism.py` PASS 67/67; `scripts/validate_release_candidate.py --no-rebuild` exits 0.

### Phase 5 — Platform packaging
- [ ] `scripts/package_kaggle_release.py` → `release/kaggle/dataset-metadata.json`
- [x] PR 5.1: `scripts/package_kaggle_release.py` (new) — Kaggle dry-run packager. Reads the
three public release bundles, validates Kaggle metadata constraints (`title`, `subtitle`, slug
`id`, single MIT license, `expectedUpdateFrequency=never`, ordered resource schemas,
user-specified sources, keywords, and image), writes `release/kaggle/dataset-metadata.json`, and
assembles `release/kaggle/` with the public tier files, README, license, and cover image. Cover
source decision: reproducible hand-designed funnel diagram generated by the packager, avoiding
stock licensing risk while keeping the asset deterministic. Tests cover metadata validation,
ordered schema fields, cover dimensions, upload-dir shape, CLI error handling, and byte-identical
metadata regeneration against the committed artifact.
- [ ] `scripts/package_hf_release.py` → `release/huggingface/README.md` with YAML configs/default/pretty_name/tags
- [ ] `release/dataset-cover-image.png` (≥560×280)
- [x] `release/dataset-cover-image.png` (≥560×280)
- [ ] Local `load_dataset()` smoke test; Kaggle dry-run package validation

### Phase 6 — Notebook sequence + adversarial framing
Expand Down
2 changes: 2 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -40,10 +40,12 @@ dev = [
"types-pyyaml>=6.0",
"scikit-learn>=1.3",
"matplotlib>=3.7",
"pillow>=10.0",
]
scripts = [
"scikit-learn>=1.3",
"matplotlib>=3.7",
"pillow>=10.0",
]

[project.scripts]
Expand Down
Binary file added release/dataset-cover-image.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
21 changes: 21 additions & 0 deletions release/kaggle/LICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 leadforge-dev

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
233 changes: 233 additions & 0 deletions release/kaggle/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,233 @@
# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)

A relational, reproducible, three-tier synthetic CRM dataset family for
teaching lead scoring at scale. Generated by
[leadforge](https://github.com/leadforge-dev/leadforge), an
open-source Python framework for synthetic CRM/funnel data. The
framework version is decoupled from the dataset version: the package
stays at `1.x`; the dataset is published under the explicit `…-v1`
tag.

## Why lead scoring matters in 2024–2026

Mid-market SaaS vendors entered 2024–2026 with growth slowing and
customer-acquisition costs rising[^macro], so predicting *which* leads
convert within a fixed window has moved from a marketing nicety to a
survival skill. This dataset teaches that skill on a relational
substrate, with the realistic confusions (snapshot-window discipline,
leakage traps, channel signal weaker than vendor blogs imply) that
students will hit when they finally get hands on real CRM data.

[^macro]: Macroeconomic framing summarised in
[`docs/external_review/summaries/gemini_v2_summary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md)
(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio
rose materially in 2024).

## What's inside

```
.
├── intro/ intermediate/ advanced/ # student_public bundles, one per difficulty tier
│ ├── manifest.json # provenance + file hashes
│ ├── dataset_card.md # auto-rendered per-bundle card
│ ├── feature_dictionary.csv # authoritative column spec
│ ├── lead_scoring.csv # flat convenience CSV (all splits)
│ ├── tables/*.parquet # 7 snapshot-safe relational tables
│ └── tasks/converted_within_90_days/{train,valid,test}.parquet
├── dataset-metadata.json # Kaggle dataset metadata
├── dataset-cover-image.png # Kaggle cover image
├── README.md # Kaggle package README
└── LICENSE
```
Comment thread
shaypal5 marked this conversation as resolved.

`student_public` bundles ship the snapshot-safe relational view;
`research_instructor` companions ship the full-horizon view plus the
hidden causal structure (DAG, latent registry, mechanism summary)
under `metadata/`. The full layout is documented in each bundle's
`manifest.json`.

## Quick start

```python
# Flat CSV
df = pd.read_csv("intermediate/lead_scoring.csv")

# Parquet task splits (recommended)
train = pd.read_parquet("intermediate/tasks/converted_within_90_days/train.parquet")
test = pd.read_parquet("intermediate/tasks/converted_within_90_days/test.parquet")

# Relational tables (feature engineering — example)
leads = pd.read_parquet("intermediate/tables/leads.parquet")
touches = pd.read_parquet("intermediate/tables/touches.parquet")
my_touch_count = (
touches.groupby("lead_id").size().rename("my_touch_count").reset_index()
)
features = leads.merge(my_touch_count, on="lead_id", how="left")

# Reproduce from source
# pip install leadforge
# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \
# --mode student_public --difficulty intermediate --out my_bundle
```

The label `converted_within_90_days` resolves over a 90-day window;
engagement features (`touch_count`, `session_count`, etc.) are
computed strictly over events on days `[0, 30]`. The deliberate
exception is `total_touches_all`, the leakage trap — flagged
`leakage_risk=True` in `feature_dictionary.csv`. Drop it from your
feature set unless you're demonstrating leakage detection.

## Dataset summary

| | Intro | Intermediate | Advanced |
|---|---|---|---|
| Leads | 5,000 | 5,000 | 5,000 |
| Accounts | 1,500 | 1,500 | 1,500 |
| Contacts | 4,200 | 4,200 | 4,200 |
| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |
| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |
| Conversion rate (recipe band) | 24–61% | 12–31% | 4–12% |
| Conversion rate (median, seeds 42–46) | 42.67% | 21.60% | 8.40% |
| Signal strength | 0.90 | 0.70 | 0.50 |
| Noise scale | 0.10 | 0.30 | 0.55 |
| Missing rate | 2% | 8% | 18% |

\* `student_public` / `research_instructor`. Difficulty is modulated
by the simulation engine — signal strength on latent-trait weights,
Gaussian noise on float features, MCAR missingness, outlier rate —
not post-hoc label flipping.

## The scenario

**Veridian Technologies** is a fictional Series B startup (Austin, US)
selling **Veridian Procure**, a procurement / AP automation SaaS, to
mid-market firms (200–2,000 employees) in the US and UK. The funnel
runs through inbound marketing (45%), SDR outbound (35%), and
partner referrals (20%); four personas drive deals (VP Finance, AP
Manager, IT Director, Procurement Manager). **Task:** predict whether
a lead converts (`closed_won`) within 90 days. ACV bands are
$18k–$120k. See
[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)
for the full DGP, and the deeper "what's modelled / approximate / not
modelled" breakdown that this README only summarises.

## Public vs instructor: what's redacted

Filtering happens **during rendering**, not during simulation. The
redaction contract is single-sourced in
[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);
the snapshot-safe writer and the validator import the same constants,
so they cannot drift apart.

| Source-of-truth constant | Public bundle treatment |
|---|---|
| `BANNED_LEAD_COLUMNS = ("converted_within_90_days", "conversion_timestamp")` | Dropped from `tables/leads.parquet` |
| `BANNED_OPP_COLUMNS = ("close_outcome", "closed_at")` | Dropped from `tables/opportunities.parquet` |
| `BANNED_TABLES = ("customers", "subscriptions")` | Omitted from public bundles |
| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |
| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |
| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |

Each bundle's `manifest.json` records `relational_snapshot_safe`,
`redacted_columns`, and `snapshot_day`, so the bundle is
self-describing.

## Calibration

Every realism / calibration / difficulty claim in this README is
backed by
[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),
regenerated by
[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)
with bands declared in
[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).
Headline cross-seed medians (seeds 42–46):

| Tier | LR AUC | AP | P@100 | Brier |
|---|---|---|---|---|
| intro | 0.879 | 0.761 | 0.80 | 0.130 |
| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |
| advanced | 0.886 | 0.351 | 0.34 | 0.061 |

AP, P@100, conversion-rate, and lift orderings hold across the
intended difficulty axis (intro > intermediate > advanced).

## Intended uses

- Teaching baseline lead-scoring on a flat snapshot.
- Teaching relational feature engineering against snapshot-safe tables.
- Teaching leakage detection (the `total_touches_all` trap is
designed to be discoverable).
- Teaching calibration, lift, P@K, value-aware ranking
(`expected_acv × P(convert)`), and cohort-shift evaluation.
- Comparing model families under a controlled DGP.

## Out-of-scope uses

- **Production lead scoring.** The company, product, and customers are
fictional.
- **Vendor benchmarking / paper baselines.** Difficulty tiers are
calibrated for pedagogy, not cross-paper comparability.
- **Causal-inference research that requires recovery of the true DGP.**
The instructor companion exposes the hidden graph for teaching, not
designed counterfactuals.
- **Demographic / fairness research.** v1 does not model protected
attributes.

## Known limitations

- **Difficulty signal on raw AUC is flat.** LR AUC is ~0.88 across
every tier. Difficulty is visible in AP, P@K, Brier, and value
capture. Treat AUC as a sanity check, not a difficulty signal.
- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta
is slightly negative in every tier (intro −0.0045, intermediate
−0.0072, advanced −0.0133); v1's snapshot is dominated by linear
features. v2 will inject non-linear interactions in the simulator.
- **Channel signal is weak.** Per
[`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),
out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across
all tiers and the per-channel rate spread is ≤0.05. The simulator
does not encode channel-conditional probabilities; channel-conditional
encoding is post-v1 work.
- **Cohort-shift degradation is small.** v1 has no time-of-year drift
baked in; the cohort-shift gate (G6.4) is informational and will
bite in v2.

## Composition

- **Entities.** Accounts, contacts, leads, touches, sessions,
sales_activities, opportunities (public); plus customers and
subscriptions (instructor only). Per-row counts per bundle live in
`manifest.json`.
- **Features.** 32 public columns grouped by analytical role in
[`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);
the per-bundle `feature_dictionary.csv` is the authoritative
machine-readable spec.
- **Label.** `converted_within_90_days` (boolean), event-derived from
the simulator. Never sampled directly.
- **Splits.** 70/15/15 train/valid/test, deterministic given seed;
recorded in `tasks/converted_within_90_days/task_manifest.json`.
- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package
version stamped in `manifest.json`.

## Maintenance, adversarial framing, license

We *want* the dataset to be broken. Issue templates ship under
`.github/ISSUE_TEMPLATE/` (Phase 6); the break-me guide lands as
`docs/release/break_me_guide.md` (PR 6.3). Once Phase 6 ships,
`docs/release/v2_decision_log.md` will track every accepted finding
and the design call that came from it. File issues at
[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);
PRs welcome.

| Field | Value |
|---|---|
| Generator | leadforge `1.0.0+` |
| Recipe | `b2b_saas_procurement_v1` |
| Canonical seed | 42 (cross-seed sweep: 42–46) |
| Bundle schema version | 5 |
| Format | Parquet (canonical) + CSV (convenience) |
| License | MIT — see [LICENSE](LICENSE) |

Verify integrity with `leadforge validate <bundle_dir>`; every file
is hashed in `manifest.json`.
74 changes: 74 additions & 0 deletions release/kaggle/advanced/dataset_card.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
# leadforge dataset card

| Field | Value |
|---|---|
| Recipe | `b2b_saas_procurement_v1` |
| Package version | `1.0.0` |
| Seed | `42` |
| Exposure mode | `student_public` |
| Difficulty | `advanced` |
| Horizon | 90 days |
| Label window | 90 days |
| Feature snapshot window | 30 days (windowed) |

## Narrative summary

**Vendor:** Veridian Technologies (Series B, founded 2017, Austin, US)

**Product:** Veridian Procure — Procurement & AP Automation. Deployment: cloud_saas. Pricing: per_seat_annual. ACV range: $18,000–$120,000.

**Target market:** 200–2000-employee firms in US, UK. Key industries: manufacturing, logistics, professional_services, healthcare_non_clinical. Average deal size: $42,000. Average sales cycle: 45 days.

**GTM motion:** inbound_marketing, sdr_outbound, partner_referral (45% inbound / 35% outbound / 20% partner).

**Buyer personas:**

- **vp_finance** (economic_buyer) — VP Finance, CFO…
- **ap_manager** (champion) — AP Manager, Accounts Payable Manager…
- **it_director** (technical_evaluator) — IT Director, CTO…
- **procurement_manager** (end_user) — Procurement Manager, Director of Procurement…

## Primary task

**Task:** `converted_within_90_days`

**Label definition:** A lead is considered converted if a `closed_won` event is recorded within 90 days of the lead's snapshot anchor date. The label is event-derived — never sampled directly. All features are pre-anchor (leakage-free by construction).

## Table inventory

| Table | Rows |
|---|---:|
| accounts | 1,500 |
| contacts | 4,200 |
| leads | 5,000 |
| touches | 38,208 |
| sessions | 9,942 |
| sales_activities | 19,995 |
| opportunities | 4,004 |

## Feature categories

| Category | Count | Examples |
|---|---:|---|
| account | 6 | account_id, industry, region |
| contact | 4 | contact_id, role_function, seniority |
| lead_meta | 4 | lead_id, lead_created_at, lead_source |
| engagement | 11 | touch_count, inbound_touch_count, outbound_touch_count |
| sales | 6 | activity_count, days_since_last_touch, opportunity_created |
| target | 1 | |

**Leakage-flagged columns:** `total_touches_all`. See `feature_dictionary.csv` for details.

## Suggested use cases

- Teaching binary classification on realistic CRM data
- Portfolio projects demonstrating end-to-end ML pipelines
- Benchmarking lead-scoring models under controlled signal/noise conditions
- Research on causal structure in funnel conversion data

## Caveats

- This is **synthetic** data. It does not represent any real company, product, or market.
- The hidden world structure varies by motif family and stochastic rewiring; no two seeds produce the same DGP.
- The label is evaluated over the full 90-day window from lead creation; event-aggregate features (e.g. `touch_count`, `session_count`, `expected_acv`) observe only the first 30 days of that window. The deliberate exception is `total_touches_all`, which counts touches over the full 90-day horizon as a pedagogical leakage trap.
- In `student_public` mode, the latent world graph, mechanism summary, and full world spec are withheld.
Loading
Loading