diff --git a/release/kaggle/dataset-metadata.json b/release/kaggle/dataset-metadata.json index dcda826..70731fc 100644 --- a/release/kaggle/dataset-metadata.json +++ b/release/kaggle/dataset-metadata.json @@ -1,8 +1,8 @@ { - "collaborators": [{"username": "derelictpanda", "role": "writer"}], - "description": "# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)\n\nA relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Generated by\n[leadforge](https://github.com/leadforge-dev/leadforge), an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at `1.x`; the dataset is published under the explicit `…-v1`\ntag.\n\n## Why lead scoring matters in 2024–2026\n\nMid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising[^macro], so predicting *which* leads\nconvert within a fixed window has moved from a marketing nicety to a\nsurvival skill. This dataset teaches that skill on a relational\nsubstrate, with the realistic confusions (snapshot-window discipline,\nleakage traps, channel signal weaker than vendor blogs imply) that\nstudents will hit when they finally get hands on real CRM data.\n\n[^macro]: Macroeconomic framing summarised in\n[`docs/external_review/summaries/gemini_v2_summary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md)\n(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio\nrose materially in 2024).\n\n## What's inside\n\n```\n.\n├── intro/ intermediate/ advanced/ # student_public bundles, one per difficulty tier\n│ ├── manifest.json # provenance + file hashes\n│ ├── metrics.json # per-tier headline metrics (medians + spreads)\n│ ├── dataset_card.md # auto-rendered per-bundle card\n│ ├── feature_dictionary.csv # authoritative column spec\n│ ├── lead_scoring.csv # flat convenience CSV (all splits)\n│ ├── tables/*.parquet # 7 snapshot-safe relational tables\n│ └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── docs/ # vendored DGP / leakage / break-me docs (agent-readable)\n├── metrics.json # top-level cross-tier metrics summary\n├── claims_register.{md,json} # claims → backing-artifact map (agent-readable)\n├── dataset-metadata.json # Kaggle dataset metadata\n├── dataset-cover-image.png # Kaggle cover image\n├── README.md # Kaggle package README\n└── LICENSE\n```\n\n`student_public` bundles ship the snapshot-safe relational view;\n`research_instructor` companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder `metadata/`. The full layout is documented in each bundle's\n`manifest.json`.\n\n### Agent-reviewable artifacts\n\nThe published bundle is self-contained for AI review and offline\nauditing — every numeric / structural claim on this page can be\nverified without following an external link:\n\n- **`metrics.json` (root) + `/metrics.json`** — deterministic\n JSON view of the headline LR AUC / AP / P@100 / Brier / conversion\n rate / cohort-shift / cross-tier-ordering medians, with JSON-path\n back-references to `validation/validation_report.json` (the\n source of truth).\n- **`claims_register.{md,json}`** — every numerical or structural\n claim on this page paired with the artifact and path that backs it.\n Rendered from `claims_register_source.yaml` by\n `scripts/build_claims_register.py`.\n- **`docs/`** — vendored copies of `generation_method.md`,\n `channel_signal_audit.md`, `break_me_guide.md`,\n `feature_dictionary.md`, `v1_acceptance_gates_bands.yaml`,\n `v2_decision_log.md`, plus a hand-authored\n `relational_table_schemas.csv` documenting every column of every\n relational table. These match the GitHub-blob links cited below but\n ship inside the bundle so a reviewer never needs network access.\n- **`/manifest.json`** — SHA-256 hash for every file plus the\n full redaction contract (`structural_redactions.columns`,\n `omitted_tables`, `relational_snapshot_safe`, `snapshot_day`).\n- Kaggle / HuggingFace preview pages additionally inject a\n `schema.org/Dataset` JSON-LD block in their `` for agent\n ingestion without HTML parsing.\n\n## Quick start\n\n```python\n# Flat CSV\ndf = pd.read_csv(\"intermediate/lead_scoring.csv\")\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/train.parquet\")\ntest = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/test.parquet\")\n\n# Relational tables (feature engineering — example)\nleads = pd.read_parquet(\"intermediate/tables/leads.parquet\")\ntouches = pd.read_parquet(\"intermediate/tables/touches.parquet\")\nmy_touch_count = (\n touches.groupby(\"lead_id\").size().rename(\"my_touch_count\").reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=\"lead_id\", how=\"left\")\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n# --mode student_public --difficulty intermediate --out my_bundle\n```\n\nThe label `converted_within_90_days` resolves over a 90-day window;\nengagement features (`touch_count`, `session_count`, etc.) are\ncomputed strictly over events on days `[0, 30]`. The deliberate\nexception is `total_touches_all`, the leakage trap — flagged\n`leakage_risk=True` in `feature_dictionary.csv`. Drop it from your\nfeature set unless you're demonstrating leakage detection.\n\n## Evaluation note — account and contact overlap\n\n**518 of 557 test accounts (≈93 %) appear in train** on the intermediate\nbundle; the other tiers are similar. Contact-level overlap is comparable\nin magnitude: most test contacts also have activity in the training set.\nThe random-split headline metrics therefore ride both account-level and\ncontact-level signal across the split boundary and over-estimate\ngeneralisation to unseen accounts and contacts. For a faithful\nout-of-sample number, retrain with `GroupKFold(account_id)` and report\nboth metrics. Notebook 02 demonstrates the detection recipe;\n[`break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) §5 gives\nthe worked example.\n\n## Dataset summary\n\n**Tiers are prevalence and noise axes, not modelling-complexity axes.**\nLR AUC is ~0.88 in every tier by design. The tiers differ in conversion\nrate, missingness, and noise — not rank discrimination. Choose a tier\nbased on the teaching exercise, not on expected AUC:\n\n| | Intro | Intermediate | Advanced |\n|---|---|---|---|\n| **Tier purpose** | High-prevalence warm-up | Default benchmark | Low-prevalence · calibration · noise exercise |\n| Leads | 5,000 | 5,000 | 5,000 |\n| Accounts | 1,500 | 1,500 | 1,500 |\n| Contacts | 4,200 | 4,200 | 4,200 |\n| Snapshot columns | 31 / 34* | 31 / 34* | 31 / 34* |\n| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |\n| Conversion rate (acceptance band, gate G7.\\*) | 24–61% | 12–31% | 4–12% |\n| Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |\n| Signal strength | 0.90 | 0.70 | 0.50 |\n| Noise scale | 0.10 | 0.30 | 0.55 |\n| Missing rate | 2% | 8% | 18% |\n\n\\* `student_public` / `research_instructor`. Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping. The acceptance band is the recipe\ngate's tolerance window (`v1_acceptance_gates_bands.yaml` G7.\\*),\nnot the achievable range — observed five-seed spreads sit\ncomfortably inside the band.\n\n## The scenario\n\n**Veridian Technologies** is a fictional Series B startup (Austin, US)\nselling **Veridian Procure**, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). **Task:** predict whether\na lead converts (`closed_won`) within 90 days. ACV bands are\n$18k–$120k. See\n[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)\nfor the full DGP, and the deeper \"what's modelled / approximate / not\nmodelled\" breakdown that this README only summarises.\n\n## Public vs instructor: what's redacted\n\nFiltering happens **during rendering**, not during simulation. The\nredaction contract is single-sourced in\n[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.\n\n| Source-of-truth constant | Public bundle treatment |\n|---|---|\n| `BANNED_LEAD_COLUMNS = (\"converted_within_90_days\", \"conversion_timestamp\")` | Dropped from `tables/leads.parquet` |\n| `BANNED_OPP_COLUMNS = (\"close_outcome\", \"closed_at\")` | Dropped from `tables/opportunities.parquet` |\n| `BANNED_TABLES = (\"customers\", \"subscriptions\")` | Omitted from public bundles |\n| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |\n| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |\n| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |\n\nEach bundle's `manifest.json` records `relational_snapshot_safe`,\n`redacted_columns`, and `snapshot_day`, so the bundle is\nself-describing.\n\n## Calibration\n\nEvery realism / calibration / difficulty claim in this README is\nbacked by\n[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),\nregenerated by\n[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)\nwith bands declared in\n[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).\nHeadline cross-seed medians (seeds 42–46):\n\n| Tier | LR AUC | AP | P@100 | Brier | `calibration_max_bin_error` |\n|---|---|---|---|---|---|\n| intro | 0.879 | 0.761 | 0.80 | 0.130 | 0.25 |\n| intermediate | 0.886 | 0.575 | 0.59 | 0.110 | 0.25 |\n| advanced | 0.886 | 0.351 | 0.34 | 0.061 | **0.52** |\n\n**Reading this table:** LR AUC is flat across tiers by design — the\ntiers are a prevalence / noise axis, not a rank-discrimination axis.\nBrier score *improves* as prevalence falls (a prevalence effect, not\nbetter calibration); use `calibration_max_bin_error` to assess\ncalibration quality. Advanced's 0.52 max-bin error means the model's\npredicted probabilities are materially mis-scaled against actual\nconversion rates — a realistic miscalibration exercise.\n\nAP, P@100, conversion-rate, and lift orderings hold across the\nintended prevalence axis (intro > intermediate > advanced).\n\n## Intended uses\n\n- Teaching baseline lead-scoring on a flat snapshot.\n- Teaching relational feature engineering against snapshot-safe tables.\n- Teaching leakage detection (the `total_touches_all` trap is\n designed to be discoverable).\n- Teaching calibration, lift, P@K, value-aware ranking\n (`expected_acv × P(convert)`), and cohort-shift evaluation.\n- Comparing model families under a controlled DGP.\n\n## Out-of-scope uses\n\n- **Production lead scoring.** The company, product, and customers are\n fictional.\n- **Vendor benchmarking / paper baselines.** Difficulty tiers are\n calibrated for pedagogy, not cross-paper comparability.\n- **Causal-inference research that requires recovery of the true DGP.**\n The instructor companion exposes the hidden graph for teaching, not\n designed counterfactuals.\n- **Demographic / fairness research.** v1 does not model protected\n attributes.\n\n## Known limitations\n\n- **Tiers are a prevalence / noise axis, not a modelling-complexity\n axis.** LR AUC is ~0.88 in every tier; the three tiers differ in\n conversion rate (43% / 22% / 8%), noise scale, and missingness —\n not in rank discrimination. Use AP, P@K, and calibration metrics\n to see the difficulty gradient; AUC alone will not show it.\n- **93% account and contact overlap across train / test splits.** Random\n splits are keyed on lead ID; most test accounts and contacts also\n appear in train. Headline metrics over-state generalisation to unseen\n accounts and contacts. Use `GroupKFold(account_id)` for a faithful\n estimate.\n- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta\n is slightly negative in every tier (intro −0.0045, intermediate\n −0.0072, advanced −0.0133); v1's snapshot is dominated by linear\n features. v2 will inject non-linear interactions in the simulator.\n- **Channel signal is weak.** Per\n [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),\n out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across\n all tiers and the per-channel rate spread is ≤0.05. The simulator\n does not encode channel-conditional probabilities; channel-conditional\n encoding is post-v1 work.\n- **Cohort-shift degradation is small.** v1 has no time-of-year drift\n baked in; the cohort-shift gate (G6.4) is informational and will\n bite in v2.\n- **Advanced-tier noise can produce artifact zeros in count and duration\n columns.** Gaussian noise is applied before MCAR missingness; the\n snapshot builder clamps results below zero to zero. What users observe\n is therefore not negative values but zeros that may be noise artifacts\n rather than true zero values — e.g. `days_since_last_touch = 0` might\n mean \"noised below zero, clamped\" rather than \"touched today\". Treat\n suspicious zero clusters in the Advanced tier as intentional\n data-cleaning exercise material.\n\n## Composition\n\n- **Entities.** Accounts, contacts, leads, touches, sessions,\n sales_activities, opportunities (public); plus customers and\n subscriptions (instructor only). Per-row counts per bundle live in\n `manifest.json`.\n- **Features.** 31 public columns grouped by analytical role in\n [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);\n the per-bundle `feature_dictionary.csv` is the authoritative\n machine-readable spec.\n- **Label.** `converted_within_90_days` (boolean), event-derived from\n the simulator. Never sampled directly.\n- **Splits.** 70/15/15 train/valid/test, deterministic given seed;\n recorded in `tasks/converted_within_90_days/task_manifest.json`.\n Splits are keyed on `lead_id`; see the *Evaluation note* above for\n the account-overlap caveat.\n- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package\n version stamped in `manifest.json`.\n\n## Maintenance, adversarial framing, license\n\nWe *want* the dataset to be broken. The\n[break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues\nnine adversarial patterns to look for (leakage, split\ncontamination, ranking inversions, calibration drift) with\nworked-example pointers back into the notebooks. Issue\ntemplates ship under `.github/ISSUE_TEMPLATE/`: a\n[breakage report](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml)\nform for findings on the bundle itself, and a\n[realism feedback](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml)\nform for distributional critiques. Accepted findings are\nlogged in\n[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md).\nFile issues at\n[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);\nPRs welcome.\n\n| Field | Value |\n|---|---|\n| Generator | leadforge `1.0.0+` |\n| Recipe | `b2b_saas_procurement_v1` |\n| Canonical seed | 42 (cross-seed sweep: 42–46) |\n| Bundle schema version | 5 |\n| Format | Parquet (canonical) + CSV (convenience) |\n| License | MIT — see [LICENSE](LICENSE) |\n\nVerify integrity with `leadforge validate `; every file\nis hashed in `manifest.json`.\n", + "collaborators": [], + "description": "# LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`)\n\nA relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Created by\n[Shay Palachy Affek](https://www.shaypalachy.com/) and generated by\n[leadforge](https://github.com/leadforge-dev/leadforge), an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at `1.x`; the dataset is published under the explicit `…-v1`\ntag.\n\n## Why lead scoring matters in 2024–2026\n\nMid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising (median public-SaaS growth 30%→25%\nfrom 2023 to 2025; New CAC Ratio rose materially in 2024), so\npredicting *which* leads convert within a fixed window has moved from\na marketing nicety to a survival skill. This dataset teaches that\nskill on a relational substrate, with the realistic confusions\n(snapshot-window discipline, leakage traps, channel signal weaker than\nvendor blogs imply) that students will hit when they finally get hands\non real CRM data.\n\n## What's inside\n\n```\n.\n├── intro/ intermediate/ advanced/ # student_public bundles, one per difficulty tier\n│ ├── manifest.json # provenance + file hashes\n│ ├── metrics.json # per-tier headline metrics (medians + spreads)\n│ ├── dataset_card.md # auto-rendered per-bundle card\n│ ├── feature_dictionary.csv # authoritative column spec\n│ ├── lead_scoring.csv # flat convenience CSV (all splits)\n│ ├── tables/*.parquet # 7 snapshot-safe relational tables\n│ └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── docs/ # vendored DGP / leakage / break-me docs (agent-readable)\n├── metrics.json # top-level cross-tier metrics summary\n├── claims_register.{md,json} # claims → backing-artifact map (agent-readable)\n├── dataset-metadata.json # Kaggle dataset metadata\n├── dataset-cover-image.png # Kaggle cover image\n├── README.md # Kaggle package README\n└── LICENSE\n```\n\n`student_public` bundles ship the snapshot-safe relational view;\n`research_instructor` companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder `metadata/`. The full layout is documented in each bundle's\n`manifest.json`.\n\n### Agent-reviewable artifacts\n\nThe published bundle is self-contained for AI review and offline\nauditing — every numeric / structural claim on this page can be\nverified without following an external link:\n\n- **`metrics.json` (root) + `/metrics.json`** — deterministic\n JSON view of the headline LR AUC / AP / P@100 / Brier / conversion\n rate / cohort-shift / cross-tier-ordering medians, with JSON-path\n back-references to `validation/validation_report.json` (the\n source of truth).\n- **`claims_register.{md,json}`** — every numerical or structural\n claim on this page paired with the artifact and path that backs it.\n Rendered from `claims_register_source.yaml` by\n `scripts/build_claims_register.py`.\n- **`docs/`** — vendored copies of `generation_method.md`,\n `channel_signal_audit.md`, `break_me_guide.md`,\n `feature_dictionary.md`, `v1_acceptance_gates_bands.yaml`,\n `v2_decision_log.md`, plus a hand-authored\n `relational_table_schemas.csv` documenting every column of every\n relational table. These match the GitHub-blob links cited below but\n ship inside the bundle so a reviewer never needs network access.\n- **`/manifest.json`** — SHA-256 hash for every file plus the\n full redaction contract (`structural_redactions.columns`,\n `omitted_tables`, `relational_snapshot_safe`, `snapshot_day`).\n- Kaggle / HuggingFace preview pages additionally inject a\n `schema.org/Dataset` JSON-LD block in their `` for agent\n ingestion without HTML parsing.\n\n## Quick start\n\n```python\n# Flat CSV\ndf = pd.read_csv(\"intermediate/lead_scoring.csv\")\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/train.parquet\")\ntest = pd.read_parquet(\"intermediate/tasks/converted_within_90_days/test.parquet\")\n\n# Relational tables (feature engineering — example)\nleads = pd.read_parquet(\"intermediate/tables/leads.parquet\")\ntouches = pd.read_parquet(\"intermediate/tables/touches.parquet\")\nmy_touch_count = (\n touches.groupby(\"lead_id\").size().rename(\"my_touch_count\").reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=\"lead_id\", how=\"left\")\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n# --mode student_public --difficulty intermediate --out my_bundle\n```\n\nThe label `converted_within_90_days` resolves over a 90-day window;\nengagement features (`touch_count`, `session_count`, etc.) are\ncomputed strictly over events on days `[0, 30]`. The deliberate\nexception is `total_touches_all`, the leakage trap — flagged\n`leakage_risk=True` in `feature_dictionary.csv`. Drop it from your\nfeature set unless you're demonstrating leakage detection.\n\n## Evaluation note — account and contact overlap\n\n**518 of 557 test accounts (≈93 %) appear in train** on the intermediate\nbundle; the other tiers are similar. Contact-level overlap is comparable\nin magnitude: most test contacts also have activity in the training set.\nThe random-split headline metrics therefore ride both account-level and\ncontact-level signal across the split boundary and over-estimate\ngeneralisation to unseen accounts and contacts. For a faithful\nout-of-sample number, retrain with `GroupKFold(account_id)` and report\nboth metrics. Notebook 02 demonstrates the detection recipe;\n[`break_me_guide.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) §5 gives\nthe worked example.\n\n## Dataset summary\n\n**Tiers are prevalence and noise axes, not modelling-complexity axes.**\nLR AUC is ~0.88 in every tier by design. The tiers differ in conversion\nrate, missingness, and noise — not rank discrimination. Choose a tier\nbased on the teaching exercise, not on expected AUC:\n\n| | Intro | Intermediate | Advanced |\n|---|---|---|---|\n| **Tier purpose** | High-prevalence warm-up | Default benchmark | Low-prevalence · calibration · noise exercise |\n| Leads | 5,000 | 5,000 | 5,000 |\n| Accounts | 1,500 | 1,500 | 1,500 |\n| Contacts | 4,200 | 4,200 | 4,200 |\n| Snapshot columns | 31 / 34* | 31 / 34* | 31 / 34* |\n| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |\n| Conversion rate (acceptance band, gate G7.\\*) | 24–61% | 12–31% | 4–12% |\n| Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |\n| Signal strength | 0.90 | 0.70 | 0.50 |\n| Noise scale | 0.10 | 0.30 | 0.55 |\n| Missing rate | 2% | 8% | 18% |\n\n\\* `student_public` / `research_instructor`. Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping. The acceptance band is the recipe\ngate's tolerance window (`v1_acceptance_gates_bands.yaml` G7.\\*),\nnot the achievable range — observed five-seed spreads sit\ncomfortably inside the band.\n\n## The scenario\n\n**Veridian Technologies** is a fictional Series B startup (Austin, US)\nselling **Veridian Procure**, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). **Task:** predict whether\na lead converts (`closed_won`) within 90 days. ACV bands are\n$18k–$120k. See\n[`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md)\nfor the full DGP, and the deeper \"what's modelled / approximate / not\nmodelled\" breakdown that this README only summarises.\n\n## Public vs instructor: what's redacted\n\nFiltering happens **during rendering**, not during simulation. The\nredaction contract is single-sourced in\n[`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py);\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.\n\n| Source-of-truth constant | Public bundle treatment |\n|---|---|\n| `BANNED_LEAD_COLUMNS = (\"converted_within_90_days\", \"conversion_timestamp\")` | Dropped from `tables/leads.parquet` |\n| `BANNED_OPP_COLUMNS = (\"close_outcome\", \"closed_at\")` | Dropped from `tables/opportunities.parquet` |\n| `BANNED_TABLES = (\"customers\", \"subscriptions\")` | Omitted from public bundles |\n| `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` |\n| Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` |\n| `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` |\n\nEach bundle's `manifest.json` records `relational_snapshot_safe`,\n`redacted_columns`, and `snapshot_day`, so the bundle is\nself-describing.\n\n## Calibration\n\nEvery realism / calibration / difficulty claim in this README is\nbacked by\n[`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md),\nregenerated by\n[`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py)\nwith bands declared in\n[`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml).\nHeadline cross-seed medians (seeds 42–46):\n\n| Tier | LR AUC | AP | P@100 | Brier | `calibration_max_bin_error` |\n|---|---|---|---|---|---|\n| intro | 0.879 | 0.761 | 0.80 | 0.130 | 0.25 |\n| intermediate | 0.886 | 0.575 | 0.59 | 0.110 | 0.25 |\n| advanced | 0.886 | 0.351 | 0.34 | 0.061 | **0.52** |\n\n**Reading this table:** LR AUC is flat across tiers by design — the\ntiers are a prevalence / noise axis, not a rank-discrimination axis.\nBrier score *improves* as prevalence falls (a prevalence effect, not\nbetter calibration); use `calibration_max_bin_error` to assess\ncalibration quality. Advanced's 0.52 max-bin error means the model's\npredicted probabilities are materially mis-scaled against actual\nconversion rates — a realistic miscalibration exercise.\n\nAP, P@100, conversion-rate, and lift orderings hold across the\nintended prevalence axis (intro > intermediate > advanced).\n\n## Intended uses\n\n- Teaching baseline lead-scoring on a flat snapshot.\n- Teaching relational feature engineering against snapshot-safe tables.\n- Teaching leakage detection (the `total_touches_all` trap is\n designed to be discoverable).\n- Teaching calibration, lift, P@K, value-aware ranking\n (`expected_acv × P(convert)`), and cohort-shift evaluation.\n- Comparing model families under a controlled DGP.\n\n## Out-of-scope uses\n\n- **Production lead scoring.** The company, product, and customers are\n fictional.\n- **Vendor benchmarking / paper baselines.** Difficulty tiers are\n calibrated for pedagogy, not cross-paper comparability.\n- **Causal-inference research that requires recovery of the true DGP.**\n The instructor companion exposes the hidden graph for teaching, not\n designed counterfactuals.\n- **Demographic / fairness research.** v1 does not model protected\n attributes.\n\n## Known limitations\n\n- **Tiers are a prevalence / noise axis, not a modelling-complexity\n axis.** LR AUC is ~0.88 in every tier; the three tiers differ in\n conversion rate (43% / 22% / 8%), noise scale, and missingness —\n not in rank discrimination. Use AP, P@K, and calibration metrics\n to see the difficulty gradient; AUC alone will not show it.\n- **93% account and contact overlap across train / test splits.** Random\n splits are keyed on lead ID; most test accounts and contacts also\n appear in train. Headline metrics over-state generalisation to unseen\n accounts and contacts. Use `GroupKFold(account_id)` for a faithful\n estimate.\n- **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta\n is slightly negative in every tier (intro −0.0045, intermediate\n −0.0072, advanced −0.0133); v1's snapshot is dominated by linear\n features. v2 will inject non-linear interactions in the simulator.\n- **Channel signal is weak.** Per\n [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md),\n out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across\n all tiers and the per-channel rate spread is ≤0.05. The simulator\n does not encode channel-conditional probabilities; channel-conditional\n encoding is post-v1 work.\n- **Cohort-shift degradation is small.** v1 has no time-of-year drift\n baked in; the cohort-shift gate (G6.4) is informational and will\n bite in v2.\n- **Advanced-tier noise can produce artifact zeros in count and duration\n columns.** Gaussian noise is applied before MCAR missingness; the\n snapshot builder clamps results below zero to zero. What users observe\n is therefore not negative values but zeros that may be noise artifacts\n rather than true zero values — e.g. `days_since_last_touch = 0` might\n mean \"noised below zero, clamped\" rather than \"touched today\". Treat\n suspicious zero clusters in the Advanced tier as intentional\n data-cleaning exercise material.\n\n## Composition\n\n- **Entities.** Accounts, contacts, leads, touches, sessions,\n sales_activities, opportunities (public); plus customers and\n subscriptions (instructor only). Per-row counts per bundle live in\n `manifest.json`.\n- **Features.** 31 public columns grouped by analytical role in\n [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md);\n the per-bundle `feature_dictionary.csv` is the authoritative\n machine-readable spec.\n- **Label.** `converted_within_90_days` (boolean), event-derived from\n the simulator. Never sampled directly.\n- **Splits.** 70/15/15 train/valid/test, deterministic given seed;\n recorded in `tasks/converted_within_90_days/task_manifest.json`.\n Splits are keyed on `lead_id`; see the *Evaluation note* above for\n the account-overlap caveat.\n- **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package\n version stamped in `manifest.json`.\n\n## Maintenance, adversarial framing, license\n\nWe *want* the dataset to be broken. The\n[break-me guide](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md) catalogues\nnine adversarial patterns to look for (leakage, split\ncontamination, ranking inversions, calibration drift) with\nworked-example pointers back into the notebooks. Issue\ntemplates ship under `.github/ISSUE_TEMPLATE/`: a\n[breakage report](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml)\nform for findings on the bundle itself, and a\n[realism feedback](https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml)\nform for distributional critiques. Accepted findings are\nlogged in\n[`docs/release/v2_decision_log.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md).\nFile issues at\n[leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge);\nPRs welcome.\n\n| Field | Value |\n|---|---|\n| Generator | leadforge `1.0.0+` |\n| Recipe | `b2b_saas_procurement_v1` |\n| Canonical seed | 42 (cross-seed sweep: 42–46) |\n| Bundle schema version | 5 |\n| Format | Parquet (canonical) + CSV (convenience) |\n| License | MIT — see [LICENSE](LICENSE) |\n\nVerify integrity with `leadforge validate `; every file\nis hashed in `manifest.json`.\n\n## Credits\n\nCreated by [Shay Palachy Affek](https://www.shaypalachy.com/).\nDataset generated with [leadforge](https://github.com/leadforge-dev/leadforge) (MIT).\nProfiles: [HuggingFace](https://huggingface.co/shaypal5) · [Kaggle](https://www.kaggle.com/derelictpanda) · [GitHub](https://github.com/shaypalachy)\n", "expectedUpdateFrequency": "never", - "id": "leadforge/leadforge-lead-scoring-v1", + "id": "derelictpanda/leadforge-lead-scoring-v1", "image": "dataset-cover-image.png", "isPrivate": false, "keywords": [ diff --git a/scripts/package_kaggle_release.py b/scripts/package_kaggle_release.py index 792c59f..520dcec 100644 --- a/scripts/package_kaggle_release.py +++ b/scripts/package_kaggle_release.py @@ -119,7 +119,7 @@ # Release-specific defaults (G1.2 dataset slug + G11 metadata content) # --------------------------------------------------------------------------- -DEFAULT_USER_SLUG: Final[str] = "leadforge" +DEFAULT_USER_SLUG: Final[str] = "derelictpanda" DEFAULT_DATASET_SLUG: Final[str] = "leadforge-lead-scoring-v1" DEFAULT_TITLE: Final[str] = "LeadForge: Synthetic B2B Lead Scoring (v1)"