feat(scripts): Kaggle release packager + cover image #70
Closed
Commits
4 commits
- b3a2966 feat(scripts): package Kaggle release artifacts (shaypal5)
- ca75729 fix(scripts): keep Kaggle metadata schema strict (shaypal5)
- ea2e076 fix(tests): use committed Kaggle fixtures in CI (shaypal5)
- f2e4f9a fix(scripts): harden Kaggle package review points (shaypal5)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| MIT License | ||
|
|
||
| Copyright (c) 2026 leadforge-dev | ||
|
|
||
| Permission is hereby granted, free of charge, to any person obtaining a copy | ||
| of this software and associated documentation files (the "Software"), to deal | ||
| in the Software without restriction, including without limitation the rights | ||
| to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
| copies of the Software, and to permit persons to whom the Software is | ||
| furnished to do so, subject to the following conditions: | ||
|
|
||
| The above copyright notice and this permission notice shall be included in all | ||
| copies or substantial portions of the Software. | ||
|
|
||
| THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
| IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
| FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
| AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
| LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
| OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
| SOFTWARE. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,233 @@ | ||
| # LeadForge: Synthetic B2B Lead Scoring Dataset (`leadforge-lead-scoring-v1`) | ||
|
|
||
| A relational, reproducible, three-tier synthetic CRM dataset family for | ||
| teaching lead scoring at scale. Generated by | ||
| [leadforge](https://github.com/leadforge-dev/leadforge), an | ||
| open-source Python framework for synthetic CRM/funnel data. The | ||
| framework version is decoupled from the dataset version: the package | ||
| stays at `1.x`; the dataset is published under the explicit `…-v1` | ||
| tag. | ||
|
|
||
| ## Why lead scoring matters in 2024–2026 | ||
|
|
||
| Mid-market SaaS vendors entered 2024–2026 with growth slowing and | ||
| customer-acquisition costs rising[^macro], so predicting *which* leads | ||
| convert within a fixed window has moved from a marketing nicety to a | ||
| survival skill. This dataset teaches that skill on a relational | ||
| substrate, with the realistic confusions (snapshot-window discipline, | ||
| leakage traps, channel signal weaker than vendor blogs imply) that | ||
| students will hit when they finally get hands on real CRM data. | ||
|
|
||
| [^macro]: Macroeconomic framing summarised in | ||
| [`docs/external_review/summaries/gemini_v2_summary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md) | ||
| (median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio | ||
| rose materially in 2024). | ||
|
|
||
| ## What's inside | ||
|
|
||
| ``` | ||
| . | ||
| ├── intro/ intermediate/ advanced/ # student_public bundles, one per difficulty tier | ||
| │ ├── manifest.json # provenance + file hashes | ||
| │ ├── dataset_card.md # auto-rendered per-bundle card | ||
| │ ├── feature_dictionary.csv # authoritative column spec | ||
| │ ├── lead_scoring.csv # flat convenience CSV (all splits) | ||
| │ ├── tables/*.parquet # 7 snapshot-safe relational tables | ||
| │ └── tasks/converted_within_90_days/{train,valid,test}.parquet | ||
| ├── dataset-metadata.json # Kaggle dataset metadata | ||
| ├── dataset-cover-image.png # Kaggle cover image | ||
| ├── README.md # Kaggle package README | ||
| └── LICENSE | ||
| ``` | ||
|
|
||
| `student_public` bundles ship the snapshot-safe relational view; | ||
| `research_instructor` companions ship the full-horizon view plus the | ||
| hidden causal structure (DAG, latent registry, mechanism summary) | ||
| under `metadata/`. The full layout is documented in each bundle's | ||
| `manifest.json`. | ||
|
|
||
| ## Quick start | ||
|
|
||
| ```python | ||
| import pandas as pd | ||
| # Flat CSV | ||
| df = pd.read_csv("intermediate/lead_scoring.csv") | ||
|
|
||
| # Parquet task splits (recommended) | ||
| train = pd.read_parquet("intermediate/tasks/converted_within_90_days/train.parquet") | ||
| test = pd.read_parquet("intermediate/tasks/converted_within_90_days/test.parquet") | ||
|
|
||
| # Relational tables (feature engineering — example) | ||
| leads = pd.read_parquet("intermediate/tables/leads.parquet") | ||
| touches = pd.read_parquet("intermediate/tables/touches.parquet") | ||
| my_touch_count = ( | ||
| touches.groupby("lead_id").size().rename("my_touch_count").reset_index() | ||
| ) | ||
| features = leads.merge(my_touch_count, on="lead_id", how="left") | ||
|
|
||
| # Reproduce from source | ||
| # pip install leadforge | ||
| # leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \ | ||
| # --mode student_public --difficulty intermediate --out my_bundle | ||
| ``` | ||
|
|
||
| The label `converted_within_90_days` resolves over a 90-day window; | ||
| engagement features (`touch_count`, `session_count`, etc.) are | ||
| computed strictly over events on days `[0, 30]`. The deliberate | ||
| exception is `total_touches_all`, the leakage trap — flagged | ||
| `leakage_risk=True` in `feature_dictionary.csv`. Drop it from your | ||
| feature set unless you're demonstrating leakage detection. | ||
|
|
||
| ## Dataset summary | ||
|
|
||
| | | Intro | Intermediate | Advanced | | ||
| |---|---|---|---| | ||
| | Leads | 5,000 | 5,000 | 5,000 | | ||
| | Accounts | 1,500 | 1,500 | 1,500 | | ||
| | Contacts | 4,200 | 4,200 | 4,200 | | ||
| | Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* | | ||
| | Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` | | ||
| | Conversion rate (recipe band) | 24–61% | 12–31% | 4–12% | | ||
| | Conversion rate (median, seeds 42–46) | 42.67% | 21.60% | 8.40% | | ||
| | Signal strength | 0.90 | 0.70 | 0.50 | | ||
| | Noise scale | 0.10 | 0.30 | 0.55 | | ||
| | Missing rate | 2% | 8% | 18% | | ||
|
|
||
| \* `student_public` / `research_instructor`. Difficulty is modulated | ||
| by the simulation engine — signal strength on latent-trait weights, | ||
| Gaussian noise on float features, MCAR missingness, outlier rate — | ||
| not post-hoc label flipping. | ||
|
|
||
| ## The scenario | ||
|
|
||
| **Veridian Technologies** is a fictional Series B startup (Austin, US) | ||
| selling **Veridian Procure**, a procurement / AP automation SaaS, to | ||
| mid-market firms (200–2,000 employees) in the US and UK. The funnel | ||
| runs through inbound marketing (45%), SDR outbound (35%), and | ||
| partner referrals (20%); four personas drive deals (VP Finance, AP | ||
| Manager, IT Director, Procurement Manager). **Task:** predict whether | ||
| a lead converts (`closed_won`) within 90 days. ACV bands are | ||
| $18k–$120k. See | ||
| [`docs/release/generation_method.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md) | ||
| for the full DGP, and the deeper "what's modelled / approximate / not | ||
| modelled" breakdown that this README only summarises. | ||
|
|
||
| ## Public vs instructor: what's redacted | ||
|
|
||
| Filtering happens **during rendering**, not during simulation. The | ||
| redaction contract is single-sourced in | ||
| [`leadforge/validation/leakage_probes.py`](https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py); | ||
| the snapshot-safe writer and the validator import the same constants, | ||
| so they cannot drift apart. | ||
|
|
||
| | Source-of-truth constant | Public bundle treatment | | ||
| |---|---| | ||
| | `BANNED_LEAD_COLUMNS = ("converted_within_90_days", "conversion_timestamp")` | Dropped from `tables/leads.parquet` | | ||
| | `BANNED_OPP_COLUMNS = ("close_outcome", "closed_at")` | Dropped from `tables/opportunities.parquet` | | ||
| | `BANNED_TABLES = ("customers", "subscriptions")` | Omitted from public bundles | | ||
| | `SNAPSHOT_FILTERED_TABLES` (touches, sessions, sales_activities, opportunities) | Filtered per-lead by `lead_created_at + snapshot_day` | | ||
| | Snapshot redaction (`current_stage`, `is_sql`) | Stripped from `tasks/` splits and `tables/leads.parquet` | | ||
| | `total_touches_all` (deliberate trap) | **Retained in both modes**; flagged `leakage_risk=True` | | ||
|
|
||
| Each bundle's `manifest.json` records `relational_snapshot_safe`, | ||
| `redacted_columns`, and `snapshot_day`, so the bundle is | ||
| self-describing. | ||
|
|
||
| ## Calibration | ||
|
|
||
| Every realism / calibration / difficulty claim in this README is | ||
| backed by | ||
| [`validation/validation_report.md`](https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md), | ||
| regenerated by | ||
| [`scripts/validate_release_candidate.py`](https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py) | ||
| with bands declared in | ||
| [`docs/release/v1_acceptance_gates_bands.yaml`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml). | ||
| Headline cross-seed medians (seeds 42–46): | ||
|
|
||
| | Tier | LR AUC | AP | P@100 | Brier | | ||
| |---|---|---|---|---| | ||
| | intro | 0.879 | 0.761 | 0.80 | 0.130 | | ||
| | intermediate | 0.886 | 0.575 | 0.59 | 0.110 | | ||
| | advanced | 0.886 | 0.351 | 0.34 | 0.061 | | ||
|
|
||
| AP, P@100, conversion-rate, and lift orderings hold across the | ||
| intended difficulty axis (intro > intermediate > advanced). | ||
|
|
||
| ## Intended uses | ||
|
|
||
| - Teaching baseline lead-scoring on a flat snapshot. | ||
| - Teaching relational feature engineering against snapshot-safe tables. | ||
| - Teaching leakage detection (the `total_touches_all` trap is | ||
| designed to be discoverable). | ||
| - Teaching calibration, lift, P@K, value-aware ranking | ||
| (`expected_acv × P(convert)`), and cohort-shift evaluation. | ||
| - Comparing model families under a controlled DGP. | ||
|
|
||
| ## Out-of-scope uses | ||
|
|
||
| - **Production lead scoring.** The company, product, and customers are | ||
| fictional. | ||
| - **Vendor benchmarking / paper baselines.** Difficulty tiers are | ||
| calibrated for pedagogy, not cross-paper comparability. | ||
| - **Causal-inference research that requires recovery of the true DGP.** | ||
| The instructor companion exposes the hidden graph for teaching; it does not | ||
| provide designed counterfactuals. | ||
| - **Demographic / fairness research.** v1 does not model protected | ||
| attributes. | ||
|
|
||
| ## Known limitations | ||
|
|
||
| - **Difficulty signal on raw AUC is flat.** LR AUC is ~0.88 across | ||
| every tier. Difficulty is visible in AP, P@K, Brier, and value | ||
| capture. Treat AUC as a sanity check, not a difficulty signal. | ||
| - **GBM does not consistently beat LR (gate G7.4.4).** GBM−LR AUC delta | ||
| is slightly negative in every tier (intro −0.0045, intermediate | ||
| −0.0072, advanced −0.0133); v1's snapshot is dominated by linear | ||
| features. v2 will inject non-linear interactions in the simulator. | ||
| - **Channel signal is weak.** Per | ||
| [`docs/release/channel_signal_audit.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md), | ||
| out-of-sample univariate AUC of `lead_source` is ≈0.50–0.52 across | ||
| all tiers and the per-channel rate spread is ≤0.05. The simulator | ||
| does not encode channel-conditional probabilities; channel-conditional | ||
| encoding is post-v1 work. | ||
| - **Cohort-shift degradation is small.** v1 has no time-of-year drift | ||
| baked in; the cohort-shift gate (G6.4) is informational and will | ||
| bite in v2. | ||
|
|
||
| ## Composition | ||
|
|
||
| - **Entities.** Accounts, contacts, leads, touches, sessions, | ||
| sales_activities, opportunities (public); plus customers and | ||
| subscriptions (instructor only). Per-table row counts for each bundle live in | ||
| `manifest.json`. | ||
| - **Features.** 32 public columns grouped by analytical role in | ||
| [`docs/release/feature_dictionary.md`](https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md); | ||
| the per-bundle `feature_dictionary.csv` is the authoritative | ||
| machine-readable spec. | ||
| - **Label.** `converted_within_90_days` (boolean), event-derived from | ||
| the simulator. Never sampled directly. | ||
| - **Splits.** 70/15/15 train/valid/test, deterministic given seed; | ||
| recorded in `tasks/converted_within_90_days/task_manifest.json`. | ||
| - **Provenance.** Recipe `b2b_saas_procurement_v1`, seed 42, package | ||
| version stamped in `manifest.json`. | ||
|
|
||
| ## Maintenance, adversarial framing, license | ||
|
|
||
| We *want* you to try to break the dataset. Issue templates ship under | ||
| `.github/ISSUE_TEMPLATE/` (Phase 6); the break-me guide lands as | ||
| `docs/release/break_me_guide.md` (PR 6.3). Once Phase 6 ships, | ||
| `docs/release/v2_decision_log.md` will track every accepted finding | ||
| and the design call that came from it. File issues at | ||
| [leadforge-dev/leadforge](https://github.com/leadforge-dev/leadforge); | ||
| PRs welcome. | ||
|
|
||
| | Field | Value | | ||
| |---|---| | ||
| | Generator | leadforge `1.0.0+` | | ||
| | Recipe | `b2b_saas_procurement_v1` | | ||
| | Canonical seed | 42 (cross-seed sweep: 42–46) | | ||
| | Bundle schema version | 5 | | ||
| | Format | Parquet (canonical) + CSV (convenience) | | ||
| | License | MIT — see [LICENSE](LICENSE) | | ||
|
|
||
| Verify integrity with `leadforge validate <bundle_dir>`; every file | ||
| is hashed in `manifest.json`. | ||
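The README states that every file is hashed in `manifest.json` and that `leadforge validate <bundle_dir>` is the supported integrity check. For readers who want to see what such a check might involve, here is a minimal standalone sketch. It rests on two assumptions that the README does not confirm: that the manifest holds a `{relative_path: digest}` mapping under a key called `files`, and that the digests are SHA-256 hex strings. Both the key name and the helper names below are hypothetical; inspect a real `manifest.json` before adapting this.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_hash_mismatches(bundle_dir: str, hashes_key: str = "files") -> list[str]:
    """List relative paths that are missing or whose digest differs.

    ASSUMPTION: manifest[hashes_key] maps relative paths to SHA-256 hex
    digests. The real leadforge manifest layout may differ.
    """
    root = Path(bundle_dir)
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
    mismatches = []
    for rel_path, expected in manifest[hashes_key].items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(rel_path)
    return sorted(mismatches)
```

An empty list from `find_hash_mismatches("intermediate")` would mean every listed file is present and unchanged under these assumptions. The `leadforge validate` command remains the authoritative check.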
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,74 @@ | ||
| # leadforge dataset card | ||
|
|
||
| | Field | Value | | ||
| |---|---| | ||
| | Recipe | `b2b_saas_procurement_v1` | | ||
| | Package version | `1.0.0` | | ||
| | Seed | `42` | | ||
| | Exposure mode | `student_public` | | ||
| | Difficulty | `advanced` | | ||
| | Horizon | 90 days | | ||
| | Label window | 90 days | | ||
| | Feature snapshot window | 30 days (windowed) | | ||
|
|
||
| ## Narrative summary | ||
|
|
||
| **Vendor:** Veridian Technologies (Series B, founded 2017, Austin, US) | ||
|
|
||
| **Product:** Veridian Procure — Procurement & AP Automation. Deployment: cloud_saas. Pricing: per_seat_annual. ACV range: $18,000–$120,000. | ||
|
|
||
| **Target market:** Firms with 200–2,000 employees in the US and UK. Key industries: manufacturing, logistics, professional_services, healthcare_non_clinical. Average deal size: $42,000. Average sales cycle: 45 days. | ||
|
|
||
| **GTM motion:** inbound_marketing, sdr_outbound, partner_referral (45% inbound / 35% outbound / 20% partner). | ||
|
|
||
| **Buyer personas:** | ||
|
|
||
| - **vp_finance** (economic_buyer) — VP Finance, CFO… | ||
| - **ap_manager** (champion) — AP Manager, Accounts Payable Manager… | ||
| - **it_director** (technical_evaluator) — IT Director, CTO… | ||
| - **procurement_manager** (end_user) — Procurement Manager, Director of Procurement… | ||
|
|
||
| ## Primary task | ||
|
|
||
| **Task:** `converted_within_90_days` | ||
|
|
||
| **Label definition:** A lead is considered converted if a `closed_won` event is recorded within 90 days of the lead's snapshot anchor date. The label is event-derived, never sampled directly. All features are pre-anchor (leakage-free by construction), except the deliberately retained `total_touches_all` trap. | ||
|
|
||
| ## Table inventory | ||
|
|
||
| | Table | Rows | | ||
| |---|---:| | ||
| | accounts | 1,500 | | ||
| | contacts | 4,200 | | ||
| | leads | 5,000 | | ||
| | touches | 38,208 | | ||
| | sessions | 9,942 | | ||
| | sales_activities | 19,995 | | ||
| | opportunities | 4,004 | | ||
|
|
||
| ## Feature categories | ||
|
|
||
| | Category | Count | Examples | | ||
| |---|---:|---| | ||
| | account | 6 | account_id, industry, region | | ||
| | contact | 4 | contact_id, role_function, seniority | | ||
| | lead_meta | 4 | lead_id, lead_created_at, lead_source | | ||
| | engagement | 11 | touch_count, inbound_touch_count, outbound_touch_count | | ||
| | sales | 6 | activity_count, days_since_last_touch, opportunity_created | | ||
| target | 1 | converted_within_90_days | |
|
|
||
| **Leakage-flagged columns:** `total_touches_all`. See `feature_dictionary.csv` for details. | ||
|
|
||
| ## Suggested use cases | ||
|
|
||
| - Teaching binary classification on realistic CRM data | ||
| - Portfolio projects demonstrating end-to-end ML pipelines | ||
| - Benchmarking lead-scoring models under controlled signal/noise conditions | ||
| - Research on causal structure in funnel conversion data | ||
|
|
||
| ## Caveats | ||
|
|
||
| - This is **synthetic** data. It does not represent any real company, product, or market. | ||
| - The hidden world structure varies by motif family and stochastic rewiring; no two seeds produce the same DGP. | ||
| - The label is evaluated over the full 90-day window from lead creation; event-aggregate features (e.g. `touch_count`, `session_count`, `expected_acv`) observe only the first 30 days of that window. The deliberate exception is `total_touches_all`, which counts touches over the full 90-day horizon as a pedagogical leakage trap. | ||
| - In `student_public` mode, the latent world graph, mechanism summary, and full world spec are withheld. |
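The caveat about the 30-day feature window versus the full-horizon `total_touches_all` count can be made concrete with a small pandas sketch. It counts each lead's touches twice: once restricted to days `[0, 30]` after `lead_created_at`, and once over every touch row supplied. The timestamp column name `touch_timestamp` and the function name are assumptions made for illustration; check `feature_dictionary.csv` and the table schema for the real names. In public bundles the relational tables are described as already snapshot-filtered, so the two counts would be expected to diverge mainly on full-horizon data.

```python
import pandas as pd


def touch_counts(
    leads: pd.DataFrame,
    touches: pd.DataFrame,
    ts_col: str = "touch_timestamp",  # ASSUMPTION: hypothetical column name
    window_days: int = 30,
) -> pd.DataFrame:
    """Per-lead touch counts inside the snapshot window and over all rows."""
    merged = touches.merge(
        leads[["lead_id", "lead_created_at"]], on="lead_id", how="inner"
    )
    # Age of each touch in days, measured from lead creation.
    age_days = (merged[ts_col] - merged["lead_created_at"]).dt.total_seconds() / 86400
    in_window = (age_days >= 0) & (age_days <= window_days)
    per_lead = (
        merged.assign(in_window=in_window)
        .groupby("lead_id")["in_window"]
        .agg(windowed="sum", full_horizon="count")
        .reset_index()
    )
    # Leads without any touches get zero for both counts.
    return (
        leads[["lead_id"]]
        .merge(per_lead, on="lead_id", how="left")
        .fillna(0)
        .astype({"windowed": int, "full_horizon": int})
    )
```

A lead whose `full_horizon` count exceeds its `windowed` count has touches recorded after the snapshot, which is the kind of post-snapshot information a leakage-flagged column carries into a model.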