LeadForge: Synthetic B2B Lead Scoring (v1)
Three-tier synthetic CRM funnel for leakage-aware lead scoring
-
@@ -150,7 +150,8 @@
LeadForge: Synthetic B2B Lead Scoring (v1)
LeadForge: Synthetic B2B Lead Scoring Dataset (leadforge-lead-scoring-v1)
A relational, reproducible, three-tier synthetic CRM dataset family for -teaching lead scoring at scale. Generated by +teaching lead scoring at scale. Created by +Shay Palachy Affek and generated by leadforge, an open-source Python framework for synthetic CRM/funnel data. The framework version is decoupled from the dataset version: the package @@ -158,16 +159,14 @@
LeadForge: Synthetic B2B Lead Scoring Dataset (leadforge-lead-scoring-
tag.
Why lead scoring matters in 2024–2026
Mid-market SaaS vendors entered 2024–2026 with growth slowing and
-customer-acquisition costs rising[^macro], so predicting which leads
-convert within a fixed window has moved from a marketing nicety to a
-survival skill. This dataset teaches that skill on a relational
-substrate, with the realistic confusions (snapshot-window discipline,
-leakage traps, channel signal weaker than vendor blogs imply) that
-students will hit when they finally get hands on real CRM data.
-[^macro]: Macroeconomic framing summarised in
-docs/external_review/summaries/gemini_v2_summary.md
-(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio
-rose materially in 2024).
+customer-acquisition costs rising (median public-SaaS growth 30%→25%
+from 2023 to 2025; New CAC Ratio rose materially in 2024), so
+predicting which leads convert within a fixed window has moved from
+a marketing nicety to a survival skill. This dataset teaches that
+skill on a relational substrate, with the realistic confusions
+(snapshot-window discipline, leakage traps, channel signal weaker than
+vendor blogs imply) that students will hit when they finally get hands
+on real CRM data.
What's inside
.
├── intro/ intermediate/ advanced/ # student_public bundles, one per difficulty tier
@@ -585,6 +584,10 @@ Maintenance, adversarial framing, license
Verify integrity with leadforge validate <bundle_dir>; every file
is hashed in manifest.json.
+Credits
+Created by Shay Palachy Affek.
+Dataset generated with leadforge (MIT).
+Profiles: HuggingFace · Kaggle · GitHub
Data Files (57 total)
@@ -718,37 +721,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -757,37 +760,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -796,37 +799,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -980,37 +983,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -1019,37 +1022,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -1058,37 +1061,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -1242,37 +1245,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -1281,37 +1284,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -1320,37 +1323,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
diff --git a/release/dataset-cover-image.png b/release/dataset-cover-image.png
index 912bb43..ee27b77 100644
Binary files a/release/dataset-cover-image.png and b/release/dataset-cover-image.png differ
diff --git a/release/huggingface/README.md b/release/huggingface/README.md
index 8d54ff2..b5a8384 100644
--- a/release/huggingface/README.md
+++ b/release/huggingface/README.md
@@ -1,6 +1,8 @@
---
pretty_name: 'LeadForge: Synthetic B2B Lead Scoring (v1)'
license: mit
+authors:
+ - shaypal5
language:
- en
task_categories:
diff --git a/release/kaggle/dataset-metadata.json b/release/kaggle/dataset-metadata.json
index 70731fc..10209bb 100644
--- a/release/kaggle/dataset-metadata.json
+++ b/release/kaggle/dataset-metadata.json
@@ -23,6 +23,7 @@
"resources": [
{
"description": "Intro tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.",
+ "name": "intro_lead_scoring",
"path": "intro/lead_scoring.csv",
"schema": {
"fields": [
@@ -191,134 +192,167 @@
},
{
"description": "Intro tier feature dictionary (canonical column spec).",
+ "name": "intro_feature_dictionary",
"path": "intro/feature_dictionary.csv"
},
{
"description": "Intro tier train split for `converted_within_90_days` (3,500 rows).",
+ "name": "intro_tasks_converted_within_90_days_train",
"path": "intro/tasks/converted_within_90_days/train.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -327,130 +361,162 @@
},
{
"description": "Intro tier valid split for `converted_within_90_days` (750 rows).",
+ "name": "intro_tasks_converted_within_90_days_valid",
"path": "intro/tasks/converted_within_90_days/valid.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -459,130 +525,162 @@
},
{
"description": "Intro tier test split for `converted_within_90_days` (750 rows).",
+ "name": "intro_tasks_converted_within_90_days_test",
"path": "intro/tasks/converted_within_90_days/test.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -591,6 +689,7 @@
},
{
"description": "Intro tier `accounts` relational table (1,500 rows) — snapshot-safe.",
+ "name": "intro_tables_accounts",
"path": "intro/tables/accounts.parquet",
"schema": {
"fields": [
@@ -639,6 +738,7 @@
},
{
"description": "Intro tier `contacts` relational table (4,200 rows) — snapshot-safe.",
+ "name": "intro_tables_contacts",
"path": "intro/tables/contacts.parquet",
"schema": {
"fields": [
@@ -687,6 +787,7 @@
},
{
"description": "Intro tier `leads` relational table (5,000 rows) — snapshot-safe.",
+ "name": "intro_tables_leads",
"path": "intro/tables/leads.parquet",
"schema": {
"fields": [
@@ -730,6 +831,7 @@
},
{
"description": "Intro tier `touches` relational table (38,561 rows) — snapshot-safe.",
+ "name": "intro_tables_touches",
"path": "intro/tables/touches.parquet",
"schema": {
"fields": [
@@ -773,6 +875,7 @@
},
{
"description": "Intro tier `sessions` relational table (10,171 rows) — snapshot-safe.",
+ "name": "intro_tables_sessions",
"path": "intro/tables/sessions.parquet",
"schema": {
"fields": [
@@ -821,6 +924,7 @@
},
{
"description": "Intro tier `sales_activities` relational table (21,358 rows) — snapshot-safe.",
+ "name": "intro_tables_sales_activities",
"path": "intro/tables/sales_activities.parquet",
"schema": {
"fields": [
@@ -859,6 +963,7 @@
},
{
"description": "Intro tier `opportunities` relational table (4,426 rows) — snapshot-safe.",
+ "name": "intro_tables_opportunities",
"path": "intro/tables/opportunities.parquet",
"schema": {
"fields": [
@@ -892,18 +997,22 @@
},
{
"description": "Intro tier auto-rendered dataset card.",
+ "name": "intro_dataset_card",
"path": "intro/dataset_card.md"
},
{
"description": "Intro tier headline metrics (cross-seed medians + spreads, difficulty knobs, JSON-path back-reference to validation_report.json).",
+ "name": "intro_metrics",
"path": "intro/metrics.json"
},
{
"description": "Intro tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).",
+ "name": "intro_manifest",
"path": "intro/manifest.json"
},
{
"description": "Intermediate tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.",
+ "name": "intermediate_lead_scoring",
"path": "intermediate/lead_scoring.csv",
"schema": {
"fields": [
@@ -1072,134 +1181,167 @@
},
{
"description": "Intermediate tier feature dictionary (canonical column spec).",
+ "name": "intermediate_feature_dictionary",
"path": "intermediate/feature_dictionary.csv"
},
{
"description": "Intermediate tier train split for `converted_within_90_days` (3,500 rows).",
+ "name": "intermediate_tasks_converted_within_90_days_train",
"path": "intermediate/tasks/converted_within_90_days/train.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -1208,130 +1350,162 @@
},
{
"description": "Intermediate tier valid split for `converted_within_90_days` (750 rows).",
+ "name": "intermediate_tasks_converted_within_90_days_valid",
"path": "intermediate/tasks/converted_within_90_days/valid.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -1340,130 +1514,162 @@
},
{
"description": "Intermediate tier test split for `converted_within_90_days` (750 rows).",
+ "name": "intermediate_tasks_converted_within_90_days_test",
"path": "intermediate/tasks/converted_within_90_days/test.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -1472,6 +1678,7 @@
},
{
"description": "Intermediate tier `accounts` relational table (1,500 rows) — snapshot-safe.",
+ "name": "intermediate_tables_accounts",
"path": "intermediate/tables/accounts.parquet",
"schema": {
"fields": [
@@ -1520,6 +1727,7 @@
},
{
"description": "Intermediate tier `contacts` relational table (4,200 rows) — snapshot-safe.",
+ "name": "intermediate_tables_contacts",
"path": "intermediate/tables/contacts.parquet",
"schema": {
"fields": [
@@ -1568,6 +1776,7 @@
},
{
"description": "Intermediate tier `leads` relational table (5,000 rows) — snapshot-safe.",
+ "name": "intermediate_tables_leads",
"path": "intermediate/tables/leads.parquet",
"schema": {
"fields": [
@@ -1611,6 +1820,7 @@
},
{
"description": "Intermediate tier `touches` relational table (38,724 rows) — snapshot-safe.",
+ "name": "intermediate_tables_touches",
"path": "intermediate/tables/touches.parquet",
"schema": {
"fields": [
@@ -1654,6 +1864,7 @@
},
{
"description": "Intermediate tier `sessions` relational table (10,012 rows) — snapshot-safe.",
+ "name": "intermediate_tables_sessions",
"path": "intermediate/tables/sessions.parquet",
"schema": {
"fields": [
@@ -1702,6 +1913,7 @@
},
{
"description": "Intermediate tier `sales_activities` relational table (20,679 rows) — snapshot-safe.",
+ "name": "intermediate_tables_sales_activities",
"path": "intermediate/tables/sales_activities.parquet",
"schema": {
"fields": [
@@ -1740,6 +1952,7 @@
},
{
"description": "Intermediate tier `opportunities` relational table (4,255 rows) — snapshot-safe.",
+ "name": "intermediate_tables_opportunities",
"path": "intermediate/tables/opportunities.parquet",
"schema": {
"fields": [
@@ -1773,18 +1986,22 @@
},
{
"description": "Intermediate tier auto-rendered dataset card.",
+ "name": "intermediate_dataset_card",
"path": "intermediate/dataset_card.md"
},
{
"description": "Intermediate tier headline metrics (cross-seed medians + spreads, difficulty knobs, JSON-path back-reference to validation_report.json).",
+ "name": "intermediate_metrics",
"path": "intermediate/metrics.json"
},
{
"description": "Intermediate tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).",
+ "name": "intermediate_manifest",
"path": "intermediate/manifest.json"
},
{
"description": "Advanced tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.",
+ "name": "advanced_lead_scoring",
"path": "advanced/lead_scoring.csv",
"schema": {
"fields": [
@@ -1953,134 +2170,167 @@
},
{
"description": "Advanced tier feature dictionary (canonical column spec).",
+ "name": "advanced_feature_dictionary",
"path": "advanced/feature_dictionary.csv"
},
{
"description": "Advanced tier train split for `converted_within_90_days` (3,500 rows).",
+ "name": "advanced_tasks_converted_within_90_days_train",
"path": "advanced/tasks/converted_within_90_days/train.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -2089,130 +2339,162 @@
},
{
"description": "Advanced tier valid split for `converted_within_90_days` (750 rows).",
+ "name": "advanced_tasks_converted_within_90_days_valid",
"path": "advanced/tasks/converted_within_90_days/valid.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -2221,130 +2503,162 @@
},
{
"description": "Advanced tier test split for `converted_within_90_days` (750 rows).",
+ "name": "advanced_tasks_converted_within_90_days_test",
"path": "advanced/tasks/converted_within_90_days/test.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -2353,6 +2667,7 @@
},
{
"description": "Advanced tier `accounts` relational table (1,500 rows) — snapshot-safe.",
+ "name": "advanced_tables_accounts",
"path": "advanced/tables/accounts.parquet",
"schema": {
"fields": [
@@ -2401,6 +2716,7 @@
},
{
"description": "Advanced tier `contacts` relational table (4,200 rows) — snapshot-safe.",
+ "name": "advanced_tables_contacts",
"path": "advanced/tables/contacts.parquet",
"schema": {
"fields": [
@@ -2449,6 +2765,7 @@
},
{
"description": "Advanced tier `leads` relational table (5,000 rows) — snapshot-safe.",
+ "name": "advanced_tables_leads",
"path": "advanced/tables/leads.parquet",
"schema": {
"fields": [
@@ -2492,6 +2809,7 @@
},
{
"description": "Advanced tier `touches` relational table (38,208 rows) — snapshot-safe.",
+ "name": "advanced_tables_touches",
"path": "advanced/tables/touches.parquet",
"schema": {
"fields": [
@@ -2535,6 +2853,7 @@
},
{
"description": "Advanced tier `sessions` relational table (9,942 rows) — snapshot-safe.",
+ "name": "advanced_tables_sessions",
"path": "advanced/tables/sessions.parquet",
"schema": {
"fields": [
@@ -2583,6 +2902,7 @@
},
{
"description": "Advanced tier `sales_activities` relational table (19,995 rows) — snapshot-safe.",
+ "name": "advanced_tables_sales_activities",
"path": "advanced/tables/sales_activities.parquet",
"schema": {
"fields": [
@@ -2621,6 +2941,7 @@
},
{
"description": "Advanced tier `opportunities` relational table (4,004 rows) — snapshot-safe.",
+ "name": "advanced_tables_opportunities",
"path": "advanced/tables/opportunities.parquet",
"schema": {
"fields": [
@@ -2654,62 +2975,77 @@
},
{
"description": "Advanced tier auto-rendered dataset card.",
+ "name": "advanced_dataset_card",
"path": "advanced/dataset_card.md"
},
{
"description": "Advanced tier headline metrics (cross-seed medians + spreads, difficulty knobs, JSON-path back-reference to validation_report.json).",
+ "name": "advanced_metrics",
"path": "advanced/metrics.json"
},
{
"description": "Advanced tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).",
+ "name": "advanced_manifest",
"path": "advanced/manifest.json"
},
{
"description": "Top-level cross-tier headline metrics (medians + spreads + cohort-shift + cross-tier ordering booleans). Machine-readable summary backing the README's Calibration table.",
+ "name": "metrics",
"path": "metrics.json"
},
{
"description": "Claims register (human-readable table). Rendered from `claims_register_source.yaml`.",
+ "name": "claims_register",
"path": "claims_register.md"
},
{
"description": "Claims register (machine-readable). Each numerical / structural claim in the README paired with its backing artifact and JSON / YAML path.",
+ "name": "claims_register",
"path": "claims_register.json"
},
{
"description": "Claims-register source YAML — hand-edited; `claims_register.{md,json}` are rendered from this.",
+ "name": "claims_register_source",
"path": "claims_register_source.yaml"
},
{
"description": "Vendoring guide for the docs/ subtree — explains that these files are mirrored copies of docs/release/ in the source repo, edits go in the source, and the sync script refuses to clobber locally-edited copies.",
+ "name": "docs_readme",
"path": "docs/README.md"
},
{
"description": "Adversarial-framing guide: nine breakage patterns (leakage, split contamination, ranking inversions, calibration drift) with worked-example detection recipes.",
+ "name": "docs_break_me_guide",
"path": "docs/break_me_guide.md"
},
{
"description": "Empirical backing for the 'channel signal is weak' claim — out-of-sample univariate AUCs of `lead_source` per tier.",
+ "name": "docs_channel_signal_audit",
"path": "docs/channel_signal_audit.md"
},
{
"description": "Long-form per-feature documentation grouped by analytical role; companion to the per-tier `feature_dictionary.csv` machine-readable spec.",
+ "name": "docs_feature_dictionary",
"path": "docs/feature_dictionary.md"
},
{
"description": "Generation method (DGP description) — what is and isn't modelled by the simulator.",
+ "name": "docs_generation_method",
"path": "docs/generation_method.md"
},
{
"description": "Per-column descriptions for the 7 public relational tables (and the 2 instructor-only ones) — surfaced into the schema-section of this page.",
+ "name": "docs_relational_table_schemas",
"path": "docs/relational_table_schemas.csv"
},
{
"description": "Operational acceptance bands per gate (G5–G8); the source-of-truth thresholds the validator checks against.",
+ "name": "docs_v1_acceptance_gates_bands",
"path": "docs/v1_acceptance_gates_bands.yaml"
},
{
"description": "Accepted-for-v2 findings register — issues flagged in v1 that are scoped to the v2 release.",
+ "name": "docs_v2_decision_log",
"path": "docs/v2_decision_log.md"
}
],
diff --git a/scripts/generate_cover_image.py b/scripts/generate_cover_image.py
index 5cff167..2b0d67e 100644
--- a/scripts/generate_cover_image.py
+++ b/scripts/generate_cover_image.py
@@ -37,6 +37,7 @@
from __future__ import annotations
import argparse
+import random
import sys
from collections.abc import Sequence
from dataclasses import dataclass
@@ -52,19 +53,28 @@
CANVAS_WIDTH: Final[int] = 1280
CANVAS_HEIGHT: Final[int] = 640
-LEFT_MARGIN: Final[int] = 80
+LEFT_MARGIN: Final[int] = 62
#: Background — deep navy.
-BACKGROUND: Final[tuple[int, int, int]] = (13, 27, 42)
+BACKGROUND: Final[tuple[int, int, int]] = (13, 17, 23)
#: Card background — slightly lighter navy.
-CARD_BACKGROUND: Final[tuple[int, int, int]] = (27, 38, 59)
+CARD_BACKGROUND: Final[tuple[int, int, int]] = (22, 27, 34)
#: Primary text colour — pure white.
TEXT_PRIMARY: Final[tuple[int, int, int]] = (255, 255, 255)
#: Secondary text colour — pale steel.
-TEXT_SECONDARY: Final[tuple[int, int, int]] = (200, 220, 240)
+TEXT_SECONDARY: Final[tuple[int, int, int]] = (148, 163, 184)
+#: Dimmer text for feature line.
+TEXT_DIM: Final[tuple[int, int, int]] = (74, 85, 104)
+#: Teal accent.
+ACCENT_TEAL: Final[tuple[int, int, int]] = (0, 212, 170)
DEFAULT_OUT_PATH: Final[Path] = Path("release/dataset-cover-image.png")
+#: x-centre of the funnel visualization (right panel).
+FUNNEL_CX: Final[int] = 976
+#: y-coordinate of the top edge of the funnel (top-down).
+FUNNEL_TOP: Final[int] = 90
+
@dataclass(frozen=True)
class TierBadge:
@@ -74,6 +84,8 @@ class TierBadge:
conversion_rate_pct: str
lr_auc: float
accent: tuple[int, int, int]
+ funnel_top_width: int
+ funnel_bot_width: int
#: Cross-seed medians (seeds 42-46) from
@@ -81,11 +93,13 @@ class TierBadge:
#: cover image is reproducible without reading the report at render
#: time.
TIER_BADGES: Final[tuple[TierBadge, ...]] = (
- TierBadge("Intro", "42.7%", 0.879, (76, 175, 80)),
- TierBadge("Intermediate", "21.6%", 0.886, (255, 152, 0)),
- TierBadge("Advanced", "8.4%", 0.886, (244, 67, 54)),
+ TierBadge("Intro", "42.7%", 0.879, (0, 212, 170), 320, 238),
+ TierBadge("Intermediate", "21.6%", 0.886, (245, 158, 11), 238, 156),
+ TierBadge("Advanced", "8.4%", 0.886, (239, 68, 68), 156, 60),
)
+FUNNEL_SECTION_HEIGHT: Final[int] = 140
+
# ---------------------------------------------------------------------------
# Font loading
@@ -107,32 +121,65 @@ def _find_font(family: str, *, weight: str = "normal") -> Path:
# ---------------------------------------------------------------------------
-# Drawing
+# Drawing helpers
# ---------------------------------------------------------------------------
+def _draw_bokeh_background(img: Image.Image, *, seed: int = 42) -> Image.Image:
+ """Overlay faint translucent circles giving a soft depth effect."""
+
+ overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
+ d = ImageDraw.Draw(overlay)
+ rng = random.Random(seed) # noqa: S311
+ palette = [(0, 212, 170), (74, 158, 234), (245, 158, 11)]
+ for _ in range(160):
+ cx = rng.randint(0, CANVAS_WIDTH)
+ cy = rng.randint(0, CANVAS_HEIGHT)
+ r = rng.randint(20, 90)
+ alpha = rng.randint(3, 12)
+ col = rng.choice(palette)
+ d.ellipse((cx - r, cy - r, cx + r, cy + r), fill=(*col, alpha))
+ # Corner glow: top-left teal, bottom-right amber
+ for gx, gy, gc, ga in [
+ (110, 100, (0, 212, 170), 18),
+ (1160, 540, (245, 158, 11), 14),
+ ]:
+ gr = 280
+ d.ellipse((gx - gr, gy - gr, gx + gr, gy + gr), fill=(*gc, ga))
+ base = img.convert("RGBA")
+ return Image.alpha_composite(base, overlay).convert("RGB")
+
+
+def _draw_vertical_accent(draw: ImageDraw.ImageDraw) -> None:
+ """Teal accent bar left of the title block."""
+
+ draw.rectangle((LEFT_MARGIN, 140, LEFT_MARGIN + 5, 490), fill=ACCENT_TEAL)
+
+
def _draw_title_block(draw: ImageDraw.ImageDraw, font_paths: dict[str, Path]) -> None:
"""Render the title, tagline, and subtitle text block."""
+ tx = LEFT_MARGIN + 18
+
title_font = ImageFont.truetype(str(font_paths["bold"]), 96)
- draw.text((LEFT_MARGIN, 88), "LeadForge", font=title_font, fill=TEXT_PRIMARY)
+ draw.text((tx, 160), "LeadForge", font=title_font, fill=TEXT_PRIMARY)
- tagline_font = ImageFont.truetype(str(font_paths["regular"]), 40)
+ tagline_font = ImageFont.truetype(str(font_paths["regular"]), 38)
draw.text(
- (LEFT_MARGIN, 208),
- "Synthetic B2B Lead Scoring · v1",
- font=tagline_font,
- fill=TEXT_SECONDARY,
+ (tx, 282), "Synthetic B2B Lead Scoring · v1", font=tagline_font, fill=TEXT_SECONDARY
)
- subtitle_font = ImageFont.truetype(str(font_paths["regular"]), 24)
+ subtitle_font = ImageFont.truetype(str(font_paths["regular"]), 22)
draw.text(
- (LEFT_MARGIN, 280),
- "5,000 leads · 3 difficulty tiers · 90-day conversion · MIT",
+ (tx, 340),
+ "5,000 leads · 3 difficulty tiers · 90-day conversion · MIT",
font=subtitle_font,
- fill=TEXT_SECONDARY,
+ fill=TEXT_DIM,
)
+ # Separator line
+ draw.line((LEFT_MARGIN, 398, 640, 398), fill=(30, 45, 61), width=1)
+
def _draw_tier_card(
draw: ImageDraw.ImageDraw,
@@ -145,58 +192,136 @@ def _draw_tier_card(
left, top, right, bottom = box
draw.rectangle((left, top, right, bottom), fill=CARD_BACKGROUND)
+ # Card border
+ draw.rectangle((left, top, right, bottom), outline=badge.accent, width=1)
# Coloured accent stripe down the left edge.
- draw.rectangle((left, top, left + 8, bottom), fill=badge.accent)
+ draw.rectangle((left, top, left + 5, bottom), fill=badge.accent)
- name_font = ImageFont.truetype(str(font_paths["bold"]), 36)
- draw.text((left + 32, top + 24), badge.name, font=name_font, fill=TEXT_PRIMARY)
+ name_font = ImageFont.truetype(str(font_paths["bold"]), 22)
+ draw.text((left + 16, top + 14), badge.name, font=name_font, fill=TEXT_PRIMARY)
- body_font = ImageFont.truetype(str(font_paths["regular"]), 22)
+ body_font = ImageFont.truetype(str(font_paths["regular"]), 17)
draw.text(
- (left + 32, top + 80),
- f"Conversion: {badge.conversion_rate_pct}",
+ (left + 16, top + 50),
+ f"Conv: {badge.conversion_rate_pct}",
font=body_font,
fill=TEXT_SECONDARY,
)
draw.text(
- (left + 32, top + 116),
- f"LR AUC: {badge.lr_auc:.3f}",
- font=body_font,
- fill=TEXT_SECONDARY,
+ (left + 16, top + 76), f"LR AUC: {badge.lr_auc:.3f}", font=body_font, fill=TEXT_SECONDARY
)
+def _draw_funnel(draw: ImageDraw.ImageDraw, font_paths: dict[str, Path]) -> None:
+ """Render the three-tier conversion funnel (right panel)."""
+
+ label_font = ImageFont.truetype(str(font_paths["regular"]), 18)
+ tier_font = ImageFont.truetype(str(font_paths["bold"]), 22)
+ pct_font = ImageFont.truetype(str(font_paths["bold"]), 18)
+
+ # "Conversion by Tier" heading above funnel
+ heading = "Conversion by Tier"
+ bbox = draw.textbbox((0, 0), heading, font=label_font)
+ hw = bbox[2] - bbox[0]
+ draw.text((FUNNEL_CX - hw // 2, FUNNEL_TOP - 28), heading, font=label_font, fill=TEXT_DIM)
+
+ y = FUNNEL_TOP
+ for badge in TIER_BADGES:
+ tw = badge.funnel_top_width
+ bw = badge.funnel_bot_width
+ h = FUNNEL_SECTION_HEIGHT
+ col = badge.accent
+ bot_y = y + h
+
+ # Trapezoid
+ pts = [
+ (FUNNEL_CX - tw // 2, y),
+ (FUNNEL_CX + tw // 2, y),
+ (FUNNEL_CX + bw // 2, bot_y),
+ (FUNNEL_CX - bw // 2, bot_y),
+ ]
+ draw.polygon(pts, fill=(*col, 210))
+
+ # Subtle highlight stripe at the top edge of each section
+ hl_pts = [
+ (FUNNEL_CX - tw // 2, y),
+ (FUNNEL_CX + tw // 2, y),
+ (FUNNEL_CX + tw // 2, y + 14),
+ (FUNNEL_CX - tw // 2, y + 14),
+ ]
+ draw.polygon(hl_pts, fill=(255, 255, 255, 18))
+
+ # Tier name centred inside section
+ mid_y = y + h // 2
+ name_bbox = draw.textbbox((0, 0), badge.name, font=tier_font)
+ nw = name_bbox[2] - name_bbox[0]
+ nh = name_bbox[3] - name_bbox[1]
+ draw.text(
+ (FUNNEL_CX - nw // 2, mid_y - nh // 2), badge.name, font=tier_font, fill=TEXT_PRIMARY
+ )
+
+ # Conversion % to the right of the widest top edge
+ pct_x = FUNNEL_CX + tw // 2 + 14
+ pct_bbox = draw.textbbox((0, 0), badge.conversion_rate_pct, font=pct_font)
+ ph = pct_bbox[3] - pct_bbox[1]
+ draw.text((pct_x, mid_y - ph // 2), badge.conversion_rate_pct, font=pct_font, fill=col)
+
+ y = bot_y
+
+ # "LR AUC ≈ 0.88" caption below funnel
+ caption = "LR AUC ≈ 0.88 across all tiers"
+ cbbox = draw.textbbox((0, 0), caption, font=label_font)
+ cw = cbbox[2] - cbbox[0]
+ draw.text((FUNNEL_CX - cw // 2, y + 16), caption, font=label_font, fill=TEXT_DIM)
+
+
+def _draw_bottom_strip(draw: ImageDraw.ImageDraw) -> None:
+ """Thin teal strip at the very bottom edge."""
+
+ draw.rectangle((0, CANVAS_HEIGHT - 5, CANVAS_WIDTH, CANVAS_HEIGHT), fill=(*ACCENT_TEAL, 128))
+
+
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+
+
def render_cover(badges: Sequence[TierBadge] = TIER_BADGES) -> Image.Image:
"""Render the cover image as a fresh ``PIL.Image`` instance."""
image = Image.new("RGB", (CANVAS_WIDTH, CANVAS_HEIGHT), BACKGROUND)
- draw = ImageDraw.Draw(image)
+
+ # Bokeh background (composite via RGBA)
+ image = _draw_bokeh_background(image)
+
+ draw = ImageDraw.Draw(image, "RGBA")
font_paths = {
"regular": _find_font("DejaVu Sans", weight="normal"),
"bold": _find_font("DejaVu Sans", weight="bold"),
}
+ _draw_vertical_accent(draw)
_draw_title_block(draw, font_paths)
+ _draw_funnel(draw, font_paths)
- # Three equal-width cards spanning the bottom half of the canvas.
- card_top = 400
- card_bottom = 580
+ # Three tier cards across the bottom-left panel
+ card_top = 420
+ card_bottom = 590
card_count = len(badges)
- gap = 40
- available = CANVAS_WIDTH - 2 * LEFT_MARGIN
+ gap = 16
+ available = 640 - LEFT_MARGIN - 10
card_width = (available - gap * (card_count - 1)) // card_count
for i, badge in enumerate(badges):
left = LEFT_MARGIN + i * (card_width + gap)
right = left + card_width
_draw_tier_card(
- draw,
- badge=badge,
- box=(left, card_top, right, card_bottom),
- font_paths=font_paths,
+ draw, badge=badge, box=(left, card_top, right, card_bottom), font_paths=font_paths
)
- return image
+ _draw_bottom_strip(draw)
+
+ return image.convert("RGB")
def write_cover(path: Path, image: Image.Image | None = None) -> Path:
diff --git a/scripts/package_hf_release.py b/scripts/package_hf_release.py
index 3361492..d6fa1b6 100644
--- a/scripts/package_hf_release.py
+++ b/scripts/package_hf_release.py
@@ -395,6 +395,7 @@ class HuggingFaceCard:
tags: tuple[str, ...]
configs: tuple[ConfigEntry, ...]
body: str
+ authors: tuple[str, ...] = ()
# ---------------------------------------------------------------------------
@@ -545,6 +546,7 @@ def build_card(
default_config: str = DEFAULT_DEFAULT_CONFIG,
task: str = DEFAULT_TASK,
body: str | None = None,
+ authors: Sequence[str] = ("shaypal5",),
) -> HuggingFaceCard:
"""Assemble a ``HuggingFaceCard`` from the release tree.
@@ -595,6 +597,7 @@ def build_card(
tags=tuple(tags),
configs=configs,
body=body,
+ authors=tuple(authors),
)
@@ -635,6 +638,7 @@ def _frontmatter_dict(card: HuggingFaceCard) -> dict[str, Any]:
return {
"pretty_name": card.pretty_name,
"license": card.license,
+ **({"authors": list(card.authors)} if card.authors else {}),
"language": list(card.language),
"task_categories": list(card.task_categories),
"size_categories": list(card.size_categories),
diff --git a/scripts/package_kaggle_release.py b/scripts/package_kaggle_release.py
index 520dcec..8db7c3a 100644
--- a/scripts/package_kaggle_release.py
+++ b/scripts/package_kaggle_release.py
@@ -292,6 +292,7 @@ class Resource:
path: str
description: str
schema: ResourceSchema | None = None
+ name: str = "" # if empty, _resource_to_dict derives a slug from path
@dataclass(frozen=True)
@@ -597,6 +598,13 @@ def build_tier_resources(
column_descriptions = load_relational_column_descriptions(release_dir)
+ # Build a (table_name, col_name) → description map for task-split Parquet files.
+ # Task splits carry the same snapshot-level columns as the flat CSV, so we
+ # re-key the feature dictionary under a synthetic table name "snapshot".
+ snapshot_col_descs: dict[tuple[str, str], str] = {
+ ("snapshot", col): fd.description for col, fd in feature_dict.items()
+ }
+
resources: list[Resource] = []
resources.append(
@@ -626,7 +634,13 @@ def build_tier_resources(
Resource(
path=f"{tier}/tasks/{task}/{split}.parquet",
description=(f"{tier.capitalize()} tier {split} split for `{task}` ({rows_str})."),
- schema=ResourceSchema(fields=fields_from_parquet(split_path)),
+ schema=ResourceSchema(
+ fields=fields_from_parquet(
+ split_path,
+ column_descriptions=snapshot_col_descs,
+ table_name="snapshot",
+ )
+ ),
)
)
@@ -852,7 +866,13 @@ def _resource_to_dict(resource: Resource) -> dict[str, Any]:
per-column documentation).
"""
+ name = resource.name or re.sub(
+ r"[^a-z0-9]+",
+ "_",
+ Path(resource.path).with_suffix("").as_posix().lower(),
+ ).strip("_")
payload: dict[str, Any] = {
+ "name": name,
"path": resource.path,
"description": resource.description,
}
diff --git a/tests/scripts/test_preview_kaggle_page.py b/tests/scripts/test_preview_kaggle_page.py
index 428fd6d..4673550 100644
--- a/tests/scripts/test_preview_kaggle_page.py
+++ b/tests/scripts/test_preview_kaggle_page.py
@@ -51,7 +51,12 @@
# stopped firing. The whitelist is intentionally narrow.
_LINK_OK_PREFIXES = (
"https://github.com/leadforge-dev/leadforge",
+ "https://github.com/shaypalachy",
"https://huggingface.co/datasets/leadforge",
+ "https://huggingface.co/datasets/shaypal5",
+ "https://huggingface.co/shaypal5",
+ "https://www.kaggle.com/derelictpanda",
+ "https://www.shaypalachy.com/",
"https://example.com", # used by unit tests only
"LICENSE", # sibling-relative, resolves under the upload tree
"#", # in-document anchor (footnotes, etc.)
Why lead scoring matters in 2024–2026
Mid-market SaaS vendors entered 2024–2026 with growth slowing and -customer-acquisition costs rising[^macro], so predicting which leads -convert within a fixed window has moved from a marketing nicety to a -survival skill. This dataset teaches that skill on a relational -substrate, with the realistic confusions (snapshot-window discipline, -leakage traps, channel signal weaker than vendor blogs imply) that -students will hit when they finally get hands on real CRM data.
-[^macro]: Macroeconomic framing summarised in
-docs/external_review/summaries/gemini_v2_summary.md
-(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio
-rose materially in 2024).
What's inside
.
├── intro/ intermediate/ advanced/ # student_public bundles, one per difficulty tier
@@ -585,6 +584,10 @@ Maintenance, adversarial framing, license
Verify integrity with leadforge validate <bundle_dir>; every file
is hashed in manifest.json.
+Credits
+Created by Shay Palachy Affek.
+Dataset generated with leadforge (MIT).
+Profiles: HuggingFace · Kaggle · GitHub
Data Files (57 total)
@@ -718,37 +721,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -757,37 +760,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -796,37 +799,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -980,37 +983,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -1019,37 +1022,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -1058,37 +1061,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -1242,37 +1245,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -1281,37 +1284,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
@@ -1320,37 +1323,37 @@ Schema / Columns (522
Column Type Description
- account_idstring
- industrystring
- regionstring
- employee_bandstring
- estimated_revenue_bandstring
- process_maturity_bandstring
- contact_idstring
- role_functionstring
- senioritystring
- buyer_rolestring
- lead_idstring
- lead_created_atstring
- lead_sourcestring
- touch_countnumber
- inbound_touch_countnumber
- outbound_touch_countnumber
- session_countnumber
- pricing_page_viewsnumber
- demo_page_viewsnumber
- total_session_duration_secondsnumber
- touches_days_0_7number
- touches_last_7_daysnumber
- days_since_first_touchnumber
- activity_countnumber
- days_since_last_touchnumber
- opportunity_createdboolean
- has_open_opportunityboolean
- opportunity_estimated_acvnumber
- expected_acvnumber
- total_touches_allinteger
- converted_within_90_daysboolean
+ account_idstring Opaque account identifier.
+ industrystring Industry vertical of the buying organization.
+ regionstring Geographic region of the account's headquarters.
+ employee_bandstring Banded employee headcount of the account.
+ estimated_revenue_bandstring Banded estimated annual revenue of the account.
+ process_maturity_bandstring Banded internal process maturity score (latent).
+ contact_idstring Opaque contact identifier.
+ role_functionstring Functional area of the primary contact (e.g. finance, ops).
+ senioritystring Seniority band of the primary contact.
+ buyer_rolestring Buyer role classification (economic_buyer, champion, etc.).
+ lead_idstring Opaque lead identifier.
+ lead_created_atstring ISO-8601 timestamp when the lead was created.
+ lead_sourcestring Origination source of the lead (e.g. inbound_form, sdr_outbound).
+ touch_countnumber Total number of marketing/sales touches recorded before snapshot.
+ inbound_touch_countnumber Number of inbound touches before snapshot.
+ outbound_touch_countnumber Number of outbound touches before snapshot.
+ session_countnumber Number of web/trial sessions recorded before snapshot.
+ pricing_page_viewsnumber Cumulative pricing page views across all sessions before snapshot.
+ demo_page_viewsnumber Cumulative demo page views across all sessions before snapshot.
+ total_session_duration_secondsnumber Sum of session durations (seconds) before snapshot.
+ touches_days_0_7number Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.
+ touches_last_7_daysnumber Number of touches in the last 7 days before snapshot cutoff.
+ days_since_first_touchnumber Days between first touch and snapshot cutoff (NaN if no touches).
+ activity_countnumber Number of sales activities logged before snapshot.
+ days_since_last_touchnumber Days elapsed between most recent touch and snapshot cutoff.
+ opportunity_createdboolean Whether any opportunity was created by snapshot date (open or closed).
+ has_open_opportunityboolean Whether an open opportunity existed at snapshot date.
+ opportunity_estimated_acvnumber Estimated ACV of the most recent open opportunity (NaN if none).
+ expected_acvnumber Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).
+ total_touches_allinteger Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.
+ converted_within_90_daysboolean Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.
diff --git a/release/dataset-cover-image.png b/release/dataset-cover-image.png
index 912bb43..ee27b77 100644
Binary files a/release/dataset-cover-image.png and b/release/dataset-cover-image.png differ
diff --git a/release/huggingface/README.md b/release/huggingface/README.md
index 8d54ff2..b5a8384 100644
--- a/release/huggingface/README.md
+++ b/release/huggingface/README.md
@@ -1,6 +1,8 @@
---
pretty_name: 'LeadForge: Synthetic B2B Lead Scoring (v1)'
license: mit
+authors:
+ - shaypal5
language:
- en
task_categories:
diff --git a/release/kaggle/dataset-metadata.json b/release/kaggle/dataset-metadata.json
index 70731fc..10209bb 100644
--- a/release/kaggle/dataset-metadata.json
+++ b/release/kaggle/dataset-metadata.json
@@ -23,6 +23,7 @@
"resources": [
{
"description": "Intro tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.",
+ "name": "intro_lead_scoring",
"path": "intro/lead_scoring.csv",
"schema": {
"fields": [
@@ -191,134 +192,167 @@
},
{
"description": "Intro tier feature dictionary (canonical column spec).",
+ "name": "intro_feature_dictionary",
"path": "intro/feature_dictionary.csv"
},
{
"description": "Intro tier train split for `converted_within_90_days` (3,500 rows).",
+ "name": "intro_tasks_converted_within_90_days_train",
"path": "intro/tasks/converted_within_90_days/train.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -327,130 +361,162 @@
},
{
"description": "Intro tier valid split for `converted_within_90_days` (750 rows).",
+ "name": "intro_tasks_converted_within_90_days_valid",
"path": "intro/tasks/converted_within_90_days/valid.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -459,130 +525,162 @@
},
{
"description": "Intro tier test split for `converted_within_90_days` (750 rows).",
+ "name": "intro_tasks_converted_within_90_days_test",
"path": "intro/tasks/converted_within_90_days/test.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -591,6 +689,7 @@
},
{
"description": "Intro tier `accounts` relational table (1,500 rows) — snapshot-safe.",
+ "name": "intro_tables_accounts",
"path": "intro/tables/accounts.parquet",
"schema": {
"fields": [
@@ -639,6 +738,7 @@
},
{
"description": "Intro tier `contacts` relational table (4,200 rows) — snapshot-safe.",
+ "name": "intro_tables_contacts",
"path": "intro/tables/contacts.parquet",
"schema": {
"fields": [
@@ -687,6 +787,7 @@
},
{
"description": "Intro tier `leads` relational table (5,000 rows) — snapshot-safe.",
+ "name": "intro_tables_leads",
"path": "intro/tables/leads.parquet",
"schema": {
"fields": [
@@ -730,6 +831,7 @@
},
{
"description": "Intro tier `touches` relational table (38,561 rows) — snapshot-safe.",
+ "name": "intro_tables_touches",
"path": "intro/tables/touches.parquet",
"schema": {
"fields": [
@@ -773,6 +875,7 @@
},
{
"description": "Intro tier `sessions` relational table (10,171 rows) — snapshot-safe.",
+ "name": "intro_tables_sessions",
"path": "intro/tables/sessions.parquet",
"schema": {
"fields": [
@@ -821,6 +924,7 @@
},
{
"description": "Intro tier `sales_activities` relational table (21,358 rows) — snapshot-safe.",
+ "name": "intro_tables_sales_activities",
"path": "intro/tables/sales_activities.parquet",
"schema": {
"fields": [
@@ -859,6 +963,7 @@
},
{
"description": "Intro tier `opportunities` relational table (4,426 rows) — snapshot-safe.",
+ "name": "intro_tables_opportunities",
"path": "intro/tables/opportunities.parquet",
"schema": {
"fields": [
@@ -892,18 +997,22 @@
},
{
"description": "Intro tier auto-rendered dataset card.",
+ "name": "intro_dataset_card",
"path": "intro/dataset_card.md"
},
{
"description": "Intro tier headline metrics (cross-seed medians + spreads, difficulty knobs, JSON-path back-reference to validation_report.json).",
+ "name": "intro_metrics",
"path": "intro/metrics.json"
},
{
"description": "Intro tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).",
+ "name": "intro_manifest",
"path": "intro/manifest.json"
},
{
"description": "Intermediate tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.",
+ "name": "intermediate_lead_scoring",
"path": "intermediate/lead_scoring.csv",
"schema": {
"fields": [
@@ -1072,134 +1181,167 @@
},
{
"description": "Intermediate tier feature dictionary (canonical column spec).",
+ "name": "intermediate_feature_dictionary",
"path": "intermediate/feature_dictionary.csv"
},
{
"description": "Intermediate tier train split for `converted_within_90_days` (3,500 rows).",
+ "name": "intermediate_tasks_converted_within_90_days_train",
"path": "intermediate/tasks/converted_within_90_days/train.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -1208,130 +1350,162 @@
},
{
"description": "Intermediate tier valid split for `converted_within_90_days` (750 rows).",
+ "name": "intermediate_tasks_converted_within_90_days_valid",
"path": "intermediate/tasks/converted_within_90_days/valid.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -1340,130 +1514,162 @@
},
{
"description": "Intermediate tier test split for `converted_within_90_days` (750 rows).",
+ "name": "intermediate_tasks_converted_within_90_days_test",
"path": "intermediate/tasks/converted_within_90_days/test.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -1472,6 +1678,7 @@
},
{
"description": "Intermediate tier `accounts` relational table (1,500 rows) — snapshot-safe.",
+ "name": "intermediate_tables_accounts",
"path": "intermediate/tables/accounts.parquet",
"schema": {
"fields": [
@@ -1520,6 +1727,7 @@
},
{
"description": "Intermediate tier `contacts` relational table (4,200 rows) — snapshot-safe.",
+ "name": "intermediate_tables_contacts",
"path": "intermediate/tables/contacts.parquet",
"schema": {
"fields": [
@@ -1568,6 +1776,7 @@
},
{
"description": "Intermediate tier `leads` relational table (5,000 rows) — snapshot-safe.",
+ "name": "intermediate_tables_leads",
"path": "intermediate/tables/leads.parquet",
"schema": {
"fields": [
@@ -1611,6 +1820,7 @@
},
{
"description": "Intermediate tier `touches` relational table (38,724 rows) — snapshot-safe.",
+ "name": "intermediate_tables_touches",
"path": "intermediate/tables/touches.parquet",
"schema": {
"fields": [
@@ -1654,6 +1864,7 @@
},
{
"description": "Intermediate tier `sessions` relational table (10,012 rows) — snapshot-safe.",
+ "name": "intermediate_tables_sessions",
"path": "intermediate/tables/sessions.parquet",
"schema": {
"fields": [
@@ -1702,6 +1913,7 @@
},
{
"description": "Intermediate tier `sales_activities` relational table (20,679 rows) — snapshot-safe.",
+ "name": "intermediate_tables_sales_activities",
"path": "intermediate/tables/sales_activities.parquet",
"schema": {
"fields": [
@@ -1740,6 +1952,7 @@
},
{
"description": "Intermediate tier `opportunities` relational table (4,255 rows) — snapshot-safe.",
+ "name": "intermediate_tables_opportunities",
"path": "intermediate/tables/opportunities.parquet",
"schema": {
"fields": [
@@ -1773,18 +1986,22 @@
},
{
"description": "Intermediate tier auto-rendered dataset card.",
+ "name": "intermediate_dataset_card",
"path": "intermediate/dataset_card.md"
},
{
"description": "Intermediate tier headline metrics (cross-seed medians + spreads, difficulty knobs, JSON-path back-reference to validation_report.json).",
+ "name": "intermediate_metrics",
"path": "intermediate/metrics.json"
},
{
"description": "Intermediate tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).",
+ "name": "intermediate_manifest",
"path": "intermediate/manifest.json"
},
{
"description": "Advanced tier flat CSV (all splits concatenated, label retained, snapshot_day=30). The `split` column distinguishes train/valid/test rows.",
+ "name": "advanced_lead_scoring",
"path": "advanced/lead_scoring.csv",
"schema": {
"fields": [
@@ -1953,134 +2170,167 @@
},
{
"description": "Advanced tier feature dictionary (canonical column spec).",
+ "name": "advanced_feature_dictionary",
"path": "advanced/feature_dictionary.csv"
},
{
"description": "Advanced tier train split for `converted_within_90_days` (3,500 rows).",
+ "name": "advanced_tasks_converted_within_90_days_train",
"path": "advanced/tasks/converted_within_90_days/train.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -2089,130 +2339,162 @@
},
{
"description": "Advanced tier valid split for `converted_within_90_days` (750 rows).",
+ "name": "advanced_tasks_converted_within_90_days_valid",
"path": "advanced/tasks/converted_within_90_days/valid.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -2221,130 +2503,162 @@
},
{
"description": "Advanced tier test split for `converted_within_90_days` (750 rows).",
+ "name": "advanced_tasks_converted_within_90_days_test",
"path": "advanced/tasks/converted_within_90_days/test.parquet",
"schema": {
"fields": [
{
+ "description": "Opaque account identifier.",
"name": "account_id",
"type": "string"
},
{
+ "description": "Industry vertical of the buying organization.",
"name": "industry",
"type": "string"
},
{
+ "description": "Geographic region of the account's headquarters.",
"name": "region",
"type": "string"
},
{
+ "description": "Banded employee headcount of the account.",
"name": "employee_band",
"type": "string"
},
{
+ "description": "Banded estimated annual revenue of the account.",
"name": "estimated_revenue_band",
"type": "string"
},
{
+ "description": "Banded internal process maturity score (latent).",
"name": "process_maturity_band",
"type": "string"
},
{
+ "description": "Opaque contact identifier.",
"name": "contact_id",
"type": "string"
},
{
+ "description": "Functional area of the primary contact (e.g. finance, ops).",
"name": "role_function",
"type": "string"
},
{
+ "description": "Seniority band of the primary contact.",
"name": "seniority",
"type": "string"
},
{
+ "description": "Buyer role classification (economic_buyer, champion, etc.).",
"name": "buyer_role",
"type": "string"
},
{
+ "description": "Opaque lead identifier.",
"name": "lead_id",
"type": "string"
},
{
+ "description": "ISO-8601 timestamp when the lead was created.",
"name": "lead_created_at",
"type": "string"
},
{
+ "description": "Origination source of the lead (e.g. inbound_form, sdr_outbound).",
"name": "lead_source",
"type": "string"
},
{
+ "description": "Total number of marketing/sales touches recorded before snapshot.",
"name": "touch_count",
"type": "number"
},
{
+ "description": "Number of inbound touches before snapshot.",
"name": "inbound_touch_count",
"type": "number"
},
{
+ "description": "Number of outbound touches before snapshot.",
"name": "outbound_touch_count",
"type": "number"
},
{
+ "description": "Number of web/trial sessions recorded before snapshot.",
"name": "session_count",
"type": "number"
},
{
+ "description": "Cumulative pricing page views across all sessions before snapshot.",
"name": "pricing_page_views",
"type": "number"
},
{
+ "description": "Cumulative demo page views across all sessions before snapshot.",
"name": "demo_page_views",
"type": "number"
},
{
+ "description": "Sum of session durations (seconds) before snapshot.",
"name": "total_session_duration_seconds",
"type": "number"
},
{
+ "description": "Number of touches in days 0–7 (inclusive) after lead creation. Renamed from touches_week_1 for precision: the window covers 8 days (0, 1, …, 7), not 7.",
"name": "touches_days_0_7",
"type": "number"
},
{
+ "description": "Number of touches in the last 7 days before snapshot cutoff.",
"name": "touches_last_7_days",
"type": "number"
},
{
+ "description": "Days between first touch and snapshot cutoff (NaN if no touches).",
"name": "days_since_first_touch",
"type": "number"
},
{
+ "description": "Number of sales activities logged before snapshot.",
"name": "activity_count",
"type": "number"
},
{
+ "description": "Days elapsed between most recent touch and snapshot cutoff.",
"name": "days_since_last_touch",
"type": "number"
},
{
+ "description": "Whether any opportunity was created by snapshot date (open or closed).",
"name": "opportunity_created",
"type": "boolean"
},
{
+ "description": "Whether an open opportunity existed at snapshot date.",
"name": "has_open_opportunity",
"type": "boolean"
},
{
+ "description": "Estimated ACV of the most recent open opportunity (NaN if none).",
"name": "opportunity_estimated_acv",
"type": "number"
},
{
+ "description": "Expected ACV: opportunity ACV if available by snapshot, else revenue band midpoint heuristic (NaN if neither available).",
"name": "expected_acv",
"type": "number"
},
{
+ "description": "Total touches over full 90-day window. LEAKAGE TRAP: uses post-snapshot data. Included for pedagogical purposes only.",
"name": "total_touches_all",
"type": "integer"
},
{
+ "description": "Label: True if a closed_won event occurred within 90 days of the snapshot anchor date. Derived from simulated events.",
"name": "converted_within_90_days",
"type": "boolean"
}
@@ -2353,6 +2667,7 @@
},
{
"description": "Advanced tier `accounts` relational table (1,500 rows) — snapshot-safe.",
+ "name": "advanced_tables_accounts",
"path": "advanced/tables/accounts.parquet",
"schema": {
"fields": [
@@ -2401,6 +2716,7 @@
},
{
"description": "Advanced tier `contacts` relational table (4,200 rows) — snapshot-safe.",
+ "name": "advanced_tables_contacts",
"path": "advanced/tables/contacts.parquet",
"schema": {
"fields": [
@@ -2449,6 +2765,7 @@
},
{
"description": "Advanced tier `leads` relational table (5,000 rows) — snapshot-safe.",
+ "name": "advanced_tables_leads",
"path": "advanced/tables/leads.parquet",
"schema": {
"fields": [
@@ -2492,6 +2809,7 @@
},
{
"description": "Advanced tier `touches` relational table (38,208 rows) — snapshot-safe.",
+ "name": "advanced_tables_touches",
"path": "advanced/tables/touches.parquet",
"schema": {
"fields": [
@@ -2535,6 +2853,7 @@
},
{
"description": "Advanced tier `sessions` relational table (9,942 rows) — snapshot-safe.",
+ "name": "advanced_tables_sessions",
"path": "advanced/tables/sessions.parquet",
"schema": {
"fields": [
@@ -2583,6 +2902,7 @@
},
{
"description": "Advanced tier `sales_activities` relational table (19,995 rows) — snapshot-safe.",
+ "name": "advanced_tables_sales_activities",
"path": "advanced/tables/sales_activities.parquet",
"schema": {
"fields": [
@@ -2621,6 +2941,7 @@
},
{
"description": "Advanced tier `opportunities` relational table (4,004 rows) — snapshot-safe.",
+ "name": "advanced_tables_opportunities",
"path": "advanced/tables/opportunities.parquet",
"schema": {
"fields": [
@@ -2654,62 +2975,77 @@
},
{
"description": "Advanced tier auto-rendered dataset card.",
+ "name": "advanced_dataset_card",
"path": "advanced/dataset_card.md"
},
{
"description": "Advanced tier headline metrics (cross-seed medians + spreads, difficulty knobs, JSON-path back-reference to validation_report.json).",
+ "name": "advanced_metrics",
"path": "advanced/metrics.json"
},
{
"description": "Advanced tier provenance manifest (recipe, seed, package version, file hashes, snapshot_day, redaction contract).",
+ "name": "advanced_manifest",
"path": "advanced/manifest.json"
},
{
"description": "Top-level cross-tier headline metrics (medians + spreads + cohort-shift + cross-tier ordering booleans). Machine-readable summary backing the README's Calibration table.",
+ "name": "metrics",
"path": "metrics.json"
},
{
"description": "Claims register (human-readable table). Rendered from `claims_register_source.yaml`.",
+ "name": "claims_register",
"path": "claims_register.md"
},
{
"description": "Claims register (machine-readable). Each numerical / structural claim in the README paired with its backing artifact and JSON / YAML path.",
+ "name": "claims_register",
"path": "claims_register.json"
},
{
"description": "Claims-register source YAML — hand-edited; `claims_register.{md,json}` are rendered from this.",
+ "name": "claims_register_source",
"path": "claims_register_source.yaml"
},
{
"description": "Vendoring guide for the docs/ subtree — explains that these files are mirrored copies of docs/release/ in the source repo, edits go in the source, and the sync script refuses to clobber locally-edited copies.",
+ "name": "docs_readme",
"path": "docs/README.md"
},
{
"description": "Adversarial-framing guide: nine breakage patterns (leakage, split contamination, ranking inversions, calibration drift) with worked-example detection recipes.",
+ "name": "docs_break_me_guide",
"path": "docs/break_me_guide.md"
},
{
"description": "Empirical backing for the 'channel signal is weak' claim — out-of-sample univariate AUCs of `lead_source` per tier.",
+ "name": "docs_channel_signal_audit",
"path": "docs/channel_signal_audit.md"
},
{
"description": "Long-form per-feature documentation grouped by analytical role; companion to the per-tier `feature_dictionary.csv` machine-readable spec.",
+ "name": "docs_feature_dictionary",
"path": "docs/feature_dictionary.md"
},
{
"description": "Generation method (DGP description) — what is and isn't modelled by the simulator.",
+ "name": "docs_generation_method",
"path": "docs/generation_method.md"
},
{
"description": "Per-column descriptions for the 7 public relational tables (and the 2 instructor-only ones) — surfaced into the schema-section of this page.",
+ "name": "docs_relational_table_schemas",
"path": "docs/relational_table_schemas.csv"
},
{
"description": "Operational acceptance bands per gate (G5–G8); the source-of-truth thresholds the validator checks against.",
+ "name": "docs_v1_acceptance_gates_bands",
"path": "docs/v1_acceptance_gates_bands.yaml"
},
{
"description": "Accepted-for-v2 findings register — issues flagged in v1 that are scoped to the v2 release.",
+ "name": "docs_v2_decision_log",
"path": "docs/v2_decision_log.md"
}
],
diff --git a/scripts/generate_cover_image.py b/scripts/generate_cover_image.py
index 5cff167..2b0d67e 100644
--- a/scripts/generate_cover_image.py
+++ b/scripts/generate_cover_image.py
@@ -37,6 +37,7 @@
from __future__ import annotations
import argparse
+import random
import sys
from collections.abc import Sequence
from dataclasses import dataclass
@@ -52,19 +53,28 @@
CANVAS_WIDTH: Final[int] = 1280
CANVAS_HEIGHT: Final[int] = 640
-LEFT_MARGIN: Final[int] = 80
+LEFT_MARGIN: Final[int] = 62
#: Background — deep navy.
-BACKGROUND: Final[tuple[int, int, int]] = (13, 27, 42)
+BACKGROUND: Final[tuple[int, int, int]] = (13, 17, 23)
#: Card background — slightly lighter navy.
-CARD_BACKGROUND: Final[tuple[int, int, int]] = (27, 38, 59)
+CARD_BACKGROUND: Final[tuple[int, int, int]] = (22, 27, 34)
#: Primary text colour — pure white.
TEXT_PRIMARY: Final[tuple[int, int, int]] = (255, 255, 255)
#: Secondary text colour — pale steel.
-TEXT_SECONDARY: Final[tuple[int, int, int]] = (200, 220, 240)
+TEXT_SECONDARY: Final[tuple[int, int, int]] = (148, 163, 184)
+#: Dimmer text for feature line.
+TEXT_DIM: Final[tuple[int, int, int]] = (74, 85, 104)
+#: Teal accent.
+ACCENT_TEAL: Final[tuple[int, int, int]] = (0, 212, 170)
DEFAULT_OUT_PATH: Final[Path] = Path("release/dataset-cover-image.png")
+#: x-centre of the funnel visualization (right panel).
+FUNNEL_CX: Final[int] = 976
+#: y-coordinate of the top edge of the funnel (top-down).
+FUNNEL_TOP: Final[int] = 90
+
@dataclass(frozen=True)
class TierBadge:
@@ -74,6 +84,8 @@ class TierBadge:
conversion_rate_pct: str
lr_auc: float
accent: tuple[int, int, int]
+ funnel_top_width: int
+ funnel_bot_width: int
#: Cross-seed medians (seeds 42-46) from
@@ -81,11 +93,13 @@ class TierBadge:
#: cover image is reproducible without reading the report at render
#: time.
TIER_BADGES: Final[tuple[TierBadge, ...]] = (
- TierBadge("Intro", "42.7%", 0.879, (76, 175, 80)),
- TierBadge("Intermediate", "21.6%", 0.886, (255, 152, 0)),
- TierBadge("Advanced", "8.4%", 0.886, (244, 67, 54)),
+ TierBadge("Intro", "42.7%", 0.879, (0, 212, 170), 320, 238),
+ TierBadge("Intermediate", "21.6%", 0.886, (245, 158, 11), 238, 156),
+ TierBadge("Advanced", "8.4%", 0.886, (239, 68, 68), 156, 60),
)
+FUNNEL_SECTION_HEIGHT: Final[int] = 140
+
# ---------------------------------------------------------------------------
# Font loading
@@ -107,32 +121,65 @@ def _find_font(family: str, *, weight: str = "normal") -> Path:
# ---------------------------------------------------------------------------
-# Drawing
+# Drawing helpers
# ---------------------------------------------------------------------------
+def _draw_bokeh_background(img: Image.Image, *, seed: int = 42) -> Image.Image:
+ """Overlay faint translucent circles giving a soft depth effect."""
+
+ overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
+ d = ImageDraw.Draw(overlay)
+ rng = random.Random(seed) # noqa: S311
+ palette = [(0, 212, 170), (74, 158, 234), (245, 158, 11)]
+ for _ in range(160):
+ cx = rng.randint(0, CANVAS_WIDTH)
+ cy = rng.randint(0, CANVAS_HEIGHT)
+ r = rng.randint(20, 90)
+ alpha = rng.randint(3, 12)
+ col = rng.choice(palette)
+ d.ellipse((cx - r, cy - r, cx + r, cy + r), fill=(*col, alpha))
+ # Corner glow: top-left teal, bottom-right amber
+ for gx, gy, gc, ga in [
+ (110, 100, (0, 212, 170), 18),
+ (1160, 540, (245, 158, 11), 14),
+ ]:
+ gr = 280
+ d.ellipse((gx - gr, gy - gr, gx + gr, gy + gr), fill=(*gc, ga))
+ base = img.convert("RGBA")
+ return Image.alpha_composite(base, overlay).convert("RGB")
+
+
+def _draw_vertical_accent(draw: ImageDraw.ImageDraw) -> None:
+ """Teal accent bar left of the title block."""
+
+ draw.rectangle((LEFT_MARGIN, 140, LEFT_MARGIN + 5, 490), fill=ACCENT_TEAL)
+
+
def _draw_title_block(draw: ImageDraw.ImageDraw, font_paths: dict[str, Path]) -> None:
"""Render the title, tagline, and subtitle text block."""
+ tx = LEFT_MARGIN + 18
+
title_font = ImageFont.truetype(str(font_paths["bold"]), 96)
- draw.text((LEFT_MARGIN, 88), "LeadForge", font=title_font, fill=TEXT_PRIMARY)
+ draw.text((tx, 160), "LeadForge", font=title_font, fill=TEXT_PRIMARY)
- tagline_font = ImageFont.truetype(str(font_paths["regular"]), 40)
+ tagline_font = ImageFont.truetype(str(font_paths["regular"]), 38)
draw.text(
- (LEFT_MARGIN, 208),
- "Synthetic B2B Lead Scoring · v1",
- font=tagline_font,
- fill=TEXT_SECONDARY,
+ (tx, 282), "Synthetic B2B Lead Scoring · v1", font=tagline_font, fill=TEXT_SECONDARY
)
- subtitle_font = ImageFont.truetype(str(font_paths["regular"]), 24)
+ subtitle_font = ImageFont.truetype(str(font_paths["regular"]), 22)
draw.text(
- (LEFT_MARGIN, 280),
- "5,000 leads · 3 difficulty tiers · 90-day conversion · MIT",
+ (tx, 340),
+ "5,000 leads · 3 difficulty tiers · 90-day conversion · MIT",
font=subtitle_font,
- fill=TEXT_SECONDARY,
+ fill=TEXT_DIM,
)
+ # Separator line
+ draw.line((LEFT_MARGIN, 398, 640, 398), fill=(30, 45, 61), width=1)
+
def _draw_tier_card(
draw: ImageDraw.ImageDraw,
@@ -145,58 +192,136 @@ def _draw_tier_card(
left, top, right, bottom = box
draw.rectangle((left, top, right, bottom), fill=CARD_BACKGROUND)
+ # Card border
+ draw.rectangle((left, top, right, bottom), outline=badge.accent, width=1)
# Coloured accent stripe down the left edge.
- draw.rectangle((left, top, left + 8, bottom), fill=badge.accent)
+ draw.rectangle((left, top, left + 5, bottom), fill=badge.accent)
- name_font = ImageFont.truetype(str(font_paths["bold"]), 36)
- draw.text((left + 32, top + 24), badge.name, font=name_font, fill=TEXT_PRIMARY)
+ name_font = ImageFont.truetype(str(font_paths["bold"]), 22)
+ draw.text((left + 16, top + 14), badge.name, font=name_font, fill=TEXT_PRIMARY)
- body_font = ImageFont.truetype(str(font_paths["regular"]), 22)
+ body_font = ImageFont.truetype(str(font_paths["regular"]), 17)
draw.text(
- (left + 32, top + 80),
- f"Conversion: {badge.conversion_rate_pct}",
+ (left + 16, top + 50),
+ f"Conv: {badge.conversion_rate_pct}",
font=body_font,
fill=TEXT_SECONDARY,
)
draw.text(
- (left + 32, top + 116),
- f"LR AUC: {badge.lr_auc:.3f}",
- font=body_font,
- fill=TEXT_SECONDARY,
+ (left + 16, top + 76), f"LR AUC: {badge.lr_auc:.3f}", font=body_font, fill=TEXT_SECONDARY
)
+def _draw_funnel(draw: ImageDraw.ImageDraw, font_paths: dict[str, Path]) -> None:
+ """Render the three-tier conversion funnel (right panel)."""
+
+ label_font = ImageFont.truetype(str(font_paths["regular"]), 18)
+ tier_font = ImageFont.truetype(str(font_paths["bold"]), 22)
+ pct_font = ImageFont.truetype(str(font_paths["bold"]), 18)
+
+ # "Conversion by Tier" heading above funnel
+ heading = "Conversion by Tier"
+ bbox = draw.textbbox((0, 0), heading, font=label_font)
+ hw = bbox[2] - bbox[0]
+ draw.text((FUNNEL_CX - hw // 2, FUNNEL_TOP - 28), heading, font=label_font, fill=TEXT_DIM)
+
+ y = FUNNEL_TOP
+ for badge in TIER_BADGES:
+ tw = badge.funnel_top_width
+ bw = badge.funnel_bot_width
+ h = FUNNEL_SECTION_HEIGHT
+ col = badge.accent
+ bot_y = y + h
+
+ # Trapezoid
+ pts = [
+ (FUNNEL_CX - tw // 2, y),
+ (FUNNEL_CX + tw // 2, y),
+ (FUNNEL_CX + bw // 2, bot_y),
+ (FUNNEL_CX - bw // 2, bot_y),
+ ]
+ draw.polygon(pts, fill=(*col, 210))
+
+ # Subtle highlight stripe at the top edge of each section
+ hl_pts = [
+ (FUNNEL_CX - tw // 2, y),
+ (FUNNEL_CX + tw // 2, y),
+ (FUNNEL_CX + tw // 2, y + 14),
+ (FUNNEL_CX - tw // 2, y + 14),
+ ]
+ draw.polygon(hl_pts, fill=(255, 255, 255, 18))
+
+ # Tier name centred inside section
+ mid_y = y + h // 2
+ name_bbox = draw.textbbox((0, 0), badge.name, font=tier_font)
+ nw = name_bbox[2] - name_bbox[0]
+ nh = name_bbox[3] - name_bbox[1]
+ draw.text(
+ (FUNNEL_CX - nw // 2, mid_y - nh // 2), badge.name, font=tier_font, fill=TEXT_PRIMARY
+ )
+
+ # Conversion % to the right of the widest top edge
+ pct_x = FUNNEL_CX + tw // 2 + 14
+ pct_bbox = draw.textbbox((0, 0), badge.conversion_rate_pct, font=pct_font)
+ ph = pct_bbox[3] - pct_bbox[1]
+ draw.text((pct_x, mid_y - ph // 2), badge.conversion_rate_pct, font=pct_font, fill=col)
+
+ y = bot_y
+
+ # "LR AUC ≈ 0.88" caption below funnel
+ caption = "LR AUC ≈ 0.88 across all tiers"
+ cbbox = draw.textbbox((0, 0), caption, font=label_font)
+ cw = cbbox[2] - cbbox[0]
+ draw.text((FUNNEL_CX - cw // 2, y + 16), caption, font=label_font, fill=TEXT_DIM)
+
+
+def _draw_bottom_strip(draw: ImageDraw.ImageDraw) -> None:
+ """Thin teal strip at the very bottom edge."""
+
+ draw.rectangle((0, CANVAS_HEIGHT - 5, CANVAS_WIDTH, CANVAS_HEIGHT), fill=(*ACCENT_TEAL, 128))
+
+
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+
+
def render_cover(badges: Sequence[TierBadge] = TIER_BADGES) -> Image.Image:
"""Render the cover image as a fresh ``PIL.Image`` instance."""
image = Image.new("RGB", (CANVAS_WIDTH, CANVAS_HEIGHT), BACKGROUND)
- draw = ImageDraw.Draw(image)
+
+ # Bokeh background (composite via RGBA)
+ image = _draw_bokeh_background(image)
+
+ draw = ImageDraw.Draw(image, "RGBA")
font_paths = {
"regular": _find_font("DejaVu Sans", weight="normal"),
"bold": _find_font("DejaVu Sans", weight="bold"),
}
+ _draw_vertical_accent(draw)
_draw_title_block(draw, font_paths)
+ _draw_funnel(draw, font_paths)
- # Three equal-width cards spanning the bottom half of the canvas.
- card_top = 400
- card_bottom = 580
+ # Three tier cards across the bottom-left panel
+ card_top = 420
+ card_bottom = 590
card_count = len(badges)
- gap = 40
- available = CANVAS_WIDTH - 2 * LEFT_MARGIN
+ gap = 16
+ available = 640 - LEFT_MARGIN - 10
card_width = (available - gap * (card_count - 1)) // card_count
for i, badge in enumerate(badges):
left = LEFT_MARGIN + i * (card_width + gap)
right = left + card_width
_draw_tier_card(
- draw,
- badge=badge,
- box=(left, card_top, right, card_bottom),
- font_paths=font_paths,
+ draw, badge=badge, box=(left, card_top, right, card_bottom), font_paths=font_paths
)
- return image
+ _draw_bottom_strip(draw)
+
+ return image.convert("RGB")
def write_cover(path: Path, image: Image.Image | None = None) -> Path:
diff --git a/scripts/package_hf_release.py b/scripts/package_hf_release.py
index 3361492..d6fa1b6 100644
--- a/scripts/package_hf_release.py
+++ b/scripts/package_hf_release.py
@@ -395,6 +395,7 @@ class HuggingFaceCard:
tags: tuple[str, ...]
configs: tuple[ConfigEntry, ...]
body: str
+ authors: tuple[str, ...] = ()
# ---------------------------------------------------------------------------
@@ -545,6 +546,7 @@ def build_card(
default_config: str = DEFAULT_DEFAULT_CONFIG,
task: str = DEFAULT_TASK,
body: str | None = None,
+ authors: Sequence[str] = ("shaypal5",),
) -> HuggingFaceCard:
"""Assemble a ``HuggingFaceCard`` from the release tree.
@@ -595,6 +597,7 @@ def build_card(
tags=tuple(tags),
configs=configs,
body=body,
+ authors=tuple(authors),
)
@@ -635,6 +638,7 @@ def _frontmatter_dict(card: HuggingFaceCard) -> dict[str, Any]:
return {
"pretty_name": card.pretty_name,
"license": card.license,
+ **({"authors": list(card.authors)} if card.authors else {}),
"language": list(card.language),
"task_categories": list(card.task_categories),
"size_categories": list(card.size_categories),
diff --git a/scripts/package_kaggle_release.py b/scripts/package_kaggle_release.py
index 520dcec..8db7c3a 100644
--- a/scripts/package_kaggle_release.py
+++ b/scripts/package_kaggle_release.py
@@ -292,6 +292,7 @@ class Resource:
path: str
description: str
schema: ResourceSchema | None = None
+ name: str = "" # if empty, _resource_to_dict derives a slug from path
@dataclass(frozen=True)
@@ -597,6 +598,13 @@ def build_tier_resources(
column_descriptions = load_relational_column_descriptions(release_dir)
+ # Build a (table_name, col_name) → description map for task-split Parquet files.
+ # Task splits carry the same snapshot-level columns as the flat CSV, so we
+ # re-key the feature dictionary under a synthetic table name "snapshot".
+ snapshot_col_descs: dict[tuple[str, str], str] = {
+ ("snapshot", col): fd.description for col, fd in feature_dict.items()
+ }
+
resources: list[Resource] = []
resources.append(
@@ -626,7 +634,13 @@ def build_tier_resources(
Resource(
path=f"{tier}/tasks/{task}/{split}.parquet",
description=(f"{tier.capitalize()} tier {split} split for `{task}` ({rows_str})."),
- schema=ResourceSchema(fields=fields_from_parquet(split_path)),
+ schema=ResourceSchema(
+ fields=fields_from_parquet(
+ split_path,
+ column_descriptions=snapshot_col_descs,
+ table_name="snapshot",
+ )
+ ),
)
)
@@ -852,7 +866,13 @@ def _resource_to_dict(resource: Resource) -> dict[str, Any]:
per-column documentation).
"""
+ name = resource.name or re.sub(
+ r"[^a-z0-9]+",
+ "_",
+ Path(resource.path).with_suffix("").as_posix().lower(),
+ ).strip("_")
payload: dict[str, Any] = {
+ "name": name,
"path": resource.path,
"description": resource.description,
}
diff --git a/tests/scripts/test_preview_kaggle_page.py b/tests/scripts/test_preview_kaggle_page.py
index 428fd6d..4673550 100644
--- a/tests/scripts/test_preview_kaggle_page.py
+++ b/tests/scripts/test_preview_kaggle_page.py
@@ -51,7 +51,12 @@
# stopped firing. The whitelist is intentionally narrow.
_LINK_OK_PREFIXES = (
"https://github.com/leadforge-dev/leadforge",
+ "https://github.com/shaypalachy",
"https://huggingface.co/datasets/leadforge",
+ "https://huggingface.co/datasets/shaypal5",
+ "https://huggingface.co/shaypal5",
+ "https://www.kaggle.com/derelictpanda",
+ "https://www.shaypalachy.com/",
"https://example.com", # used by unit tests only
"LICENSE", # sibling-relative, resolves under the upload tree
"#", # in-document anchor (footnotes, etc.)