44 changes: 42 additions & 2 deletions .agent-plan.md
@@ -6,11 +6,51 @@

## Current System State

**v1.0.0 released (2026-05-02).** All milestones (M0–M13) complete. Package version bumped to 1.0.0 in pyproject.toml and leadforge/version.py. README updated with `pip install leadforge`. CHANGELOG consolidated under v1.0.0 heading.
**v1.0.0 released (2026-05-02).** All milestones (M0–M13) complete. Teaching dataset series (v1–v7) approved by consumer. Package version bumped to 1.0.0 in pyproject.toml and leadforge/version.py.

---

## Next Up — v4 Lead Scoring Dataset
## Next Up — Public Kaggle/HuggingFace Release

First public dataset release: `leadforge-b2b-lead-scoring`. Three difficulty tiers (intro/intermediate/advanced) as full relational bundles + flat CSV convenience exports, plus a research_instructor companion for intermediate.

### Public release — Phase 1: Dataset card improvement ✓ (in PR)

- [x] `render_dataset_card()` accepts `table_counts` dict → renders table inventory
- [x] Feature categories section rendered from `LEAD_SNAPSHOT_FEATURES` (category counts, examples, leakage flags)
- [x] `write_bundle()` passes `table_row_counts` to card renderer
- [x] 4 new tests (table inventory with/without counts, feature categories, leakage flags)

### Public release — Phase 2: Build script + flat CSV ✓ (in PR)

- [x] `scripts/build_public_release.py` — generates 4 bundles, validates, creates flat CSV exports
- [x] Flat CSV drops `current_stage` (contains terminal stages that encode the label at 90-day horizon)
- [x] All 4 bundles pass `validate_bundle()`
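
The flat export step above can be sketched roughly as follows. This is an illustration only, not the code in `scripts/build_public_release.py`: the `FLAT_EXPORT_DROP` set, the `write_flat_csv` helper, and the toy rows are all hypothetical, and only the standard library is used.

```python
import csv
import io

# Hypothetical drop list; the plan names only `current_stage`.
FLAT_EXPORT_DROP = {"current_stage"}


def write_flat_csv(rows, out, drop=FLAT_EXPORT_DROP):
    """Write dict rows as CSV, omitting every column named in `drop`."""
    if not rows:
        return []
    columns = [c for c in rows[0] if c not in drop]
    # extrasaction="ignore" lets DictWriter skip the dropped keys silently.
    writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return columns


# Toy rows standing in for the lead snapshot.
rows = [
    {"lead_id": "L1", "current_stage": "closed_won", "converted_within_90_days": 1},
    {"lead_id": "L2", "current_stage": "closed_lost", "converted_within_90_days": 0},
]
buffer = io.StringIO()
kept = write_flat_csv(rows, buffer)
```

Keeping the drop list in one named constant would make it easy to extend if further columns turn out to encode the label.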

### Public release — Phase 3: Platform README + HF card ✓ (in PR)

- [x] `release/README.md` — landing page with directory structure, quick-start snippets, dataset summary, provenance
- [x] `release/HF_DATASET_CARD.md` — YAML frontmatter with configs for each difficulty tier

### Public release — Phase 4: Baseline notebook ✓ (in PR)

- [x] `release/notebooks/01_baseline_lead_scoring.ipynb` — LR + GBM baselines, P@K, value-aware ranking, feature importance
- [x] Excludes `current_stage` and leakage-flagged columns
- [x] Works from pre-generated Parquet files (no leadforge install needed)
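
For readers unfamiliar with P@K, a minimal sketch of the metric is shown here. The notebook's own implementation is not part of this diff, so the `precision_at_k` helper and the toy labels and scores are assumptions made for illustration.

```python
def precision_at_k(y_true, scores, k):
    """Fraction of positive labels among the k highest-scored items."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not y_true:
        raise ValueError("y_true must not be empty")
    # Rank leads by model score, best first, and keep the top k.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Toy example: five leads, two of which converted.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
p_at_2 = precision_at_k(y_true, scores, k=2)
p_at_3 = precision_at_k(y_true, scores, k=3)
```

P@K suits lead scoring because a sales team can usually act on only the top-ranked leads, so ranking quality at the head of the list matters more than overall accuracy.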

### Public release — Phase 5: Generate final release + upload (pending)

- [ ] Run build script, verify SHA-256 hash determinism
- [ ] Upload to Kaggle and HuggingFace
- [ ] Announce
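
One way the hash-determinism check could look is sketched here: build twice with the same seed, hash every output file, and compare. The `hash_tree` and `fake_build` helpers are hypothetical; `fake_build` is a stand-in for the real build script, which this sketch does not call.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_tree(root):
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def fake_build(out_dir, seed):
    """Stand-in for the real build: writes seed-dependent content."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "train.csv").write_text(f"seed,{seed}\n")


with tempfile.TemporaryDirectory() as tmp:
    fake_build(Path(tmp) / "run_a", seed=42)
    fake_build(Path(tmp) / "run_b", seed=42)
    hashes_a = hash_tree(Path(tmp) / "run_a")
    hashes_b = hash_tree(Path(tmp) / "run_b")

# Identical digests for every file mean the two runs are byte-identical.
deterministic = hashes_a == hashes_b
```

Comparing per-file digests rather than one combined hash makes it easier to see which file differs when a run is not reproducible.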

### Known issue: `current_stage` leakage at 90-day horizon

The full bundle snapshot includes `current_stage`, which at day 90 contains terminal stages (`closed_won`/`closed_lost`) and therefore perfectly encodes the label. The flat CSV export drops the column; the Parquet task splits retain it and document the leakage. A proper fix (a windowed snapshot, or column redaction in the exposure layer) is deferred.
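
A small check makes the problem concrete: if every value of a column co-occurs with exactly one label, the column encodes the label. The `labels_by_value` helper and the toy rows are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def labels_by_value(rows, column, label):
    """For each value of `column`, collect the set of labels seen with it."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[column]].add(row[label])
    return dict(seen)


# Toy rows mimicking a day-90 snapshot where every lead is in a terminal stage.
rows = [
    {"current_stage": "closed_won", "converted_within_90_days": 1},
    {"current_stage": "closed_won", "converted_within_90_days": 1},
    {"current_stage": "closed_lost", "converted_within_90_days": 0},
    {"current_stage": "closed_lost", "converted_within_90_days": 0},
]
stage_labels = labels_by_value(rows, "current_stage", "converted_within_90_days")

# True when each stage value maps to a single label, i.e. the column leaks it.
perfectly_encoded = all(len(labels) == 1 for labels in stage_labels.values())
```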

---

## Previous Focus — v4–v7 Lead Scoring Datasets

The primary focus is producing a v4 lead scoring dataset that fixes the issues found in v1–v3 datasets. This requires targeted engine changes + a build pipeline, followed by dataset release.

8 changes: 8 additions & 0 deletions .gitignore
@@ -208,3 +208,11 @@ __marimo__/

# MacOS DS_Store files
.DS_Store

# Generated output bundles
out/
release/intro/
release/intermediate/
release/advanced/
release/intermediate_instructor/
release/LICENSE
4 changes: 3 additions & 1 deletion leadforge/api/bundle.py
@@ -80,7 +80,9 @@ def write_bundle(
    # ------------------------------------------------------------------
    # 3. Dataset card and feature dictionary
    # ------------------------------------------------------------------
    (root / "dataset_card.md").write_text(render_dataset_card(bundle.spec, task_manifest=task))
    (root / "dataset_card.md").write_text(
        render_dataset_card(bundle.spec, task_manifest=task, table_counts=table_row_counts)
    )
    write_feature_dictionary(root / "feature_dictionary.csv")

    # ------------------------------------------------------------------
62 changes: 45 additions & 17 deletions leadforge/narrative/dataset_card.py
@@ -6,8 +6,11 @@

from __future__ import annotations

from collections import Counter
from typing import TYPE_CHECKING

from leadforge.schema.features import LEAD_SNAPSHOT_FEATURES

if TYPE_CHECKING:
    from leadforge.core.models import WorldSpec
    from leadforge.schema.tasks import TaskManifest
@@ -16,6 +19,7 @@
def render_dataset_card(
    world_spec: WorldSpec,
    task_manifest: TaskManifest | None = None,
    table_counts: dict[str, int] | None = None,
) -> str:
    """Return a Markdown dataset card string for *world_spec*.

Expand All @@ -24,17 +28,18 @@ def render_dataset_card(
        task_manifest: Optional task manifest whose ``description`` is used
            as the label definition prose. When ``None`` or when
            ``description`` is empty, a generic fallback is rendered.
        table_counts: Optional mapping of table name → row count. When
            provided, the table inventory section renders actual counts
            instead of a placeholder.

    Sections present at all milestones:
    Sections:
    - Header (recipe id, version, seed, exposure mode)
    - Narrative summary (company, product, market, GTM)
    - Primary task and label definition
    - Suggested use cases
    - Caveats

    Sections populated in later milestones (rendered as stubs here):
    - Table inventory
    - Feature categories
    - Suggested use cases
    - Caveats
    """
    cfg = world_spec.config
    narrative = world_spec.narrative
@@ -122,24 +127,47 @@ def render_dataset_card(
    ]

    # ------------------------------------------------------------------
    # Table inventory (stub — populated in later milestones)
    # Table inventory
    # ------------------------------------------------------------------
    lines += [
        "## Table inventory",
        "",
        "*Table counts will appear here once the simulation layer is implemented (v0.3.0+).*",
        "",
    ]
    lines += ["## Table inventory", ""]
    if table_counts is not None:
        lines += [
            "| Table | Rows |",
            "|---|---:|",
        ]
        for tbl, count in table_counts.items():
            lines.append(f"| {tbl} | {count:,} |")
        lines.append("")
    else:
        lines += [
            "*Table counts not available (pass ``table_counts`` to populate).*",
            "",
        ]

    # ------------------------------------------------------------------
    # Feature categories (stub)
    # Feature categories
    # ------------------------------------------------------------------
    lines += ["## Feature categories", ""]
    category_counts: Counter[str] = Counter()
    for feat in LEAD_SNAPSHOT_FEATURES:
        category_counts[feat.category] += 1
    lines += [
        "## Feature categories",
        "",
        "*Feature dictionary will appear here once the schema layer is implemented (v0.3.0+).*",
        "",
        "| Category | Count | Examples |",
        "|---|---:|---|",
    ]
    for cat, count in category_counts.items():
        examples = [
            f.name for f in LEAD_SNAPSHOT_FEATURES if f.category == cat and not f.is_target
        ][:3]
        lines.append(f"| {cat} | {count} | {', '.join(examples)} |")
    leakage_cols = [f.name for f in LEAD_SNAPSHOT_FEATURES if f.leakage_risk]
    if leakage_cols:
        lines += [
            "",
            f"**Leakage-flagged columns:** {', '.join(f'`{c}`' for c in leakage_cols)}. "
            "See `feature_dictionary.csv` for details.",
        ]
    lines.append("")

    # ------------------------------------------------------------------
    # Suggested use cases
6 changes: 5 additions & 1 deletion leadforge/schema/features.py
@@ -116,8 +116,12 @@ class FeatureSpec:
    FeatureSpec(
        "current_stage",
        "string",
        "Funnel stage at snapshot anchor date.",
        "Funnel stage at snapshot anchor date. WARNING: at full-horizon "
        "(90-day) snapshots this contains terminal stages (closed_won / "
        "closed_lost) that encode the label. Exclude from modeling or use "
        "a windowed snapshot.",
        "lead_meta",
        leakage_risk=True,
    ),
    FeatureSpec(
        "is_mql",
104 changes: 104 additions & 0 deletions release/HF_DATASET_CARD.md
@@ -0,0 +1,104 @@
---
language:
- en
license: mit
task_categories:
- tabular-classification
tags:
- lead-scoring
- b2b
- crm
- synthetic
- relational
- sales
- funnel
- binary-classification
- reproducible
size_categories:
- 1K-10K
configs:
- config_name: intro
  data_files:
  - split: train
    path: intro/tasks/converted_within_90_days/train.parquet
  - split: validation
    path: intro/tasks/converted_within_90_days/valid.parquet
  - split: test
    path: intro/tasks/converted_within_90_days/test.parquet
- config_name: intermediate
  data_files:
  - split: train
    path: intermediate/tasks/converted_within_90_days/train.parquet
  - split: validation
    path: intermediate/tasks/converted_within_90_days/valid.parquet
  - split: test
    path: intermediate/tasks/converted_within_90_days/test.parquet
- config_name: advanced
  data_files:
  - split: train
    path: advanced/tasks/converted_within_90_days/train.parquet
  - split: validation
    path: advanced/tasks/converted_within_90_days/valid.parquet
  - split: test
    path: advanced/tasks/converted_within_90_days/test.parquet
---

# LeadForge: Synthetic B2B Lead Scoring Dataset

A relational, reproducible, multi-difficulty lead scoring dataset generated by [leadforge](https://github.com/leadforge-dev/leadforge) -- an open-source Python framework for synthetic CRM/funnel data.

## Why this dataset?

1. **Relational structure.** 9 normalized tables plus ML-ready task splits. Practice feature engineering from raw tables, or grab the flat file and start modeling.
2. **Three difficulty tiers.** Same world, different signal-to-noise ratios.
3. **Reproducible and leakage-aware.** Deterministic generation (seed 42), SHA-256 hashes in `manifest.json`, and a deliberate leakage trap flagged in `feature_dictionary.csv`.

## Quick start

```python
from datasets import load_dataset

# Load intermediate difficulty
ds = load_dataset("leadforge/leadforge-b2b-lead-scoring", name="intermediate")
train = ds["train"].to_pandas()
valid = ds["validation"].to_pandas() # Note: file is valid.parquet, split name is "validation"
test = ds["test"].to_pandas()
```

Or use the flat CSV:

```python
import pandas as pd
df = pd.read_csv("hf://datasets/leadforge/leadforge-b2b-lead-scoring/intermediate/lead_scoring.csv")
```

## Dataset summary

| | Intro | Intermediate | Advanced |
|---|---|---|---|
| Leads | 5,000 | 5,000 | 5,000 |
| Features | 35 | 35 | 35 |
| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |
| Signal strength | 0.90 | 0.70 | 0.50 |
| Noise scale | 0.10 | 0.30 | 0.55 |
| Missing rate | 2% | 8% | 18% |

## The scenario

**Veridian Technologies** sells cloud procurement automation to mid-market firms (200-2,000 employees). Sales channels: inbound (45%), SDR outbound (35%), partner referrals (20%). Four buyer personas. **Task:** predict conversion within 90 days.

## Relational tables

Each difficulty tier includes 9 Parquet tables under `tables/`: accounts, contacts, leads, touches, sessions, sales_activities, opportunities, customers, subscriptions. These form a normalized CRM schema linked by foreign keys.

## Leakage trap

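`total_touches_all` counts touches over the full 90-day window, including events after the snapshot date, so it leaks post-snapshot information. It is flagged as `leakage_risk=True` in `feature_dictionary.csv`. `current_stage` carries the same flag: at the 90-day horizon it contains terminal stages (`closed_won` / `closed_lost`) that encode the label, so exclude it from modeling as well.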
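
A sketch of how the flagged columns might be filtered out before modeling follows. The layout of `feature_dictionary.csv` is not shown in this card, so the `name` and `leakage_risk` headers, the `employee_count` feature, and the `leakage_flagged` helper are assumptions; the inline string stands in for the real file.

```python
import csv
import io

# Toy stand-in for feature_dictionary.csv. The `name` and `leakage_risk`
# headers and the `employee_count` feature are assumed, not confirmed.
FEATURE_DICTIONARY = """name,leakage_risk
employee_count,False
total_touches_all,True
"""


def leakage_flagged(dictionary_file):
    """Return feature names whose leakage_risk cell reads as true."""
    reader = csv.DictReader(dictionary_file)
    return [
        row["name"]
        for row in reader
        if row["leakage_risk"].strip().lower() == "true"
    ]


flagged = leakage_flagged(io.StringIO(FEATURE_DICTIONARY))

# Keep only the columns that are not flagged as leakage risks.
all_columns = ["employee_count", "total_touches_all"]
feature_columns = [c for c in all_columns if c not in flagged]
```

Reading the flags from the dictionary instead of hard-coding column names would keep a modeling script in step with the bundle if more columns are flagged later.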

## Research companion

`intermediate_instructor/` includes the full causal structure: world graph (DAG), latent trait registry, and mechanism assignments.

## Provenance

Generated by [leadforge](https://github.com/leadforge-dev/leadforge) v1.0.0, recipe `b2b_saas_procurement_v1`, seed 42. MIT license. See `manifest.json` in each bundle for SHA-256 hashes.