Merged
10 changes: 7 additions & 3 deletions .agent-plan.md
@@ -38,13 +38,17 @@ First public dataset release: `leadforge-b2b-lead-scoring`. Three difficulty tie
- [x] Excludes `current_stage` and leakage-flagged columns
- [x] Works from pre-generated Parquet files (no leadforge install needed)

-### Public release — Phase 5: Generate final release + upload (pending)
+### Public release — Phase 5: Generate final release + upload (in progress)

-- [ ] Run build script, verify SHA-256 hash determinism
+- [x] Regenerate release bundles with difficulty-aware engine (PR #52 merged)
+- [x] Verify three tiers produce different conversion rates (intro 41.5%, intermediate 20.1%, advanced 7.9%)
+- [x] Update release/README.md — remove stale "Known limitations", add conversion rates to dataset summary
+- [x] Update release/HF_DATASET_CARD.md — add conversion rates to summary table
+- [ ] Verify SHA-256 hash determinism (re-run build, compare hashes)
- [ ] Upload to Kaggle and HuggingFace
- [ ] Announce
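
The open "Verify SHA-256 hash determinism (re-run build, compare hashes)" item could be checked along these lines. This is a minimal sketch, not the project's build script: it assumes two independent build runs have been written to two separate directories, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_builds(first: Path, second: Path) -> list[str]:
    """List relative paths that are missing from one build or differ in content."""
    a, b = hash_tree(first), hash_tree(second)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `diff_builds(Path("run_1"), Path("run_2"))` (both directory names are placeholders) would indicate that the two runs produced byte-identical files.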

-### Difficulty modulation ✓ (PR pending)
+### Difficulty modulation ✓ (PR #52, merged)

- [x] `leadforge/core/models.py` — `DifficultyParams` frozen dataclass; optional field on `GenerationConfig`
- [x] `leadforge/mechanisms/policies.py` — `assign_mechanisms()` accepts `difficulty_params`; per-motif calibration computes target daily hazard from conversion_rate_range; signal_strength scales LatentScore weights
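
For readers unfamiliar with the "target daily hazard from conversion_rate_range" step, one way to derive such a hazard is sketched below. It rests on assumptions that the diff does not confirm: a constant per-day conversion probability, a 90-day window (suggested by the `converted_within_90_days` target), and the midpoint of the range as the target. The actual calibration in `policies.py` may differ, and both function names are hypothetical.

```python
def daily_hazard(target_rate: float, window_days: int = 90) -> float:
    """Constant daily hazard h satisfying 1 - (1 - h) ** window_days == target_rate."""
    if not 0.0 <= target_rate < 1.0:
        raise ValueError("target_rate must be in [0, 1)")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    # Survival over the window is (1 - h) ** window_days = 1 - target_rate.
    return 1.0 - (1.0 - target_rate) ** (1.0 / window_days)


def hazard_from_range(rate_range: tuple[float, float], window_days: int = 90) -> float:
    """Assumed convention: aim at the midpoint of the (low, high) conversion range."""
    low, high = rate_range
    return daily_hazard((low + high) / 2.0, window_days)
```

Under these assumptions, a range such as `(0.18, 0.28)` maps to the daily hazard whose 90-day cumulative conversion probability is 23%.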
3 changes: 2 additions & 1 deletion release/HF_DATASET_CARD.md
@@ -50,7 +50,7 @@ A relational, reproducible, multi-difficulty lead scoring dataset generated by [
## Why this dataset?

1. **Relational structure.** 9 normalized tables plus ML-ready task splits. Practice feature engineering from raw tables, or grab the flat file and start modeling.
-2. **Three difficulty tiers.** Same world, different signal-to-noise ratios.
+2. **Three difficulty tiers.** Same world, different conversion rates, signal-to-noise ratios, and missingness.
3. **Reproducible and leakage-safe.** Deterministic generation (seed 42), SHA-256 hashes, explicit leakage trap.

## Quick start
@@ -79,6 +79,7 @@ df = pd.read_csv("hf://datasets/leadforge/leadforge-b2b-lead-scoring/intermediat
| Leads | 5,000 | 5,000 | 5,000 |
| Features | 35 | 35 | 35 |
| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |
+| Conversion rate | 41.5% | 20.1% | 7.9% |
| Signal strength | 0.90 | 0.70 | 0.50 |
| Noise scale | 0.10 | 0.30 | 0.55 |
| Missing rate | 2% | 8% | 18% |
10 changes: 5 additions & 5 deletions release/README.md
@@ -8,7 +8,7 @@ Most public lead scoring datasets are flat CSVs with opaque provenance. This one

1. **Relational structure.** 9 normalized tables (accounts, contacts, leads, touches, sessions, sales activities, opportunities, customers, subscriptions) plus ML-ready task splits. Practice feature engineering from raw tables, or grab the flat file and start modeling.

-2. **Three difficulty tiers.** Same company, same product, same buyer personas -- different difficulty profiles. Each tier declares different signal strength, noise, and missingness parameters in its manifest. (See [Known limitations](#known-limitations) for current status.)
+2. **Three difficulty tiers.** Same company, same product, same buyer personas -- different difficulty profiles that produce meaningfully different conversion rates, noise levels, and missingness.

3. **Reproducible and leakage-safe.** Deterministic generation from a fixed seed. SHA-256 hashes for every file in `manifest.json`. Leakage-prone columns (`total_touches_all`, `current_stage`) are explicitly flagged in the feature dictionary. All features are anchored at the snapshot date -- no post-cutoff data leaks in.
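
As an illustration of the leakage point in item 3, a modeling script might drop the flagged columns before fitting. The sketch below hard-codes the two column names given in the README text. The helper name and the example feature names are hypothetical, and the real feature dictionary format is not shown in this diff.

```python
# Columns the README names as leakage-prone, plus the target column.
LEAKAGE_COLUMNS = frozenset({"total_touches_all", "current_stage"})
TARGET_COLUMN = "converted_within_90_days"


def safe_feature_columns(columns, leakage=LEAKAGE_COLUMNS, target=TARGET_COLUMN):
    """Return the columns in their original order, minus leakage-flagged ones and the target."""
    return [name for name in columns if name not in leakage and name != target]


# Hypothetical column list; only the leakage and target names come from the README.
example_columns = [
    "employee_count",
    "total_touches_all",
    "current_stage",
    "days_since_first_touch",
    "converted_within_90_days",
]
print(safe_feature_columns(example_columns))
```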

@@ -108,10 +108,14 @@ leadforge generate \
| Contacts | 4,200 | 4,200 | 4,200 |
| Columns | 35 (34 features + 1 target) | 35 | 35 |
| Target | `converted_within_90_days` | `converted_within_90_days` | `converted_within_90_days` |
+| Conversion rate (target) | 30-45% | 18-28% | 8-15% |
+| Conversion rate (observed) | 41.5% | 20.1% | 7.9% |
| Signal strength | 0.90 | 0.70 | 0.50 |
| Noise scale | 0.10 | 0.30 | 0.55 |
| Missing rate | 2% | 8% | 18% |

+Higher difficulty means weaker signal, more noise, more missingness, and lower base conversion rate -- all modulated in the simulation engine. Target ranges are defined in `difficulty_profiles.yaml`.
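
A quick consistency check between the target and observed rows of the table above can be sketched as follows. The numbers are copied from the table. The tier names follow the plan file (intro, intermediate, advanced) and are assumed to match the column order here. The function name is hypothetical, and treating range bounds as inclusive is also an assumption.

```python
# Percentages copied from the README table; the tier-to-column mapping is assumed.
TARGET_RANGES = {
    "intro": (30.0, 45.0),
    "intermediate": (18.0, 28.0),
    "advanced": (8.0, 15.0),
}
OBSERVED_RATES = {"intro": 41.5, "intermediate": 20.1, "advanced": 7.9}


def tiers_outside_target(observed, targets):
    """Return {tier: (rate, low, high)} for tiers whose rate is outside the inclusive range."""
    misses = {}
    for tier, rate in observed.items():
        low, high = targets[tier]
        if not low <= rate <= high:
            misses[tier] = (rate, low, high)
    return misses


print(tiers_outside_target(OBSERVED_RATES, TARGET_RANGES))
# → {'advanced': (7.9, 8.0, 15.0)}
```

With the figures as published, the advanced tier's 7.9% sits 0.1 percentage points below its 8-15% target, while the other two tiers fall inside their ranges.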

## The scenario

**Veridian Technologies** is a Series B startup (Austin, US) selling **Veridian Procure**, a cloud-based procurement and AP automation platform, to mid-market firms (200-2,000 employees) in the US and UK.
@@ -152,10 +156,6 @@ The `intermediate_instructor/` bundle includes the full hidden causal structure:

This enables research on causal inference, model interpretability, and DGP-aware evaluation.

-## Known limitations
-
-- **Difficulty tiers share the same conversion rate.** The simulation engine does not yet modulate conversion rates by difficulty profile. All three tiers produce similar base rates (~70%). The difficulty profiles are declared in each bundle's manifest and will produce meaningfully different signal-to-noise ratios once the engine is updated. For now, the primary difference between tiers is the declared profile metadata.

## Provenance

| Field | Value |