Merged
37 changes: 36 additions & 1 deletion .agent-plan.md
@@ -6,7 +6,7 @@

## Current System State

**v0.5.0 in progress — Milestones 7–11 complete, v5 dataset shipped + canonical validation module.** Full simulation engine + render/bundle + exposure filtering + CLI commands + validation harness implemented. v4 engine changes merged (PR #21). v5 dataset regenerated with boosted leakage trap (snapshot day 10, Poisson(1) target-correlated boost) and validated via canonical sklearn pipeline (all checks pass). Canonical validation module added as single source of truth (`leadforge/validation/lead_scoring.py`).
**v0.5.0 in progress — Milestones 7–11 complete, v6 dataset shipped.** Full simulation engine + render/bundle + exposure filtering + CLI commands + validation harness implemented. v6 dataset adds: latent-aware touch intensity (`LatentDecayIntensity`), causally-grounded leakage trap (post-snapshot touches), student/instructor split, `touches_last_7_days` momentum feature, `acquisition_wave` cohort feature, soft ACV winsorization, GBM improvement validation.

---

@@ -85,6 +85,41 @@ No engine changes required — v5 is a build pipeline + validation improvement.
- [x] `tests/scripts/test_build_v5_snapshot.py` — imports from `leadforge.pipelines.build_v5` directly (no more `importlib` hack)
- [x] All 705 tests pass; lint clean

### v6: Causally-grounded dataset with student/instructor split (PR #31)

Engine changes:
- [x] `leadforge/mechanisms/counts.py` — `LatentDecayIntensity` mechanism: Poisson intensity with recency decay + latent-trait modulation (causal link: latent → touches AND latent → conversion)
- [x] `leadforge/mechanisms/policies.py` — `_TOUCH_LATENT_WEIGHTS` per motif family; `assign_mechanisms(latent_touch_intensity=True)` flag
- [x] `leadforge/simulation/engine.py` — `simulate_world(latent_touch_intensity=True)` passthrough
- [x] `leadforge/api/generator.py` — `generate(latent_touch_intensity=True)` via kwargs
- [x] `leadforge/render/snapshots.py` — `touches_last_7_days` feature added to snapshot builder
- [x] `leadforge/schema/features.py` — `touches_last_7_days` FeatureSpec added
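
The idea behind a latent-modulated, recency-decaying Poisson intensity can be sketched as follows. This is not the `LatentDecayIntensity` code from `leadforge/mechanisms/counts.py`: the functional form, the parameter names (`base_rate`, `half_life_days`, `latent_weight`), and the sampler are all assumptions made for illustration.

```python
import math
import random


def touch_intensity(day, latent, base_rate=1.0, half_life_days=14.0, latent_weight=0.8):
    """Expected touches on `day` for a lead with latent trait `latent` (hypothetical form)."""
    decay = 0.5 ** (day / half_life_days)          # recency decay: halves every half_life_days
    modulation = math.exp(latent_weight * latent)  # higher latent intent/fit -> more touches
    return base_rate * decay * modulation


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(42)
for latent in (-1.0, 0.0, 1.0):
    touches = sum(sample_poisson(touch_intensity(day, latent), rng) for day in range(15))
    print(f"latent={latent:+.1f} touches in days 0-14: {touches}")
```

Because the same latent trait would also feed the conversion mechanism, touch counts and conversion become correlated without any label-dependent injection.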

Build pipeline:
- [x] `leadforge/pipelines/build_v6.py` — all pipeline functions: `derive_features`, `softcap_expected_acv`, `assign_acquisition_wave`, `compute_post_snapshot_touches`, `rename_and_select`, `subsample`, `inject_missingness`
- [x] `scripts/build_v6_snapshot.py` — CLI: generates both student + instructor CSVs
- [x] `scripts/validate_v6_dataset.py` — validates both exports: basic checks, determinism, baseline AUC, tree improvement, value-aware ranking, trap delta (10 seeds), cohort split
- [x] `scripts/quick_baseline_eval_v6.py` — LR + RF + GBM baselines, value-aware ranking, feature importance, trap detection
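
For readers unfamiliar with soft winsorization, here is a minimal sketch of how it differs from a hard clip. It is not the `softcap_expected_acv` implementation; the tanh compression, the `knee` and `ceiling` parameters, and their default values are assumptions chosen only to illustrate the idea.

```python
import math


def hard_clip(value, ceiling=120_000.0):
    """Hard clip: every value above the ceiling collapses onto the ceiling."""
    return min(value, ceiling)


def soft_cap(value, knee=100_000.0, ceiling=120_000.0):
    """Identity below `knee`; above it, compress smoothly toward `ceiling`."""
    if value <= knee:
        return value
    headroom = ceiling - knee
    # tanh is ~linear near 0 and saturates at 1, so ordering above the knee is preserved.
    return knee + headroom * math.tanh((value - knee) / headroom)


for raw in (50_000.0, 110_000.0, 130_000.0, 150_000.0):
    print(f"raw={raw:>9.0f} hard={hard_clip(raw):>9.0f} soft={soft_cap(raw):>9.0f}")
```

A hard clip produces a spike of identical values at the ceiling, whereas the soft cap keeps large deals distinguishable while still bounding them.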

Datasets:
- [x] `lead_scoring_intro/lead_scoring_intro_v6.csv` — 1000 rows × 20 cols (student-safe, no leakage)
- [x] `lead_scoring_intro/lead_scoring_intro_v6_instructor.csv` — 1000 rows × 21 cols (+ `__leakage__touches_post_snapshot_15_90`)

Validation results:
- [x] Baseline AUC: 0.627 (within [0.62, 0.90])
- [x] GBM improvement: +0.022 over LR (5-seed average)
- [x] Trap delta: mean 0.046, min 0.034 (both above thresholds)
- [x] Value-aware uplift: +17.6% at K=25, +41.3% at K=50
- [x] All mandatory checks pass

Documentation + CI:
- [x] `lead_scoring_intro/RELEASE_v6.md` — column dictionary, missingness patterns, metrics, teaching guidance (4 lectures)
- [x] `lead_scoring_intro/BACKGROUND_v6.md` — ProcureFlow business context for students
- [x] `.github/workflows/ci.yml` — `validate-dataset-v6` job added
- [x] `tests/scripts/test_build_v6_snapshot.py` — 26 tests for pipeline functions
- [x] `tests/mechanisms/test_mechanisms.py` — 6 tests for `LatentDecayIntensity`
- [x] All 737 tests pass; lint + format clean

---

## Deferred Items
42 changes: 35 additions & 7 deletions .github/workflows/ci.yml
@@ -68,17 +68,45 @@ jobs:
          python-version: "3.12"
      - run: pip install -e ".[dev,scripts]"
      - name: Check for v5 dataset
        id: check
        id: check-v5
        run: |
          if [ -f "lead_scoring_intro_v5.csv" ]; then
            echo "found=true" >> "$GITHUB_OUTPUT"
            echo "csv=lead_scoring_intro_v5.csv" >> "$GITHUB_OUTPUT"
          else
            echo "found=false" >> "$GITHUB_OUTPUT"
          fi
      - name: Run validator
        if: steps.check.outputs.found == 'true'
        run: python scripts/validate_lead_scoring_dataset.py --csv "${{ steps.check.outputs.csv }}" --enforce-1000
      - name: Skip (no dataset)
        if: steps.check.outputs.found != 'true'
        run: echo "No lead_scoring_intro_v5.csv found in repo root — skipping validation"
      - name: Run v5 validator
        if: steps.check-v5.outputs.found == 'true'
        run: python scripts/validate_lead_scoring_dataset.py --csv "${{ steps.check-v5.outputs.csv }}" --enforce-1000
      - name: Skip v5 (no dataset)
        if: steps.check-v5.outputs.found != 'true'
        run: echo "No lead_scoring_intro_v5.csv found in repo root — skipping v5 validation"

  validate-dataset-v6:
    name: Validate v6 lead scoring dataset
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev,scripts]"
      - name: Check for v6 datasets
        id: check-v6
        run: |
          STUDENT="lead_scoring_intro/lead_scoring_intro_v6.csv"
          INSTRUCTOR="lead_scoring_intro/lead_scoring_intro_v6_instructor.csv"
          if [ -f "$STUDENT" ] && [ -f "$INSTRUCTOR" ]; then
            echo "found=true" >> "$GITHUB_OUTPUT"
            echo "student=$STUDENT" >> "$GITHUB_OUTPUT"
            echo "instructor=$INSTRUCTOR" >> "$GITHUB_OUTPUT"
          else
            echo "found=false" >> "$GITHUB_OUTPUT"
          fi
      - name: Run v6 validator
        if: steps.check-v6.outputs.found == 'true'
        run: python scripts/validate_v6_dataset.py "${{ steps.check-v6.outputs.student }}" "${{ steps.check-v6.outputs.instructor }}"
      - name: Skip v6 (no dataset)
        if: steps.check-v6.outputs.found != 'true'
        run: echo "No v6 datasets found — skipping v6 validation"
62 changes: 62 additions & 0 deletions lead_scoring_intro/BACKGROUND_v6.md
@@ -0,0 +1,62 @@
# BACKGROUND v6 — Lead Scoring Intro Dataset

## Business context

You are a data scientist at **ProcureFlow**, a mid-market B2B SaaS company selling AP automation and procurement workflow software. ProcureFlow targets companies with 200–2,000+ employees in manufacturing, logistics, healthcare, financial services, and professional services.

The sales team generates leads through three channels:
- **Inbound marketing**: content downloads, webinars, website forms
- **SDR outbound**: cold outreach by sales development representatives
- **Partner referral**: introductions through consulting and technology partners

### The lead scoring problem

The sales team can only actively work a limited number of leads per quarter. Your job is to build a **lead scoring model** that predicts which leads are most likely to convert to paying customers within 90 days of entering the pipeline.

A good lead score helps sales prioritize their time — contacting high-probability leads first and deprioritizing unlikely conversions.

## Dataset description

The dataset contains **1,000 leads** observed at **day 14** of their lifecycle (the "snapshot day"). All features are computed from activity that occurred during the first 14 days. The target variable (`converted`) indicates whether the lead converted to a paying customer within 90 days.

### Deal sizes

ProcureFlow's annual contract value (ACV) ranges from **$18,000** (starter plan, small companies) to **$120,000** (enterprise plan, large companies). The `expected_acv` column provides an estimate of the deal size for each lead based on company size and any existing opportunity data.

This variation in deal size means that not all conversions are equally valuable — a model that identifies high-value conversions may be more useful than one that maximizes the number of conversions.

### Acquisition waves

Leads enter the pipeline in three cohorts (`acquisition_wave`): A (earliest), B (middle), and C (most recent), which roughly correspond to successive time periods. Market conditions and lead mix may differ across cohorts, which matters when judging how a model will perform on future data.

## What to expect

- **Base conversion rate**: ~30%
- **Baseline AUC**: A simple logistic regression achieves ~0.63 AUC
- **Missingness**: 5 columns have missing values (2–7% each) due to different data collection processes across lead sources
- **Feature interactions**: The relationship between engagement and conversion is nonlinear — tree-based models capture this better than linear models

## Key columns

| Column | What it measures |
|---|---|
| `industry` | Business sector |
| `region` | Geography |
| `company_size` | Employee headcount band |
| `company_revenue` | Revenue band |
| `contact_role` | Job function of primary contact |
| `seniority` | Job level |
| `lead_source` | How the lead was acquired |
| `opportunity_created` | Whether sales opened an opportunity |
| `demo_completed` | Whether the lead viewed demo content |
| `expected_acv` | Estimated deal size (USD) |
| `inbound_touches` | Marketing touches received |
| `outbound_touches` | Sales touches initiated |
| `touches_week_1` | Touches in first 7 days |
| `touches_last_7_days` | Touches in days 8–14 (recent momentum) |
| `days_since_first_touch` | Time since first engagement |
| `web_sessions` | Website visits |
| `sales_activities` | Sales rep logged activities |
| `days_since_last_touch` | Recency of last engagement |
| `acquisition_wave` | Cohort (A, B, or C) |
| `converted` | **Target**: 1 = converted within 90 days |
188 changes: 188 additions & 0 deletions lead_scoring_intro/RELEASE_v6.md
@@ -0,0 +1,188 @@
# RELEASE v6 — Lead Scoring Intro Dataset

## Overview

v6 is the sixth iteration of the lead scoring intro dataset, designed for **3–4 lectures** on applied ML for lead scoring. It introduces causally-grounded leakage detection, value-aware ranking, nonlinear interaction structure, and cohort-based evaluation.

### Key changes from v5

| Change | v5 | v6 |
|---|---|---|
| Snapshot day | 10 | **14** |
| Leakage trap | Label-noise boost (`total_touches_90d`) | **Causal** post-snapshot touches (days 15–90) |
| Student/instructor split | Single file | **Two files**: student-safe + instructor |
| Momentum features | `touches_week_1`, `days_since_first_touch` | + **`touches_last_7_days`** |
| Cohort feature | — | **`acquisition_wave`** (A/B/C) |
| Value column | `expected_acv` (hard clip) | `expected_acv` (soft winsorize) |
| Tree improvement | Not validated | **Validated**: GBM > LR |
| MCAR injection | `days_since_last_touch` | + `expected_acv` (2%) |

---

## Snapshot definition

- **Snapshot day**: 14 (features computed from events on days 0–14 after lead creation)
- **Horizon**: 90 days (label derived from events through day 90)
- **Target**: `converted` — 1 if a `closed_won` event occurred within 90 days, 0 otherwise
- **Rows**: 1,000 (stratified subsample at 30% conversion rate)
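
The snapshot and horizon rules above can be sketched as a function over one lead's event log. The event representation (`(day, event_type)` tuples), the `"touch"` event name, and the choice of days 0–7 for `touches_week_1` are assumptions for illustration; only `closed_won`, the day-14 snapshot, the 90-day horizon, and the days 8–14 and 15–90 windows come from this document.

```python
SNAPSHOT_DAY = 14
HORIZON_DAYS = 90


def build_snapshot_row(events):
    """events: list of (day, event_type) for one lead, with day counted from lead creation."""
    # Features may only look at events on or before the snapshot day.
    touch_days = [day for day, kind in events if kind == "touch" and 0 <= day <= SNAPSHOT_DAY]
    # Anything after the snapshot is unavailable at scoring time; counting it creates leakage.
    post_touches = [day for day, kind in events
                    if kind == "touch" and SNAPSHOT_DAY < day <= HORIZON_DAYS]
    converted = any(kind == "closed_won" and day <= HORIZON_DAYS for day, kind in events)
    return {
        "touches_week_1": sum(1 for day in touch_days if day <= 7),
        "touches_last_7_days": sum(1 for day in touch_days if 8 <= day <= 14),
        "days_since_last_touch": SNAPSHOT_DAY - max(touch_days) if touch_days else None,
        "converted": int(converted),
        "__leakage__touches_post_snapshot_15_90": len(post_touches),
    }


example = [(1, "touch"), (9, "touch"), (12, "touch"), (30, "touch"), (45, "closed_won")]
print(build_snapshot_row(example))
```

Note that a lead with no touches yields `None` for `days_since_last_touch`, which corresponds to the structural missingness described in this release.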

---

## Column dictionary

### Categorical features (8)

| Column | Type | Description |
|---|---|---|
| `industry` | string | Industry vertical of the buying organization |
| `region` | string | Geographic region (US, UK, EU, APAC) |
| `company_size` | string | Employee headcount band (200–499, 500–999, etc.) |
| `company_revenue` | string | Estimated annual revenue band |
| `contact_role` | string | Functional area of primary contact |
| `seniority` | string | Seniority level (IC, manager, director, VP) |
| `lead_source` | string | Origination channel (inbound_marketing, sdr_outbound, partner_referral) |
| `acquisition_wave` | string | Cohort label (A, B, C) — roughly chronological |

### Binary features (2)

| Column | Type | Description |
|---|---|---|
| `opportunity_created` | int (0/1) | Whether an opportunity existed by snapshot day |
| `demo_completed` | int (0/1) | Whether demo page was viewed (proxy for demo) |

### Numeric features (9)

| Column | Type | Description | Missingness |
|---|---|---|---|
| `expected_acv` | float | Expected ACV in USD ($18k–$120k) | 2% MCAR |
| `inbound_touches` | int | Inbound marketing touches (days 0–14) | — |
| `outbound_touches` | int | Outbound sales touches (days 0–14) | — |
| `touches_week_1` | int | Touches in first 7 days | — |
| `touches_last_7_days` | int | Touches in last 7 days of snapshot window (days 8–14) | — |
| `days_since_first_touch` | float | Days from first touch to snapshot cutoff | Structural (no touches) + 2% MCAR |
| `web_sessions` | int | Web sessions within snapshot window | MAR by lead_source |
| `sales_activities` | int | Sales rep activities within snapshot window | — |
| `days_since_last_touch` | float | Days from last touch to snapshot cutoff | Structural (no touches) + 3% MCAR |

### Target (1)

| Column | Type | Description |
|---|---|---|
| `converted` | int (0/1) | 1 if closed_won within 90 days, 0 otherwise |

### Instructor-only leakage trap (1)

| Column | Type | Description |
|---|---|---|
| `__leakage__touches_post_snapshot_15_90` | int | Touch count in days 15–90 (post-snapshot) |

---

## Missingness patterns

| Pattern | Column(s) | Mechanism | Overall rate |
|---|---|---|---|
| Structural | `days_since_last_touch` | No touches → NaN | ~2% |
| Structural | `days_since_first_touch` | No touches → NaN | ~1% |
| MAR | `web_sessions` | SDR outbound: 15%, inbound: 2%, partner: 5% | ~7% overall |
| MAR | `seniority` | Partner referral: 8%, others: 1% | ~2% overall |
| MCAR | `expected_acv` | Uniform 2% | ~2% |
| MCAR | `days_since_last_touch` | Additional 3% on top of structural | ~3% |
| MCAR | `days_since_first_touch` | Additional 2% on top of structural | ~2% |

Total missingness: ~192 values across 5 columns (~38 values per affected column on average, or about 3.8% of its 1,000 rows).
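
To make the MAR versus MCAR distinction concrete, here is a small sketch of how such gaps could be injected. It is not the pipeline's `inject_missingness` function: the name `inject_missingness_sketch`, the row format, and the sampling logic are hypothetical, and only the rates are taken from the table above.

```python
import random

# Rates taken from the missingness table; everything else is an illustrative assumption.
WEB_SESSIONS_MISSING_RATE = {
    "sdr_outbound": 0.15,
    "inbound_marketing": 0.02,
    "partner_referral": 0.05,
}
EXPECTED_ACV_MCAR_RATE = 0.02


def inject_missingness_sketch(rows, seed=0):
    """Return copies of `rows` with MAR and MCAR gaps; the input is left untouched."""
    rng = random.Random(seed)
    result = []
    for row in rows:
        row = dict(row)
        # MAR: the probability of a gap depends on an observed column (lead_source).
        if rng.random() < WEB_SESSIONS_MISSING_RATE[row["lead_source"]]:
            row["web_sessions"] = None
        # MCAR: the same probability for every row, independent of all columns.
        if rng.random() < EXPECTED_ACV_MCAR_RATE:
            row["expected_acv"] = None
        result.append(row)
    return result


def missing_rate(rows, column):
    return sum(row[column] is None for row in rows) / len(rows)


sources = list(WEB_SESSIONS_MISSING_RATE)
leads = [
    {"lead_source": sources[i % 3], "web_sessions": 3, "expected_acv": 40_000.0}
    for i in range(9_000)
]
masked = inject_missingness_sketch(leads, seed=42)
for source in sources:
    subset = [row for row in masked if row["lead_source"] == source]
    print(source, round(missing_rate(subset, "web_sessions"), 3))
```

Seeding the generator keeps the injected gaps deterministic, which is what a determinism check on the exported CSVs would rely on.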

---

## Baseline metrics

Evaluated on a 70/30 stratified hold-out split (seed 42).

### Logistic Regression baseline

| Metric | Value |
|---|---|
| ROC-AUC | 0.627 |
| PR-AUC | 0.405 |
| Base rate | 30.0% |
| Precision@25 | 0.480 (Lift: 1.60x) |
| Precision@50 | 0.420 (Lift: 1.40x) |
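
Precision@K and lift can be computed with a few lines of plain Python. The sketch below uses toy scores and labels (not the v6 data) and hypothetical helper names; lift is precision@K divided by the base rate, which is how 0.480 / 0.30 gives the 1.60x in the table.

```python
def precision_at_k(scores, labels, k):
    """Fraction of converted leads among the k highest-scored leads."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def lift_at_k(scores, labels, k):
    """Precision@k relative to the overall conversion rate."""
    base_rate = sum(labels) / len(labels)
    return precision_at_k(scores, labels, k) / base_rate


# Toy example: 10 leads, 3 conversions (30% base rate).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
print(precision_at_k(toy_scores, toy_labels, 2))
print(lift_at_k(toy_scores, toy_labels, 2))
```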

### Tree model comparison (5-seed average)

| Model | Mean AUC | vs LR |
|---|---|---|
| Logistic Regression | 0.658 | — |
| GBM (100 trees) | 0.680 | +0.022 |

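Averaged over the five seeds, GBM outperforms LR by 0.022 AUC, which is attributed to nonlinear interactions in the data-generating process (DGP): latent traits interact with engagement patterns. The LR mean here (0.658) is a five-seed average and so differs from the single seed-42 split reported above (0.627).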

### Value-aware ranking

| K | ACV captured, ranked by P(convert) | ACV captured, ranked by expected value | Uplift |
|---|---|---|---|
| 25 | $884,208 | $1,039,434 | +17.6% |
| 50 | $1,379,208 | $1,949,380 | +41.3% |
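
A minimal sketch of the comparison behind this table, on toy leads rather than the v6 data. It assumes "ACV captured" means the summed `expected_acv` of converted leads within the top K, and it assumes missing `expected_acv` values have already been imputed; the helper name and the `p` field (a model's predicted conversion probability) are hypothetical.

```python
def acv_captured(leads, k, key):
    """Sum expected_acv over converted leads among the top-k leads under `key`."""
    top = sorted(leads, key=key, reverse=True)[:k]
    return sum(lead["expected_acv"] for lead in top if lead["converted"])


toy_leads = [
    {"p": 0.9, "expected_acv": 20_000, "converted": 1},
    {"p": 0.6, "expected_acv": 100_000, "converted": 1},
    {"p": 0.5, "expected_acv": 18_000, "converted": 0},
    {"p": 0.4, "expected_acv": 30_000, "converted": 0},
]

by_probability = acv_captured(toy_leads, 1, key=lambda lead: lead["p"])
by_expected_value = acv_captured(toy_leads, 1, key=lambda lead: lead["p"] * lead["expected_acv"])
print(by_probability, by_expected_value)

# Uplift as reported in the table: ratio of captured ACV minus one.
print(round((1_039_434 / 884_208 - 1) * 100, 1))
```

Ranking by probability alone picks the most likely but smallest deal, while ranking by `p * expected_acv` picks the larger deal with a lower conversion probability.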

### Leakage trap evaluation (instructor dataset)

| Metric | Value |
|---|---|
| Column | `__leakage__touches_post_snapshot_15_90` |
| Seeds | 10 (42–51) |
| Mean AUC delta | 0.0458 |
| Min AUC delta | 0.0343 |
| Max AUC delta | 0.0599 |

The trap is **causally grounded**: post-snapshot touches are higher for leads with higher latent intent/fit (the same traits that drive conversion). No label-noise injection is used.
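
The AUC delta is the difference between the hold-out ROC-AUC of a model trained with the trap column and one trained without it. The sketch below shows only the metric side, on toy scores: `roc_auc` is a plain pairwise implementation written for illustration, and the score lists stand in for model outputs that the real validator would produce per seed.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def trap_delta(scores_without_trap, scores_with_trap, labels):
    """AUC gained by adding the leakage column to the feature set."""
    return roc_auc(scores_with_trap, labels) - roc_auc(scores_without_trap, labels)


toy_labels = [1, 1, 0, 0, 1, 0]
without_trap = [0.6, 0.4, 0.5, 0.3, 0.7, 0.2]  # one positive is ranked below a negative
with_trap = [0.8, 0.6, 0.5, 0.3, 0.9, 0.2]     # leaked signal separates the classes
print(round(trap_delta(without_trap, with_trap, toy_labels), 4))
```

Repeating this over several train/test seeds and reporting the mean and minimum delta gives summary figures of the kind shown in the table.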

---

## Teaching guidance

### Lecture 1: Pipeline + Evaluation

**Goal**: Students build their first ML pipeline and learn proper evaluation.

- Load `lead_scoring_intro_v6.csv`
- Handle missing values (5 columns have NaN — discuss structural vs MCAR)
- Build a baseline logistic regression with train/test split
- Evaluate: AUC, PR-AUC, confusion matrix
- Discuss class imbalance (30% positive rate)

### Lecture 2: Top-K + Expected Value Ranking

**Goal**: Students learn decision-oriented evaluation.

- Precision@K and Lift@K: "If sales can contact 50 leads, how many convert?"
- Expected value ranking: `P(convert) * expected_acv`
- Demonstrate that EV ranking captures more ACV than probability ranking: +17.6% at K=25 and +41.3% at K=50
- Discuss when value-aware scoring matters (heterogeneous deal sizes)

### Lecture 3: Feature Engineering + Error Slicing

**Goal**: Students learn to improve models through feature understanding.

- Examine `touches_last_7_days` (momentum) vs `touches_week_1` (early signal)
- Error analysis by `region`, `company_size`, `lead_source`
- Missing value patterns: why is `web_sessions` missing more for SDR outbound?
- Feature interactions: `opportunity_created` x `touches_last_7_days`

### Lecture 4: Trees/GBM + Nonlinearity (+ optional cohort shift)

**Goal**: Students see why tree models outperform linear models.

- Train GBM, compare AUC vs LR (+0.02 on average)
- Feature importance from RF/GBM
- Discuss nonlinear interactions captured by trees
- **Optional**: use `acquisition_wave` for cohort split (train A/B, test C)
  - Demonstrates distribution shift and evaluation realism
  - Random split AUC: 0.633, Cohort split AUC: 0.637 (small difference here, but teaches the concept)
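
The optional cohort split can be sketched as a simple filter on `acquisition_wave`. The helper name and the dict-per-row format are assumptions for illustration; in class the same filter would typically be applied to a DataFrame.

```python
def cohort_split(rows, train_waves=("A", "B"), test_waves=("C",)):
    """Train on earlier cohorts, test on the most recent one (no shuffling across waves)."""
    train = [row for row in rows if row["acquisition_wave"] in train_waves]
    test = [row for row in rows if row["acquisition_wave"] in test_waves]
    return train, test


toy_rows = [
    {"acquisition_wave": "A", "converted": 1},
    {"acquisition_wave": "B", "converted": 0},
    {"acquisition_wave": "C", "converted": 1},
    {"acquisition_wave": "A", "converted": 0},
    {"acquisition_wave": "C", "converted": 0},
]
train_rows, test_rows = cohort_split(toy_rows)
print(len(train_rows), len(test_rows))
```

Unlike a random split, the test set here contains only leads from the latest wave, which mimics scoring future leads with a model trained on past ones.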

### Instructor note: Leakage detection exercise

Use `lead_scoring_intro_v6_instructor.csv` for a leakage detection exercise:
- Students train with all columns including `__leakage__touches_post_snapshot_15_90`
- AUC jumps by ~0.046 on average
- Challenge: identify which column is leaking and explain why
- The trap is causally grounded (future engagement correlates with conversion via shared latent traits), making it a realistic example of temporal leakage