From 1f45df09976b1cf10e5cca6a0a366a3270ff4364 Mon Sep 17 00:00:00 2001 From: Shay Palachy Date: Wed, 29 Apr 2026 15:40:48 +0300 Subject: [PATCH] plan: v4 lead scoring dataset + leadforge engine roadmap MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add comprehensive v4 planning docs, updated agent instructions, and revised project roadmap driven by dataset needs from v1–v3 iterations. Docs added: - docs/v4/lead_scoring_v4_requirements.md — 7 requirements (value features, temporal momentum, structured missingness, leakage trap, redundancy fix, stronger category signal, robust validation) - docs/v4/dataset_contract.md — schema contract, temporal gates, missingness patterns, subsampling rules - docs/v4/engine_changes_spec.md — category effect scaling, windowed snapshot builder, structured missingness, leakage trap feature - docs/v4/validation_spec.md — 8 mandatory checks + 3 warning checks - docs/v4/implementation_plan.md — 4 milestones (M0–M3) with acceptance criteria and explicit mapping from existing roadmap items Updated: - CLAUDE.md — added repo map, generation workflow, student_public invariants, feature addition guide, v4 plan pointers - AGENTS.md — added v4 implementation guide, coding conventions, validation checklist, local testing commands - .agent-plan.md — v4 milestones as next work; M12–M15 items explicitly triaged (deferred/absorbed/discarded) No code changes. All 590 existing tests pass. 
Co-Authored-By: Claude Opus 4.6 --- .agent-plan.md | 95 ++++++++++--- AGENTS.md | 75 +++++++++++ CLAUDE.md | 146 ++++++++++++++++++++ docs/v4/dataset_contract.md | 72 ++++++++++ docs/v4/engine_changes_spec.md | 170 +++++++++++++++++++++++ docs/v4/implementation_plan.md | 172 ++++++++++++++++++++++++ docs/v4/lead_scoring_v4_requirements.md | 131 ++++++++++++++++++ docs/v4/validation_spec.md | 108 +++++++++++++++ 8 files changed, 949 insertions(+), 20 deletions(-) create mode 100644 docs/v4/dataset_contract.md create mode 100644 docs/v4/engine_changes_spec.md create mode 100644 docs/v4/implementation_plan.md create mode 100644 docs/v4/lead_scoring_v4_requirements.md create mode 100644 docs/v4/validation_spec.md diff --git a/.agent-plan.md b/.agent-plan.md index 1139182..fd215c5 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -10,22 +10,90 @@ --- -## Next Up — Milestone 12: CLI polish + JSON output (v0.5.0) +## Next Up — v4 Lead Scoring Dataset -Goal: Polish CLI commands with JSON output mode, richer help text, and progress feedback. +The primary focus is producing a v4 lead scoring dataset that fixes the issues found in v1–v3 datasets. This requires targeted engine changes followed by dataset build scripts. -- [ ] Add `--json` flag to `inspect` and `validate` for machine-readable output -- [ ] Add `--strict` flag to `validate` to control whether realism checks are errors vs warnings -- [ ] Improve CLI help text and error messages -- [ ] Tests for JSON output mode +See `docs/v4/implementation_plan.md` for full details. 
+ +### v4-M0: Requirements + planning ⬜ (this PR) + +- [x] `docs/v4/lead_scoring_v4_requirements.md` +- [x] `docs/v4/dataset_contract.md` +- [x] `docs/v4/validation_spec.md` +- [x] `docs/v4/engine_changes_spec.md` +- [x] `docs/v4/implementation_plan.md` +- [x] Updated `CLAUDE.md` with repo map + generation commands +- [x] Updated `AGENTS.md` with v4 implementation guide +- [x] Updated `.agent-plan.md` (this file) + +### v4-M1: Engine — category signal + windowed snapshots ⬜ + +- [ ] Add `category_effect_scale` to difficulty profiles +- [ ] Apply scale in `mechanisms/policies.py` +- [ ] Add `snapshot_day` parameter to `render/snapshots.py` +- [ ] Add new features: `touches_week_1`, `days_since_first_touch`, `expected_acv` +- [ ] Add new `FeatureSpec` entries to `schema/features.py` +- [ ] Tests for all changes +- [ ] Verify category spread ≥15% for key features at intro difficulty + +### v4-M2: Build pipeline + validation ⬜ + +- [ ] `scripts/build_v4_snapshot.py` — day-21 snapshot + leakage trap + structured missingness +- [ ] `scripts/validate_v4_dataset.py` — full validation per `docs/v4/validation_spec.md` +- [ ] Generate test dataset and verify all checks pass +- [ ] LR AUC 0.65–0.90 (without trap); ≥0.03 boost with trap + +### v4-M3: Documentation + release ⬜ + +- [ ] Generate `lead_scoring_intro_v4.csv` (in datasets-private repo) +- [ ] Write `RELEASE_v4.md` +- [ ] Update dataset repo README +- [ ] Update `.agent-plan.md` to reflect completion + +--- + +## Deferred Items + +### From existing roadmap (M12–M15) + +| Item | Status | Rationale | +|---|---|---| +| M12: CLI `--json` flag | Deferred | No consumer needs it yet; add post-v4 | +| M12: CLI `--strict` flag | Deferred | Per-check control is better than global flag | +| M12: CLI help text polish | Deferred | Low priority vs dataset | +| M14: Sample bundle commit | Absorbed into v4-M3 | v4 dataset IS the sample | +| M14: Notebook 1 (inspecting world) | Deferred | Do after v4 ships | +| M14: Notebook 2 (lead 
scoring baseline) | Deferred | v4 validation script covers this | +| M14: Notebook 3 (public vs instructor) | Discarded | No current audience | +| M14: Notebook 4 (recipe customization) | Discarded | Premature | +| M15: Docs polish + v1.0 RC | Deferred | Do after v4 ships | + +### From post-v1 list + +- Second vertical +- LTV labels as first-class task outputs +- Continuous-time / richer event engine +- Plugin architecture +- External-API enrichment +- Web UI or dashboard +- Engine fix: `is_sql=False` → never converts (deterministic invariant) --- ## Context Pointers -- Milestone 12 scope: `docs/leadforge_implementation_plan.md` §10 "Milestone 12" +- v4 requirements: `docs/v4/lead_scoring_v4_requirements.md` +- v4 dataset contract: `docs/v4/dataset_contract.md` +- v4 engine changes: `docs/v4/engine_changes_spec.md` +- v4 validation spec: `docs/v4/validation_spec.md` +- v4 implementation plan: `docs/v4/implementation_plan.md` +- Existing roadmap: `docs/leadforge_implementation_plan.md` - CLI commands: `leadforge/cli/commands/` - Validation modules: `leadforge/validation/` +- Snapshot builder: `leadforge/render/snapshots.py` +- Mechanism policy: `leadforge/mechanisms/policies.py` +- Difficulty profiles: `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` --- @@ -156,16 +224,3 @@ Goal: Polish CLI commands with JSON output mode, richer help text, and progress - `leadforge/recipes/`: registry + `b2b_saas_procurement_v1/recipe.yaml` - `.github/workflows/ci.yml`: lint, typecheck, test matrix (3.11 + 3.12) with coverage upload - 20 tests passing; ruff + mypy clean - ---- - -## Deferred (Post-v1) - -- Second vertical -- LTV labels as first-class task outputs -- Continuous-time / richer event engine -- Plugin architecture -- External-API enrichment -- Web UI or dashboard - -See `docs/leadforge_implementation_plan.md` §10 for the full deferral list. 
diff --git a/AGENTS.md b/AGENTS.md index 0827d08..d67ced5 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -25,3 +25,78 @@ Do **not** leave threads unresolved after the commit is pushed. ## Branch & PR Conventions See CLAUDE.md for the full mandatory branch/PR workflow (branch → commit → update `.agent-plan.md` → open PR). + +--- + +## v4 Implementation Guide + +### What is v4? + +A pedagogically improved lead scoring dataset (single CSV) for an intro ML course. The engine changes are small and targeted. See `docs/v4/` for full specs. + +### Implementation order + +``` +v4-M0 (planning PR — already done) + └── v4-M1: engine changes (category signal + windowed snapshots) + └── v4-M2: build pipeline + validation scripts + └── v4-M3: dataset generation + release docs +``` + +### Key files to modify per milestone + +**v4-M1 (engine):** +- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_effect_scale` +- `leadforge/mechanisms/policies.py` — apply scale to categorical influences +- `leadforge/render/snapshots.py` — add `snapshot_day` param, windowed aggregation +- `leadforge/schema/features.py` — add new FeatureSpec entries +- Tests in `tests/mechanisms/` and `tests/render/` + +**v4-M2 (build pipeline):** +- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap +- `scripts/validate_v4_dataset.py` (new) — dataset-level validation +- These live in the leadforge repo (not datasets-private) + +**v4-M3 (release):** +- Work in `leadforge-datasets-private` repo +- `lead_scoring_intro/lead_scoring_intro_v4.csv` +- `lead_scoring_intro/RELEASE_v4.md` + +### Coding conventions for v4 + +1. **Backward compatibility:** All engine changes must default to current behavior. New parameters must have defaults that produce identical output when unset. +2. **No simulation loop changes:** Do not modify the daily step logic in `engine.py`. v4 changes are in mechanism weights and snapshot rendering only. +3. 
**Temporal correctness:** Every feature computation must be explicitly gated by snapshot day. Use `event_timestamp <= lead_created_at + snapshot_day` — never `<`. +4. **Test coverage:** Every new parameter and feature must have unit tests. Test both `snapshot_day=None` (backward compat) and `snapshot_day=21` (v4 mode). +5. **Determinism:** All new stochastic operations must use seeded RNG. Verify with a determinism test (same seed → identical output). + +### Validation checklist for v4 dataset + +Before declaring v4-M2 complete, the dataset must pass: + +- [ ] 1,000 rows, 18 columns +- [ ] 30% conversion rate (±1%) +- [ ] No deterministic groups (n≥50 at 0% or 100% conversion) +- [ ] LR AUC 0.65–0.90 (without leakage trap) +- [ ] LR AUC boost ≥0.03 when leakage trap included +- [ ] `web_sessions` missingness: outbound rate > 3× inbound rate +- [ ] `seniority` missingness: partner_referral rate > 3× others +- [ ] Reproducible with seed 42 +- [ ] `total_touches_all` uses full 90-day data (confirmed by AUC boost) + +### How to test engine changes locally + +```bash +# Quick smoke test: generate a small bundle and inspect +leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 --difficulty intro --n-leads 1000 --out /tmp/test_bundle +leadforge validate /tmp/test_bundle + +# Check category signal spread +python -c " +import pandas as pd +df = pd.read_parquet('/tmp/test_bundle/tasks/converted_within_90_days/train.parquet') +for col in ['role_function', 'seniority', 'estimated_revenue_band']: + rates = df.groupby(col)['converted_within_90_days'].mean() + print(f'{col}: spread={rates.max()-rates.min():.1%}') +" +``` diff --git a/CLAUDE.md b/CLAUDE.md index 0a9a5dc..1e3ae7e 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -212,7 +212,153 @@ Key abstractions: `Recipe`, `GenerationConfig`, `WorldSpec`, `WorldBundle`, `Exp --- +## Repository Map + +``` +leadforge/ # Python package root +├── api/ # Public API: Generator, Recipe, Bundle +│ ├── generator.py # 
Generator.from_recipe() → .generate() → WorldBundle +│ ├── recipes.py # Recipe loading, config resolution +│ └── bundle.py # write_bundle() orchestrator +├── cli/ # Click CLI +│ ├── main.py # CLI entry point +│ └── commands/ # generate, inspect, validate, list_recipes +├── core/ # Foundational utilities +│ ├── rng.py # RNGRoot with named substreams +│ ├── ids.py # Deterministic ID generation (acct_000001, etc.) +│ ├── models.py # GenerationConfig, WorldSpec, WorldBundle +│ ├── enums.py # ExposureMode, DifficultyProfile +│ └── exceptions.py # Custom exception hierarchy +├── narrative/ # Vertical narrative (company, market, personas) +│ ├── spec.py # NarrativeSpec and sub-spec dataclasses +│ └── dataset_card.py # Markdown dataset card renderer +├── schema/ # Relational data model +│ ├── entities.py # 9 entity row dataclasses (AccountRow, LeadRow, etc.) +│ ├── features.py # LEAD_SNAPSHOT_FEATURES — canonical feature spec +│ ├── relationships.py # FK constraints (ALL_CONSTRAINTS) +│ ├── tasks.py # SplitSpec, TaskManifest, CONVERTED_WITHIN_90_DAYS +│ └── dictionaries.py # Feature dictionary CSV writer +├── structure/ # Hidden world graph +│ ├── graph.py # WorldGraph (DAG wrapper) +│ ├── motifs.py # 5 motif families +│ ├── rewiring.py # Stochastic graph perturbation +│ └── sampler.py # sample_hidden_graph() +├── mechanisms/ # Node/edge behavior +│ ├── policies.py # assign_mechanisms() — motif → MechanismAssignment +│ ├── hazards.py # ConversionHazard +│ ├── transitions.py # StageSequence, HazardTransition +│ ├── counts.py # PoissonIntensity, RecencyDecayIntensity +│ ├── categorical.py # CategoricalInfluence, CHANNEL_QUALITY_SCORES +│ └── scores.py # LatentScore +├── simulation/ # World evolution +│ ├── engine.py # simulate_world() — 90-day daily loop +│ ├── state.py # LeadSimState (per-lead mutable state) +│ └── population.py # build_population() — accounts, contacts, leads +├── render/ # Bundle output +│ ├── snapshots.py # build_snapshot() — ML-ready lead table +│ ├── 
relational.py # to_dataframes() — 9-table dict +│ ├── tasks.py # write_task_splits() — train/valid/test Parquet +│ └── manifests.py # build_manifest(), write_manifest() +├── exposure/ # Truth filtering +│ ├── modes.py # apply_exposure() dispatch +│ ├── metadata.py # write_metadata_dir() for instructor mode +│ └── filters.py # BundleFilter, FILTERS dict +├── validation/ # Bundle quality checks +│ ├── bundle_checks.py # validate_bundle() orchestrator +│ ├── invariants.py # Determinism + exposure monotonicity +│ ├── realism.py # Conversion rates, feature ranges, stage diversity +│ ├── difficulty.py # Known difficulty profile validation +│ └── drift.py # Cross-seed stability +└── recipes/ # Recipe definitions + └── b2b_saas_procurement_v1/ + ├── recipe.yaml # Recipe metadata + defaults + ├── narrative.yaml # Company, product, market, personas, funnel + └── difficulty_profiles.yaml # intro/intermediate/advanced +``` + +### Related repos + +- **leadforge-datasets-private** — generated dataset archive + - `b2b_saas_procurement_v1__intro__seed42/` — full relational bundle + - `lead_scoring_intro/` — simplified single-CSV versions (v1–v4) + - `scripts/` — build and validation scripts for simplified CSVs + +--- + +## Generation Workflow + +### Generate a full bundle + +```bash +leadforge generate \ + --recipe b2b_saas_procurement_v1 \ + --seed 42 \ + --mode student_public \ + --difficulty intro \ + --n-leads 5000 \ + --out ./out/bundle +``` + +### Build a simplified CSV (v4 example) + +```bash +# In leadforge-datasets-private repo: +python scripts/build_v4_snapshot.py /path/to/bundle lead_scoring_intro/lead_scoring_intro_v4.csv +``` + +### Validate a simplified CSV + +```bash +python scripts/validate_v4_dataset.py lead_scoring_intro/lead_scoring_intro_v4.csv +``` + +### Validate a full bundle + +```bash +leadforge validate ./out/bundle +``` + +--- + +## student_public Mode Invariants + +These are non-negotiable for any dataset published in `student_public` mode: + +1. 
**No post-snapshot features** — all features computed from events ≤ snapshot day only. +2. **No outcome-stage columns** — `current_stage`, `funnel_stage` with `closed_won`/`closed_lost` are banned. +3. **No deterministic single-feature mapping** — for any feature value with n≥50, conversion rate must be in [2%, 98%]. +4. **No hidden truth** — latent scores, mechanism parameters, world graph not included. +5. **No direct outcome columns** — `conversion_timestamp`, `close_outcome` are banned. +6. **No zero-variance features** — every included feature must have ≥2 distinct values. + +Exception: deliberately included leakage traps (e.g., `total_touches_all` in v4) must be clearly documented in release notes and feature dictionary. + +--- + +## How to Add New Features to the Snapshot + +1. Add a `FeatureSpec` entry to `LEAD_SNAPSHOT_FEATURES` in `leadforge/schema/features.py`. +2. Compute the feature value in `build_snapshot()` in `leadforge/render/snapshots.py`. +3. If the feature needs new event data, add it to the simulation loop in `leadforge/simulation/engine.py`. +4. Update `leadforge/schema/dictionaries.py` if the feature dictionary format changes. +5. Run `pytest` and `leadforge validate` on a generated bundle. +6. Update the feature dictionary CSV description. + +--- + +## v4 Dataset Plan + +The current focus is producing a v4 lead scoring intro dataset. 
See `docs/v4/` for: +- `lead_scoring_v4_requirements.md` — what v4 must achieve +- `dataset_contract.md` — schema contract and temporal gates +- `engine_changes_spec.md` — what changes in the engine +- `validation_spec.md` — automated validation checks +- `implementation_plan.md` — milestone breakdown + +--- + ## Reference Docs - Design decisions: `docs/leadforge_design_doc.md` - Architecture/spec: `docs/leadforge_architecture_spec.md` - Implementation roadmap: `docs/leadforge_implementation_plan.md` +- v4 dataset plan: `docs/v4/implementation_plan.md` diff --git a/docs/v4/dataset_contract.md b/docs/v4/dataset_contract.md new file mode 100644 index 0000000..4b7cb0a --- /dev/null +++ b/docs/v4/dataset_contract.md @@ -0,0 +1,72 @@ +# v4 Dataset Contract + +## Snapshot definition + +- **Snapshot day:** Day 21 after `lead_created_at` (configurable, default 21). +- **Observation window:** Days 0–21 inclusive. All features computed from events in this window only. +- **Prediction horizon:** Days 22–90. The target `converted` reflects whether `closed_won` occurs in the full 90-day window. +- **Temporal guarantee:** No feature (except the explicitly marked leakage trap) uses information from after the snapshot day. + +## What is pre-snapshot (valid for features) + +| Data source | Temporal gate | +|---|---| +| Account attributes | Static — always valid | +| Contact attributes | Static — always valid | +| Lead metadata (source, etc.) 
| Lead creation — always valid | +| Touch events | `touch_timestamp ≤ lead_created_at + snapshot_day` | +| Session events | `session_timestamp ≤ lead_created_at + snapshot_day` | +| Sales activity events | `activity_timestamp ≤ lead_created_at + snapshot_day` | +| Opportunity records | `opportunity.created_at ≤ lead_created_at + snapshot_day` | +| ACV estimates | From opportunity if available by snapshot; else account heuristic | + +## What is post-snapshot (invalid for features) + +| Data | Why invalid | +|---|---| +| `current_stage` at day 90 | Contains `closed_won` / `closed_lost` — outcome data | +| `is_sql` (final state flag) | Engine invariant: `is_sql=False` → never converts. Deterministic. | +| `conversion_timestamp` | Direct outcome information | +| Touch/session/activity events after snapshot day | Future data | +| Opportunity close outcome | Post-outcome | +| `total_touches_all` | ⚠️ Intentional leakage trap — counts full 90-day touches | + +## Leakage trap contract + +The feature `total_touches_all` deliberately violates the snapshot boundary: +- It counts touches over the **full 90-day simulation**, not just up to snapshot. +- It is included to teach students about temporal leakage detection. +- It must be clearly marked in the feature dictionary and release notes. +- The validation script must detect it and flag it (but not fail the build). +- Removing this feature should drop AUC by ≥0.03. 
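As a hedged illustration of the contract above (not the project's build script), the sketch below contrasts a snapshot-gated touch count with the full-horizon count that defines `total_touches_all`. The helper `count_touches` and the sample timestamps are hypothetical; only the inclusive `≤ lead_created_at + snapshot_day` gate and the day-21 default come from this document.

```python
from datetime import datetime, timedelta

SNAPSHOT_DAY = 21  # v4 default, per the snapshot definition in this contract


def count_touches(lead_created_at, touch_timestamps, snapshot_day=None):
    """Count touches for one lead (hypothetical helper, for illustration only).

    With snapshot_day set, only touches at or before the snapshot boundary
    are counted. With snapshot_day=None, every touch is counted, which is
    how the leakage trap total_touches_all is described above.
    """
    if snapshot_day is None:
        # No temporal gate: full-horizon count (the leakage trap).
        return len(touch_timestamps)
    cutoff = lead_created_at + timedelta(days=snapshot_day)
    # Inclusive comparison (<=), matching the temporal gate table.
    return sum(1 for ts in touch_timestamps if ts <= cutoff)


# Made-up lead with touches on days 0, 3, 21, 22, and 60 after creation.
created = datetime(2026, 1, 1)
touches = [created + timedelta(days=d) for d in (0, 3, 21, 22, 60)]

gated = count_touches(created, touches, snapshot_day=SNAPSHOT_DAY)
leaky = count_touches(created, touches)
```

Here the gated count is 3 (days 0, 3, and 21) and the ungated count is 5. The two touches after day 21 are the post-snapshot information that the trap exposes.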
+ +## Missingness contract + +| Column | Pattern | Rate | Condition | +|---|---|---|---| +| `days_since_last_touch` | Structural | Natural | NaN when `total touches == 0` by snapshot | +| `web_sessions` | Source-conditional | ~15% for `sdr_outbound`, ~2% for `inbound_marketing`, ~5% for `partner_referral` | CRM tracking gaps | +| `seniority` | Source-conditional | ~8% for `partner_referral`, ~1% for others | Referral partners omit contact details | +| `days_since_last_touch` | Additional MCAR | ~3% | Random CRM logging gaps (on top of structural) | + +## Target definition + +``` +converted = 1 if lead reached closed_won within 90 days of lead_created_at +converted = 0 otherwise (including closed_lost, still in funnel, churned) +``` + +The target is derived from simulated events, never directly sampled. + +## Subsampling contract + +- Source bundle: 5,000 leads generated with `b2b_saas_procurement_v1`, seed 42, difficulty intro. +- Stratified subsampling to 1,000 rows at ~30% conversion rate. +- All negatives retained (up to 700); positives downsampled. +- Subsampling preserves within-class feature distributions. + +## Reproducibility + +- Seed: 42 (or documented if changed). +- All stochastic operations use `np.random.RandomState(seed)` or derived substreams. +- Same (seed, recipe, leadforge version) → byte-identical CSV output. diff --git a/docs/v4/engine_changes_spec.md b/docs/v4/engine_changes_spec.md new file mode 100644 index 0000000..d3712ca --- /dev/null +++ b/docs/v4/engine_changes_spec.md @@ -0,0 +1,170 @@ +# v4 Engine Changes Specification + +## Overview + +v4 requires **two categories** of changes to the leadforge codebase: +1. **Mechanism / difficulty tuning** — make intro difficulty produce stronger category-level signal. +2. **Snapshot builder enhancements** — compute windowed aggregates, ACV derivation, structured missingness, and the leakage trap feature. 
+ +Neither category requires changes to the simulation loop itself (`engine.py`'s daily step logic). The simulation produces the same event stream; we change how features are derived from it. + +--- + +## Change 1: Stronger category signal at intro difficulty + +### Problem + +The current mechanism policy (`mechanisms/policies.py`) produces conversion rates that are nearly uniform across categories at intro difficulty. For example, conversion rates across `contact_role` values span only about 11 percentage points (25.6%–36.7% after subsampling). This yields LR AUC ~0.62, which is too low for a useful teaching dataset. + +### Root cause + +The `assign_mechanisms()` function builds a `LatentScore` with weights that are quite flat across categories. The intro difficulty profile specifies `signal_strength: 0.90`, but this controls noise scale, not the magnitude of category effects. + +### Solution + +Add **category effect multipliers** to the difficulty profile YAML: + +```yaml +intro: + # ... existing fields ... + category_effect_scale: 1.8 # amplify category → latent score effects +``` + +In `mechanisms/policies.py`, scale the `CategoricalInfluence` weights by `category_effect_scale` when building the `LatentScore`. This widens the gap between, say, `vp_finance` and `it_director` conversion rates without changing the overall noise structure. + +### Target outcome + +After this change + subsampling to 30%, category spreads should be: +- `contact_role`: ≥15% spread +- `company_revenue`: ≥12% spread +- `seniority`: ≥10% spread +- Baseline LR AUC: 0.70–0.85 + +### Files affected + +- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_effect_scale` +- `leadforge/mechanisms/policies.py` — use `category_effect_scale` when building categorical influences +- `tests/mechanisms/test_policies.py` — test that different scales produce different spread + +### Risk + +Low. The change is additive (new config field with a default of 1.0 for backward compatibility). 
Existing tests continue to pass at `category_effect_scale=1.0`. + +--- + +## Change 2: Snapshot builder — windowed aggregates and new features + +### Problem + +The current `render/snapshots.py` computes all aggregates over the full simulation horizon. v4 needs aggregates gated by a configurable snapshot day, plus new derived features. + +### Solution + +Add a new function or extend `build_snapshot()` to accept a `snapshot_day` parameter: + +```python +def build_snapshot( + result: SimulationResult, + population: PopulationResult, + horizon_days: int = 90, + snapshot_day: int | None = None, # NEW — default None means use horizon_days +) -> pd.DataFrame: +``` + +When `snapshot_day` is set, all event aggregations filter to events within `[lead_created_at, lead_created_at + snapshot_day]`. + +### New features to compute + +| Feature | Computation | Notes | +|---|---|---| +| `touches_week_1` | Count touches where `days_after_creation ≤ 7` | Momentum signal | +| `days_since_first_touch` | `snapshot_day - first_touch_day` (NaN if no touches) | Lead age signal | +| `expected_acv` | Opportunity ACV if opp created by snapshot; else employee_band midpoint | Value feature | +| `total_touches_all` | Count of ALL touches over full horizon (ignoring snapshot gate) | Leakage trap | + +### Files affected + +- `leadforge/render/snapshots.py` — add `snapshot_day` parameter, windowed filtering, new feature computations +- `leadforge/schema/features.py` — add new `FeatureSpec` entries for the new columns +- `tests/render/test_snapshots.py` — test windowed aggregation correctness + +### Risk + +Medium. The snapshot builder is well-tested but core to correctness. The `snapshot_day` parameter should be additive (default `None` preserves existing behavior). New features are computed alongside existing ones. + +--- + +## Change 3: Structured missingness injection + +### Problem + +Current missingness is MCAR (random injection). v4 needs conditional missingness. 
+ +### Solution + +Add a missingness injection step to the v4 build script (NOT to the engine's `build_snapshot`). This keeps the engine's output clean and makes missingness a dataset-packaging concern. + +The build script (`scripts/build_v4_snapshot.py`) applies missingness after snapshot construction: + +```python +def inject_missingness(df: pd.DataFrame, rng: np.random.RandomState) -> pd.DataFrame: + # 1. web_sessions: 15% missing for sdr_outbound, 2% inbound, 5% partner + for source, rate in [("sdr_outbound", 0.15), ("inbound_marketing", 0.02), ("partner_referral", 0.05)]: + mask = (df["lead_source"] == source) & (rng.random(len(df)) < rate) + df.loc[mask, "web_sessions"] = np.nan + + # 2. seniority: 8% missing for partner_referral, 1% for others + partner_mask = (df["lead_source"] == "partner_referral") & (rng.random(len(df)) < 0.08) + other_mask = (df["lead_source"] != "partner_referral") & (rng.random(len(df)) < 0.01) + df.loc[partner_mask | other_mask, "seniority"] = np.nan + + # 3. days_since_last_touch: additional 3% MCAR on top of structural NaN + dslt_mask = rng.random(len(df)) < 0.03 + df.loc[dslt_mask, "days_since_last_touch"] = np.nan + + return df +``` + +### Files affected + +- `scripts/build_v4_snapshot.py` (new) — missingness injection +- No changes to `leadforge/` core modules for missingness + +### Risk + +Low. Missingness is applied post-generation, outside the engine. + +--- + +## Change 4: Leakage trap feature + +### Problem + +Students need a feature that looks valid but violates temporal boundaries. + +### Solution + +The v4 build script computes `total_touches_all` by counting ALL touches in the full 90-day window (not gated by snapshot day). This is computed alongside the snapshot but uses different temporal filtering. + +### Files affected + +- `scripts/build_v4_snapshot.py` — compute `total_touches_all` from full event stream +- Feature dictionary and release notes — mark as leakage trap + +### Risk + +None to the engine. 
The trap is a build-script concern. + +--- + +## Summary of engine vs. script changes + +| Change | Where | Risk | +|---|---|---| +| Category effect scaling | `leadforge/` core (mechanisms, difficulty profiles) | Low | +| Snapshot `snapshot_day` parameter | `leadforge/` core (render/snapshots) | Medium | +| New features (ACV, momentum, first_touch) | `leadforge/` core (render/snapshots, schema/features) | Medium | +| Structured missingness | `scripts/` (build script only) | Low | +| Leakage trap | `scripts/` (build script only) | None | + +Total engine-side changes: ~200–400 lines across 4–5 files. Build script: ~250 lines new. diff --git a/docs/v4/implementation_plan.md b/docs/v4/implementation_plan.md new file mode 100644 index 0000000..89b8238 --- /dev/null +++ b/docs/v4/implementation_plan.md @@ -0,0 +1,172 @@ +# v4 Implementation Plan + +## Overview + +This plan implements the v4 lead scoring dataset in 4 milestones across 4–6 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria. + +## Relationship to existing roadmap + +v4 work slots into the existing leadforge roadmap as follows: + +| Existing milestone | Status | v4 interaction | +|---|---|---| +| M0–M11 | ✅ Complete | No changes needed | +| M12 (CLI polish) | ⬜ Planned | **Deferred** — low priority vs v4 dataset needs. Integrate after v4. | +| M13 (Validation harness) | ✅ Implemented as M11 | v4 extends with dataset-level validation | +| M14 (Sample datasets + notebooks) | ⬜ Planned | **Absorbed into v4-M3** — v4 dataset IS the sample dataset | +| M15 (Docs polish + v1.0 RC) | ⬜ Planned | **Deferred** — do after v4 ships | + +### Explicitly deferred or discarded items + +| Item | Rationale | +|---|---| +| M12 `--json` flag for inspect/validate | Nice-to-have; no dataset consumer needs it yet. Can add later. | +| M12 `--strict` flag for validate | Validation strictness is better controlled per-check, not globally. 
|
+| M14 Notebook 3 (public vs instructor comparison) | No current audience for this; instructor mode is not used in the course. |
+| M14 Notebook 4 (recipe customization walkthrough) | Premature — recipe system is stable but not user-facing yet. |
+
+### Explicitly kept / integrated items
+
+| Item | How it maps to v4 |
+|---|---|
+| M14 Sample bundle generation | v4-M2 generates the source bundle |
+| M14 Lead-scoring baseline notebook | v4-M3 includes a validation notebook or script |
+| M15 Docs audit | v4-M0 updates CLAUDE.md and AGENTS.md; v4-M3 produces RELEASE_v4.md |
+
+---
+
+## v4 Milestones
+
+### v4-M0: Requirements, contract, and agent instructions
+
+**Goal:** Establish the v4 dataset contract and update repo documentation so implementation can begin immediately.
+
+**Deliverables:**
+- `docs/v4/lead_scoring_v4_requirements.md` — full requirements
+- `docs/v4/dataset_contract.md` — schema contract, temporal gates, missingness
+- `docs/v4/validation_spec.md` — automated check specifications
+- `docs/v4/engine_changes_spec.md` — what changes where and why
+- `docs/v4/implementation_plan.md` — this file
+- Updated `CLAUDE.md` — repository map, generation/validation commands
+- Updated `AGENTS.md` — implementation conventions for v4 work
+- Updated `.agent-plan.md` — reflects v4 as next work
+
+**Acceptance criteria:**
+- [ ] All docs are internally consistent
+- [ ] CLAUDE.md contains repo map and commands
+- [ ] .agent-plan.md points to v4 milestones
+- [ ] No contradictions with existing architecture docs
+
+**PR:** This PR (the planning PR).
+
+---
+
+### v4-M1: Engine — category signal tuning + snapshot enhancements
+
+**Goal:** Make the engine produce datasets with stronger category signal and support windowed snapshot computation.
+
+**Deliverables:**
+1. `difficulty_profiles.yaml` — add `category_effect_scale: 1.8` to intro profile
+2. `mechanisms/policies.py` — apply `category_effect_scale` to categorical influence weights
+3. `render/snapshots.py` — add optional `snapshot_day` parameter for windowed aggregation
+4. `schema/features.py` — add `FeatureSpec` entries for new columns (`touches_week_1`, `days_since_first_touch`, `expected_acv`)
+5. Tests for all changes
+
+**Acceptance criteria:**
+- [ ] `category_effect_scale=1.0` produces identical output to current engine (backward compat)
+- [ ] `category_effect_scale=1.8` produces category spreads ≥15% for `contact_role`
+- [ ] `snapshot_day=21` correctly filters events to first 21 days
+- [ ] `touches_week_1` counts only days 0–7 touches
+- [ ] `expected_acv` uses opportunity ACV when available, else band midpoint
+- [ ] All existing tests pass
+- [ ] New tests cover the new parameters
+
+**Estimated size:** ~400 lines diff across 5 files + tests.
+
+**PR:** Single PR: `feat: v4 engine — category signal tuning + windowed snapshots`
+
+---
+
+### v4-M2: Build pipeline — v4 snapshot builder + structured missingness
+
+**Goal:** Create the v4 build script that transforms a generated bundle into the final CSV.
+
+**Deliverables:**
+1. `scripts/build_v4_snapshot.py` — snapshot builder with:
+   - Day-21 windowed features
+   - Leakage trap feature (`total_touches_all`)
+   - Structured missingness injection
+   - Stratified subsampling to 1,000 rows / 30% conversion
+   - Column selection and renaming
+2. `scripts/validate_v4_dataset.py` — validation script per validation spec
+3. Generated `lead_scoring_intro_v4.csv` (in datasets repo, not leadforge)
+
+**Acceptance criteria:**
+- [ ] Build script produces 1,000 rows × 18 columns
+- [ ] Conversion rate is 30% (±1%)
+- [ ] `total_touches_all` uses full 90-day data (leakage trap)
+- [ ] `web_sessions` missing rate for outbound > 3× inbound rate
+- [ ] `seniority` missing rate for partner_referral > 3× others
+- [ ] `days_since_last_touch` has structural + injected NaNs
+- [ ] Validation script passes all mandatory checks
+- [ ] Baseline LR AUC (without trap) in [0.65, 0.90]
+- [ ] LR AUC boost with trap ≥ 0.03
+- [ ] No deterministic groups (n≥50 at 0% or 100%)
+- [ ] Reproducible with seed 42
+
+**Estimated size:** ~350 lines (build script) + ~200 lines (validator).
+
+**PR:** Single PR: `feat: v4 build pipeline + validation`
+
+---
+
+### v4-M3: Documentation + release
+
+**Goal:** Produce the final dataset files and release documentation.
+
+**Deliverables (in leadforge-datasets-private repo):**
+1. `lead_scoring_intro/lead_scoring_intro_v4.csv`
+2. `lead_scoring_intro/RELEASE_v4.md`
+3. Updated `lead_scoring_intro/BACKGROUND.md` (if needed for v4 framing)
+4. Updated `README.md` (dataset index)
+
+**Deliverables (in leadforge repo):**
+1. Updated `.agent-plan.md` reflecting completion
+
+**Acceptance criteria:**
+- [ ] CSV passes all validation checks
+- [ ] RELEASE_v4.md documents snapshot day, target definition, changes from v3, leakage trap
+- [ ] README in datasets repo marks v4 as recommended
+- [ ] Previous versions marked as superseded
+
+**PR:** Two PRs (one per repo).
+
+---
+
+## Dependency graph
+
+```
+v4-M0 (this PR)
+  └── v4-M1 (engine changes)
+       └── v4-M2 (build pipeline + validation)
+            └── v4-M3 (docs + release)
+```
+
+Strictly sequential — each milestone depends on the previous.
+
+---
+
+## Timeline estimate
+
+Not providing time estimates per project convention. The work is 4 PRs of moderate size (~300–500 lines each).
+
+---
+
+## What this plan does NOT do
+
+- Does not change the simulation loop (`engine.py` daily step logic)
+- Does not change the relational bundle format
+- Does not change exposure modes
+- Does not add new recipes
+- Does not implement M12 (CLI polish) — deferred
+- Does not implement the engine fix for `is_sql=False → never converts` (deferred to a separate issue; v4 avoids `is_sql` entirely)
diff --git a/docs/v4/lead_scoring_v4_requirements.md b/docs/v4/lead_scoring_v4_requirements.md
new file mode 100644
index 0000000..ff2e146
--- /dev/null
+++ b/docs/v4/lead_scoring_v4_requirements.md
@@ -0,0 +1,131 @@
+# Lead Scoring Dataset v4 — Requirements
+
+## Purpose
+
+This document defines the requirements for the **v4 lead scoring intro dataset**, the primary pedagogical output of leadforge for a BA-level intro ML course. It is informed by three prior dataset iterations (v1–v3) and the lessons learned from each.
+
+## Prior version history and lessons
+
+| Version | Key issue | What we learned |
+|---|---|---|
+| v1 | `funnel_stage` contained `closed_won`/`closed_lost` — perfect leakage | Must validate that no single feature determines the target |
+| v2 | Snapshot at day 90 with 90-day target — post-mortem, not prediction | Snapshot must be strictly earlier than outcome horizon |
+| v2 | `reached_sql=0` → 0% conversion (n=127); `has_opportunity=1` → 0% (n=235) | Binary proxies from engine invariants create deterministic groups |
+| v3 | Day-21 snapshot + non-deterministic proxies — clean but AUC only 0.62 | Engine's intro difficulty produces flat category effects; early features lack signal |
+
+## v4 requirements
+
+### R1 — Operational decision framing (capacity + value)
+
+**Problem:** v1–v3 frame lead scoring as pure classification. Real lead scoring is a **decision tool** — ranking leads by expected value, not just probability.
+
+**Requirement:**
+- Include an `expected_acv` numeric feature (estimated annual contract value) available at snapshot time.
+- The feature must be derived from the opportunity table (for leads with an opportunity by snapshot) or from account-level heuristics (employee band → ACV range midpoint) for leads without one.
+- This enables students to compute `expected_value = P(conversion) × expected_acv` and practice ranking/top-K selection.
+
+**Engine change needed:** The snapshot builder must join opportunity ACV data gated by snapshot day, with a fallback to account-band heuristic ACV.
+
+### R2 — Safe temporal / momentum features
+
+**Problem:** v1–v3 engagement features are cumulative counts with no temporal shape. Real lead scoring uses recency and momentum signals.
+
+**Requirement:**
+- Include exactly one momentum feature: `touches_week_1` (touches in days 0–7 after lead creation).
+- This is strictly pre-snapshot (snapshot is at day 21+) and gives students a "first-week intensity" signal to compare against total touches.
+- Additionally, `days_since_first_touch` (snapshot_day minus day of first touch) provides a lead-age signal.
+
+**Engine change needed:** The snapshot builder must compute windowed aggregates from event timestamps.
+
+### R3 — Structured missingness (not only MCAR)
+
+**Problem:** v1–v3 inject missingness randomly (MCAR). Real CRM data has structured gaps.
+
+**Requirement:** Implement three missingness patterns:
+1. **Natural (structural):** `days_since_last_touch` is NaN when `total_touches == 0` (no touches recorded). Already exists but must be preserved.
+2. **Conditional on source:** `web_sessions` is missing for ~15% of `sdr_outbound` leads (CRM tracking often not set up for outbound-sourced leads) but only ~2% of `inbound_marketing` leads.
+3. **Role data gap:** `seniority` is missing for ~8% of `partner_referral` leads (referral partners don't always provide full contact details).
+
+**Engine change needed:** Missingness injection in the snapshot builder, conditioned on feature values.
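
The three R3 missingness patterns above could be injected along the following lines. This is a minimal, dependency-free sketch and not the snapshot builder's actual code: the function name `inject_structured_missingness`, the two rate tables, the rows-as-dicts representation, and the 0% rate for any lead source R3 does not mention are all assumptions made for illustration. Because R5 drops `total_touches`, the structural pattern is keyed here on `inbound_touches + outbound_touches`, which is also an assumption.

```python
import random

# Hypothetical rate tables; the values mirror the approximate R3 targets.
WEB_SESSIONS_MISSING_RATE = {"sdr_outbound": 0.15, "inbound_marketing": 0.02}
SENIORITY_MISSING_RATE = {"partner_referral": 0.08}


def inject_structured_missingness(rows, seed=42):
    """Return copies of `rows` with values blanked out conditional on lead_source."""
    rng = random.Random(seed)  # seeded, so the same seed gives the same output
    out = []
    for row in rows:
        row = dict(row)  # copy, so the caller's data is never mutated
        source = row["lead_source"]
        # Pattern 1 (structural): no touches means there is no last touch.
        if row["inbound_touches"] + row["outbound_touches"] == 0:
            row["days_since_last_touch"] = None
        # Pattern 2: web_sessions gaps depend on the lead source.
        # Sources absent from the table get a 0% rate (assumption).
        if rng.random() < WEB_SESSIONS_MISSING_RATE.get(source, 0.0):
            row["web_sessions"] = None
        # Pattern 3: seniority gaps for partner referrals only.
        if rng.random() < SENIORITY_MISSING_RATE.get(source, 0.0):
            row["seniority"] = None
        out.append(row)
    return out
```

Both random draws happen for every row regardless of source, so the random stream stays aligned with row order and the output depends only on the input rows and the seed.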
+
+### R4 — Deliberate leakage trap
+
+**Problem:** Students need to practice identifying leakage, but v1–v3 either have accidental leakage (bad) or none at all (missed teaching opportunity).
+
+**Requirement:**
+- Include one feature `total_touches_all` that counts **all** touches over the full 90-day window, not just up to snapshot.
+- This feature is strongly predictive (uses future data) but not perfectly deterministic (it correlates with but doesn't fully determine conversion).
+- The feature MUST be clearly labeled as "intentionally invalid — included for leakage discussion" in `RELEASE_v4.md` and the feature dictionary.
+- The validation script must flag it, but the v4 build script intentionally includes it.
+- The `BACKGROUND.md` / student instructions must NOT reveal the trap — students should discover it through EDA.
+
+**Engine change needed:** The snapshot builder computes a second touch count using the full horizon.
+
+### R5 — Reduce redundancy
+
+**Problem:** `total_touches = inbound_touches + outbound_touches` is a perfect linear dependency. Students may be confused by it, or models waste a degree of freedom.
+
+**Requirement:**
+- Drop `total_touches` from v4. Keep `inbound_touches` and `outbound_touches` as the touch breakdown.
+- Note: `total_touches_all` (the leakage trap from R4) is a different feature and is kept.
+- Document this as a teaching point: "you can derive total from inbound + outbound."
+
+### R6 — Stronger category signal
+
+**Problem:** At intro difficulty, category conversion rates span only 2–11%. This makes the dataset nearly impossible to model well (AUC ~0.62).
+
+**Requirement:**
+- The engine must produce category-level conversion rate spreads of at least 15–25% for key features (`contact_role`, `company_revenue`, `seniority`).
+- Target baseline LR AUC: **0.70–0.85** (after snapshot + subsampling).
+- This requires engine changes to the difficulty profile or mechanism weights, not just post-hoc manipulation.
+
+**Engine change needed:** Adjust intro difficulty profile or mechanism policy to produce wider category effects.
+
+### R7 — Robust automated validation
+
+**Requirement:** The v4 dataset must pass all of the following automated checks:
+
+| Check | Criterion |
+|---|---|
+| No banned columns | No `current_stage`, `funnel_stage`, `conversion_timestamp`, `is_sql` |
+| No deterministic groups | For every feature value with n≥50: conversion rate in [2%, 98%] |
+| Conversion rate | In [15%, 40%] |
+| Baseline LR AUC | In [0.65, 0.90] (all features except leakage trap) |
+| Leakage trap AUC boost | AUC with trap > AUC without trap by ≥0.03 |
+| Missingness per column | Each column with nulls: 1–15% missing |
+| Missingness structure | `web_sessions` missing rate for `sdr_outbound` > 3× rate for `inbound_marketing` |
+| Row count | Exactly 1,000 |
+| Column count | 16–18 (features + target) |
+| Reproducibility | Same seed → identical output |
+
+## v4 target column set
+
+| # | Column | Type | Source | Notes |
+|---|---|---|---|---|
+| 1 | `industry` | categorical | account | 4 values |
+| 2 | `region` | categorical | account | US, UK |
+| 3 | `company_size` | categorical | account | 4 bands |
+| 4 | `company_revenue` | categorical | account | 4 bands |
+| 5 | `contact_role` | categorical | contact | 4 roles |
+| 6 | `seniority` | categorical | contact | 5 levels (~8% missing for partner_referral) |
+| 7 | `lead_source` | categorical | lead | 3 channels |
+| 8 | `opportunity_created` | binary 0/1 | derived | Opp opened by snapshot day |
+| 9 | `demo_completed` | binary 0/1 | derived | Demo done by snapshot day |
+| 10 | `expected_acv` | numeric | derived | Opp ACV if available, else band midpoint (R1) |
+| 11 | `inbound_touches` | integer | events ≤ snapshot | Inbound touchpoints |
+| 12 | `outbound_touches` | integer | events ≤ snapshot | Outbound touchpoints |
+| 13 | `touches_week_1` | integer | events ≤ day 7 | First-week touch intensity (R2) |
+| 14 | `web_sessions` | integer | events ≤ snapshot | Sessions (~15% missing for outbound, ~2% inbound) |
+| 15 | `sales_activities` | integer | events ≤ snapshot | Sales activities count |
+| 16 | `days_since_last_touch` | float | events ≤ snapshot | Natural NaN when no touches |
+| 17 | `total_touches_all` | integer | **ALL events** | ⚠️ LEAKAGE TRAP — uses full 90-day window |
+| 18 | `converted` | binary 0/1 | target | Converted within 90 days |
+
+Total: 17 features + 1 target = 18 columns.
+
+## Non-goals for v4
+
+- v4 does NOT require engine changes to the simulation loop itself (stage transitions, churn, conversion hazard).
+- v4 does NOT change the relational bundle format or task splits.
+- v4 does NOT require a new recipe — it uses `b2b_saas_procurement_v1` with adjusted difficulty tuning.
+- v4 does NOT need to change the `student_public` / `research_instructor` exposure modes.
diff --git a/docs/v4/validation_spec.md b/docs/v4/validation_spec.md
new file mode 100644
index 0000000..4c79c62
--- /dev/null
+++ b/docs/v4/validation_spec.md
@@ -0,0 +1,108 @@
+# v4 Validation Specification
+
+## Overview
+
+v4 validation operates at two levels:
+1. **Engine-level validation** (existing `leadforge validate` harness) — structural checks on bundles.
+2. **Dataset-level validation** (new `scripts/validate_v4_dataset.py`) — checks specific to the simplified CSV output.
+
+This document specifies the dataset-level validation for v4.
+
+---
+
+## Mandatory checks
+
+### Check 1: No banned columns
+
+The CSV must NOT contain any of:
+- `current_stage`, `funnel_stage` — outcome-stage leakage
+- `conversion_timestamp` — direct outcome
+- `is_sql` — engine invariant creates deterministic groups
+- `is_mql` — zero variance
+- `lead_created_at` — timestamp that could be used to reverse-engineer temporal info
+- Any column whose name ends with the `_id` suffix (opaque identifiers, not features)
+
+**Implementation:** Set intersection check on column names.
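
The set-intersection check for Check 1 might look like the sketch below. The names `BANNED_COLUMNS` and `find_banned_columns` are hypothetical, and reading the `_id` rule as "column name ends with `_id`" is an assumption about the intended matching.

```python
# Explicitly banned names, copied from the Check 1 list.
BANNED_COLUMNS = {
    "current_stage",
    "funnel_stage",
    "conversion_timestamp",
    "is_sql",
    "is_mql",
    "lead_created_at",
}


def find_banned_columns(columns):
    """Return the sorted list of column names that violate Check 1."""
    cols = set(columns)
    banned = cols & BANNED_COLUMNS  # the set intersection named in the spec
    # Assumption: the `_id` rule matches on the end of the column name.
    banned |= {c for c in cols if c.endswith("_id")}
    return sorted(banned)
```

An empty return value means the check passes; a non-empty one can be reported directly in the failure details.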
+
+### Check 2: No deterministic feature groups
+
+For every feature (categorical AND binary), for every value with n ≥ 50:
+- Conversion rate must be in [0.02, 0.98].
+
+This catches:
+- `reached_sql=0` → 0% (caught in v2)
+- `has_opportunity=1` → 0% (caught in v2)
+- Any future deterministic pattern
+
+**Implementation:** `groupby(feature)[target].agg(['mean', 'count'])`, filter to count ≥ 50, check bounds.
+
+### Check 3: Conversion rate realism
+
+- Overall conversion rate must be in [0.15, 0.40].
+
+### Check 4: Baseline model AUC (without leakage trap)
+
+- Train a logistic regression on all features EXCEPT `total_touches_all`.
+- AUC must be in [0.65, 0.90].
+- If AUC < 0.65: features lack signal (category effects too flat).
+- If AUC > 0.90: likely residual leakage.
+
+### Check 5: Leakage trap effectiveness
+
+- Train a logistic regression with ALL features including `total_touches_all`.
+- AUC must be at least 0.03 higher than the clean model from Check 4.
+- If the trap doesn't boost AUC, it's not an effective teaching tool.
+
+### Check 6: Missingness structure
+
+- `web_sessions` must have nulls.
+- Missing rate for `web_sessions` among `sdr_outbound` leads must be > 3× the rate among `inbound_marketing` leads.
+- `seniority` must have nulls.
+- Missing rate for `seniority` among `partner_referral` leads must be > 3× the rate among non-`partner_referral` leads.
+- `days_since_last_touch` must have nulls.
+- No column should have > 20% missing.
+
+### Check 7: Shape constraints
+
+- Exactly 1,000 rows.
+- 18 columns (17 features + 1 target).
+
+### Check 8: Reproducibility
+
+- Running the build script twice with the same seed produces identical output (byte-level CSV comparison).
+
+---
+
+## Warning checks (non-fatal)
+
+### Warning 1: Leakage trap is labeled
+
+- Check that the feature dictionary (if present) marks `total_touches_all` with `leakage_risk: True`.
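
For reference, the group-level test in Check 2 (the `groupby` described under its **Implementation** note) can be sketched without pandas as shown here. The function name `find_deterministic_groups`, its parameters, and the rows-as-dicts input are hypothetical. Skipping missing values is an assumption, chosen to match the default behaviour of pandas `groupby`, which drops NaN keys.

```python
from collections import defaultdict


def find_deterministic_groups(rows, features, target="converted",
                              min_n=50, lo=0.02, hi=0.98):
    """Return (feature, value, n, rate) for each group outside [lo, hi]."""
    violations = []
    for feature in features:
        counts = defaultdict(lambda: [0, 0])  # value -> [n, conversions]
        for row in rows:
            value = row[feature]
            if value is None:
                continue  # assumption: missing values do not form a group
            counts[value][0] += 1
            counts[value][1] += row[target]
        for value, (n, conversions) in counts.items():
            rate = conversions / n
            # Only groups with at least min_n rows are held to the bounds.
            if n >= min_n and not (lo <= rate <= hi):
                violations.append((feature, value, n, rate))
    return violations
```

On data shaped like the v2 failure, a `reached_sql=0` group with no conversions and at least 50 rows comes back as a violation, while small groups are ignored however extreme their rate.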
+
+### Warning 2: Column redundancy
+
+- Warn if `inbound_touches + outbound_touches` correlates > 0.99 with any other column.
+
+### Warning 3: Low-variance features
+
+- Warn if any feature has < 3 unique values (excluding binary features).
+
+---
+
+## Integration with existing validation
+
+The engine-level `leadforge validate` harness (`validation/bundle_checks.py`) continues to validate the full Parquet bundle. The v4 dataset validator is a separate script for the simplified CSV output.
+
+If engine changes add new features to `LEAD_SNAPSHOT_FEATURES`, the existing `validation/realism.py` checks (non-negative counts, valid booleans, stage diversity) automatically cover them.
+
+---
+
+## Validator script interface
+
+```bash
+python scripts/validate_v4_dataset.py lead_scoring_intro/lead_scoring_intro_v4.csv
+```
+
+Exit code 0 = all mandatory checks pass. Exit code 1 = at least one failure.
+
+Output format: structured report showing each check name, status (PASS/FAIL/WARN), and details.
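
One possible shape for the report and exit-code logic described above is sketched here. The `CheckResult` dataclass, `format_report`, `exit_code`, and the bracketed line format are all illustrative assumptions; the spec fixes only the three status values and the 0/1 exit codes.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    status: str  # one of "PASS", "FAIL", "WARN"
    details: str = ""


def format_report(results):
    """Render one line per check: status, check name, details."""
    return "\n".join(f"[{r.status}] {r.name}: {r.details}" for r in results)


def exit_code(results):
    """Return 1 if any check failed, else 0. Warnings are non-fatal."""
    return 1 if any(r.status == "FAIL" for r in results) else 0
```

A script entry point would then print `format_report(results)` and call `sys.exit(exit_code(results))`, so that a run with only `WARN` results still exits with 0.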