diff --git a/.agent-plan.md b/.agent-plan.md index fd215c5..9226797 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -12,39 +12,35 @@ ## Next Up — v4 Lead Scoring Dataset -The primary focus is producing a v4 lead scoring dataset that fixes the issues found in v1–v3 datasets. This requires targeted engine changes followed by dataset build scripts. +The primary focus is producing a v4 lead scoring dataset that fixes the issues found in v1–v3 datasets. This requires targeted engine changes + a build pipeline, followed by dataset release. -See `docs/v4/implementation_plan.md` for full details. +See `docs/v4/design.md` for full details. -### v4-M0: Requirements + planning ⬜ (this PR) +### v4-M0: Planning + spike ⬜ (this PR) -- [x] `docs/v4/lead_scoring_v4_requirements.md` -- [x] `docs/v4/dataset_contract.md` -- [x] `docs/v4/validation_spec.md` -- [x] `docs/v4/engine_changes_spec.md` -- [x] `docs/v4/implementation_plan.md` -- [x] Updated `CLAUDE.md` with repo map + generation commands -- [x] Updated `AGENTS.md` with v4 implementation guide -- [x] Updated `.agent-plan.md` (this file) +- [x] `docs/v4/design.md` — consolidated requirements, contract, engine changes, implementation plan +- [x] `docs/v4/validation_spec.md` — automated validation checks +- [x] `docs/v4/planning_pr_review.md` — self-review and treatment plan +- [x] `scripts/spike_category_signal.py` — spike experiment validating category signal approach +- [x] Updated `CLAUDE.md`, `AGENTS.md`, `.agent-plan.md` -### v4-M1: Engine — category signal + windowed snapshots ⬜ +### v4-M1: Engine + build pipeline ⬜ -- [ ] Add `category_effect_scale` to difficulty profiles -- [ ] Apply scale in `mechanisms/policies.py` -- [ ] Add `snapshot_day` parameter to `render/snapshots.py` -- [ ] Add new features: `touches_week_1`, `days_since_first_touch`, `expected_acv` -- [ ] Add new `FeatureSpec` entries to `schema/features.py` -- [ ] Tests for all changes -- [ ] Verify category spread ≥15% for key features at intro 
difficulty +Engine changes: +- [ ] Add `category_latent_correlations` to `difficulty_profiles.yaml` (intro profile) +- [ ] Apply correlations in `simulation/population.py` after initial latent sampling +- [ ] Add `snapshot_day` parameter to `render/snapshots.py` with windowed aggregation +- [ ] Add new features: `touches_week_1`, `days_since_first_touch`, `expected_acv`, `total_touches_all` +- [ ] Add `FeatureSpec` entries to `schema/features.py` +- [ ] Tests for all engine changes (backward compat + v4 mode) -### v4-M2: Build pipeline + validation ⬜ - -- [ ] `scripts/build_v4_snapshot.py` — day-21 snapshot + leakage trap + structured missingness +Build pipeline: +- [ ] `scripts/build_v4_snapshot.py` — day-21 snapshot + leakage trap + structured missingness + subsampling - [ ] `scripts/validate_v4_dataset.py` — full validation per `docs/v4/validation_spec.md` -- [ ] Generate test dataset and verify all checks pass +- [ ] End-to-end: generate bundle → build CSV → validate → all checks pass - [ ] LR AUC 0.65–0.90 (without trap); ≥0.03 boost with trap -### v4-M3: Documentation + release ⬜ +### v4-M2: Documentation + release ⬜ - [ ] Generate `lead_scoring_intro_v4.csv` (in datasets-private repo) - [ ] Write `RELEASE_v4.md` @@ -62,7 +58,7 @@ See `docs/v4/implementation_plan.md` for full details. 
| M12: CLI `--json` flag | Deferred | No consumer needs it yet; add post-v4 | | M12: CLI `--strict` flag | Deferred | Per-check control is better than global flag | | M12: CLI help text polish | Deferred | Low priority vs dataset | -| M14: Sample bundle commit | Absorbed into v4-M3 | v4 dataset IS the sample | +| M14: Sample bundle commit | Absorbed into v4-M2 | v4 dataset IS the sample | | M14: Notebook 1 (inspecting world) | Deferred | Do after v4 ships | | M14: Notebook 2 (lead scoring baseline) | Deferred | v4 validation script covers this | | M14: Notebook 3 (public vs instructor) | Discarded | No current audience | @@ -83,11 +79,10 @@ See `docs/v4/implementation_plan.md` for full details. ## Context Pointers -- v4 requirements: `docs/v4/lead_scoring_v4_requirements.md` -- v4 dataset contract: `docs/v4/dataset_contract.md` -- v4 engine changes: `docs/v4/engine_changes_spec.md` +- v4 design (requirements, contract, engine changes, plan): `docs/v4/design.md` - v4 validation spec: `docs/v4/validation_spec.md` -- v4 implementation plan: `docs/v4/implementation_plan.md` +- v4 self-review: `docs/v4/planning_pr_review.md` +- Spike experiment: `scripts/spike_category_signal.py` - Existing roadmap: `docs/leadforge_implementation_plan.md` - CLI commands: `leadforge/cli/commands/` - Validation modules: `leadforge/validation/` diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 37495e0..56d794b 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -1,6 +1,6 @@ repos: - repo: https://github.com/astral-sh/ruff-pre-commit - rev: v0.4.5 + rev: v0.11.13 hooks: - id: ruff args: [--fix] diff --git a/AGENTS.md b/AGENTS.md index d67ced5..c72daa2 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -28,75 +28,6 @@ See CLAUDE.md for the full mandatory branch/PR workflow (branch → commit → u --- -## v4 Implementation Guide +## v4 Implementation -### What is v4? - -A pedagogically improved lead scoring dataset (single CSV) for an intro ML course. 
The engine changes are small and targeted. See `docs/v4/` for full specs. - -### Implementation order - -``` -v4-M0 (planning PR — already done) - └── v4-M1: engine changes (category signal + windowed snapshots) - └── v4-M2: build pipeline + validation scripts - └── v4-M3: dataset generation + release docs -``` - -### Key files to modify per milestone - -**v4-M1 (engine):** -- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_effect_scale` -- `leadforge/mechanisms/policies.py` — apply scale to categorical influences -- `leadforge/render/snapshots.py` — add `snapshot_day` param, windowed aggregation -- `leadforge/schema/features.py` — add new FeatureSpec entries -- Tests in `tests/mechanisms/` and `tests/render/` - -**v4-M2 (build pipeline):** -- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap -- `scripts/validate_v4_dataset.py` (new) — dataset-level validation -- These live in the leadforge repo (not datasets-private) - -**v4-M3 (release):** -- Work in `leadforge-datasets-private` repo -- `lead_scoring_intro/lead_scoring_intro_v4.csv` -- `lead_scoring_intro/RELEASE_v4.md` - -### Coding conventions for v4 - -1. **Backward compatibility:** All engine changes must default to current behavior. New parameters must have defaults that produce identical output when unset. -2. **No simulation loop changes:** Do not modify the daily step logic in `engine.py`. v4 changes are in mechanism weights and snapshot rendering only. -3. **Temporal correctness:** Every feature computation must be explicitly gated by snapshot day. Use `event_timestamp <= lead_created_at + snapshot_day` — never `<`. -4. **Test coverage:** Every new parameter and feature must have unit tests. Test both `snapshot_day=None` (backward compat) and `snapshot_day=21` (v4 mode). -5. **Determinism:** All new stochastic operations must use seeded RNG. Verify with a determinism test (same seed → identical output). 
- -### Validation checklist for v4 dataset - -Before declaring v4-M2 complete, the dataset must pass: - -- [ ] 1,000 rows, 18 columns -- [ ] 30% conversion rate (±1%) -- [ ] No deterministic groups (n≥50 at 0% or 100% conversion) -- [ ] LR AUC 0.65–0.90 (without leakage trap) -- [ ] LR AUC boost ≥0.03 when leakage trap included -- [ ] `web_sessions` missingness: outbound rate > 3× inbound rate -- [ ] `seniority` missingness: partner_referral rate > 3× others -- [ ] Reproducible with seed 42 -- [ ] `total_touches_all` uses full 90-day data (confirmed by AUC boost) - -### How to test engine changes locally - -```bash -# Quick smoke test: generate a small bundle and inspect -leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 --difficulty intro --n-leads 1000 --out /tmp/test_bundle -leadforge validate /tmp/test_bundle - -# Check category signal spread -python -c " -import pandas as pd -df = pd.read_parquet('/tmp/test_bundle/tasks/converted_within_90_days/train.parquet') -for col in ['role_function', 'seniority', 'estimated_revenue_band']: - rates = df.groupby(col)['converted_within_90_days'].mean() - print(f'{col}: spread={rates.max()-rates.min():.1%}') -" -``` +For v4 dataset design, engine changes, validation spec, and implementation plan, see `docs/v4/design.md` and `docs/v4/validation_spec.md`. diff --git a/CLAUDE.md b/CLAUDE.md index 1e3ae7e..c45eb93 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -349,11 +349,9 @@ Exception: deliberately included leakage traps (e.g., `total_touches_all` in v4) ## v4 Dataset Plan The current focus is producing a v4 lead scoring intro dataset. 
See `docs/v4/` for: -- `lead_scoring_v4_requirements.md` — what v4 must achieve -- `dataset_contract.md` — schema contract and temporal gates -- `engine_changes_spec.md` — what changes in the engine +- `design.md` — requirements, contract, engine changes, implementation plan (single source of truth) - `validation_spec.md` — automated validation checks -- `implementation_plan.md` — milestone breakdown +- `planning_pr_review.md` — self-review of the planning PR and treatment plan --- @@ -361,4 +359,4 @@ The current focus is producing a v4 lead scoring intro dataset. See `docs/v4/` f - Design decisions: `docs/leadforge_design_doc.md` - Architecture/spec: `docs/leadforge_architecture_spec.md` - Implementation roadmap: `docs/leadforge_implementation_plan.md` -- v4 dataset plan: `docs/v4/implementation_plan.md` +- v4 dataset plan: `docs/v4/design.md` diff --git a/docs/v4/dataset_contract.md b/docs/v4/dataset_contract.md deleted file mode 100644 index 4b7cb0a..0000000 --- a/docs/v4/dataset_contract.md +++ /dev/null @@ -1,72 +0,0 @@ -# v4 Dataset Contract - -## Snapshot definition - -- **Snapshot day:** Day 21 after `lead_created_at` (configurable, default 21). -- **Observation window:** Days 0–21 inclusive. All features computed from events in this window only. -- **Prediction horizon:** Days 22–90. The target `converted` reflects whether `closed_won` occurs in the full 90-day window. -- **Temporal guarantee:** No feature (except the explicitly marked leakage trap) uses information from after the snapshot day. - -## What is pre-snapshot (valid for features) - -| Data source | Temporal gate | -|---|---| -| Account attributes | Static — always valid | -| Contact attributes | Static — always valid | -| Lead metadata (source, etc.) 
| Lead creation — always valid | -| Touch events | `touch_timestamp ≤ lead_created_at + snapshot_day` | -| Session events | `session_timestamp ≤ lead_created_at + snapshot_day` | -| Sales activity events | `activity_timestamp ≤ lead_created_at + snapshot_day` | -| Opportunity records | `opportunity.created_at ≤ lead_created_at + snapshot_day` | -| ACV estimates | From opportunity if available by snapshot; else account heuristic | - -## What is post-snapshot (invalid for features) - -| Data | Why invalid | -|---|---| -| `current_stage` at day 90 | Contains `closed_won` / `closed_lost` — outcome data | -| `is_sql` (final state flag) | Engine invariant: `is_sql=False` → never converts. Deterministic. | -| `conversion_timestamp` | Direct outcome information | -| Touch/session/activity events after snapshot day | Future data | -| Opportunity close outcome | Post-outcome | -| `total_touches_all` | ⚠️ Intentional leakage trap — counts full 90-day touches | - -## Leakage trap contract - -The feature `total_touches_all` deliberately violates the snapshot boundary: -- It counts touches over the **full 90-day simulation**, not just up to snapshot. -- It is included to teach students about temporal leakage detection. -- It must be clearly marked in the feature dictionary and release notes. -- The validation script must detect it and flag it (but not fail the build). -- Removing this feature should drop AUC by ≥0.03. 
- -## Missingness contract - -| Column | Pattern | Rate | Condition | -|---|---|---|---| -| `days_since_last_touch` | Structural | Natural | NaN when `total touches == 0` by snapshot | -| `web_sessions` | Source-conditional | ~15% for `sdr_outbound`, ~2% for `inbound_marketing`, ~5% for `partner_referral` | CRM tracking gaps | -| `seniority` | Source-conditional | ~8% for `partner_referral`, ~1% for others | Referral partners omit contact details | -| `days_since_last_touch` | Additional MCAR | ~3% | Random CRM logging gaps (on top of structural) | - -## Target definition - -``` -converted = 1 if lead reached closed_won within 90 days of lead_created_at -converted = 0 otherwise (including closed_lost, still in funnel, churned) -``` - -The target is derived from simulated events, never directly sampled. - -## Subsampling contract - -- Source bundle: 5,000 leads generated with `b2b_saas_procurement_v1`, seed 42, difficulty intro. -- Stratified subsampling to 1,000 rows at ~30% conversion rate. -- All negatives retained (up to 700); positives downsampled. -- Subsampling preserves within-class feature distributions. - -## Reproducibility - -- Seed: 42 (or documented if changed). -- All stochastic operations use `np.random.RandomState(seed)` or derived substreams. -- Same (seed, recipe, leadforge version) → byte-identical CSV output. diff --git a/docs/v4/design.md b/docs/v4/design.md new file mode 100644 index 0000000..c6971ee --- /dev/null +++ b/docs/v4/design.md @@ -0,0 +1,324 @@ +# v4 Lead Scoring Dataset — Design Document + +> Single source of truth for the v4 dataset: requirements, contract, engine changes, and implementation plan. +> Validation checks are in the companion `validation_spec.md`. 
+ +--- + +## Prior versions and lessons + +| Version | Key issue | Lesson | +|---|---|---| +| v1 | `funnel_stage` contained `closed_won`/`closed_lost` — perfect leakage | Must validate that no single feature determines the target | +| v2 | Snapshot at day 90 with 90-day target — post-mortem, not prediction | Snapshot must be strictly earlier than outcome horizon | +| v2 | `reached_sql=0` → 0% conversion (n=127); `has_opportunity=1` → 0% (n=235) | Binary proxies from engine invariants create deterministic groups | +| v3 | Day-21 snapshot + non-deterministic proxies — clean but AUC only 0.62 | Engine's intro difficulty produces flat category effects; early features lack signal | + +--- + +## Requirements + +### R1 — Operational decision framing (expected ACV) + +Include an `expected_acv` numeric feature so students can compute `P(conversion) × expected_acv` and practice value-aware ranking. + +**ACV derivation (single source of truth):** + +| Condition | Value | +|---|---| +| Opportunity created by snapshot day | Opportunity's `estimated_acv` | +| No opportunity, `estimated_revenue_band` known | Band midpoint (see table below) | +| No opportunity, band unknown | NaN | + +**Revenue band → ACV midpoint mapping:** + +| Band | Midpoint ($k) | +|---|---| +| $1M–$10M | 25 | +| $10M–$50M | 55 | +| $50M–$200M | 85 | +| $200M+ | 140 | + +These midpoints are derived from the engine's `_EMPLOYEE_ACV_RANGES` in `simulation/engine.py`, which maps employee bands to ACV ranges. Since the dataset exposes `estimated_revenue_band` (not employee band), the midpoints approximate the overlap between revenue bands and the engine's ACV sampling. + +### R2 — Safe temporal / momentum features + +- `touches_week_1`: touches in days 0–7 after lead creation. Strictly pre-snapshot. +- `days_since_first_touch`: `snapshot_day - first_touch_day`. NaN if no touches. 
+ +### R3 — Structured missingness (MAR, not only MCAR) + +Three patterns, each with a pedagogical rationale: + +| Column | Pattern | Rates | Rationale | +|---|---|---|---| +| `days_since_last_touch` | Structural | NaN when no touches by snapshot | Natural — no event to measure from | +| `web_sessions` | Source-conditional | ~15% `sdr_outbound`, ~2% `inbound`, ~5% `partner` | CRM web tracking often not configured for outbound leads | +| `seniority` | Source-conditional | ~8% `partner_referral`, ~1% others | Referral partners don't always provide full contact details | +| `days_since_last_touch` | Additional MCAR | ~3% on top of structural | Random CRM logging gaps | + +**Why these specific rates:** They are chosen to be detectable at n≈1000 with a chi-squared test at p<0.01 (the outbound/inbound ratio for `web_sessions` is ~7.5×, well above the 3× detection threshold), but not so extreme that imputation becomes trivial. These are tunable parameters, not ground truth — the validation spec checks the *ratio* (>3×), not the exact rates. + +### R4 — Deliberate leakage trap + +`total_touches_all` counts ALL touches over the full 90-day window, violating the snapshot boundary. It is strongly predictive but not deterministic. Must be labeled in release notes and feature dictionary but NOT revealed in student-facing `BACKGROUND.md`. + +### R5 — Reduce redundancy + +Drop `total_touches` (= `inbound_touches + outbound_touches`). Keep the breakdown. + +### R6 — Stronger category signal + +See "Engine change 1" below. Target: ≥15% spread for at least two category features; baseline LR AUC 0.65–0.90. + +### R7 — Robust automated validation + +See `validation_spec.md`. 
+ +--- + +## Target column set + +| # | Column | Type | Source | Notes | +|---|---|---|---|---| +| 1 | `industry` | categorical | account | 4 values | +| 2 | `region` | categorical | account | US, UK | +| 3 | `company_size` | categorical | account | 4 bands | +| 4 | `company_revenue` | categorical | account | 4 bands | +| 5 | `contact_role` | categorical | contact | 4 roles | +| 6 | `seniority` | categorical | contact | 5 levels (~8% missing for partner_referral) | +| 7 | `lead_source` | categorical | lead | 3 channels | +| 8 | `opportunity_created` | binary 0/1 | derived | Opp opened by snapshot day | +| 9 | `demo_completed` | binary 0/1 | derived | Demo done by snapshot day | +| 10 | `expected_acv` | numeric | derived | See R1 ACV derivation table | +| 11 | `inbound_touches` | integer | events ≤ snapshot | Inbound touchpoints | +| 12 | `outbound_touches` | integer | events ≤ snapshot | Outbound touchpoints | +| 13 | `touches_week_1` | integer | events ≤ day 7 | First-week touch intensity | +| 14 | `web_sessions` | integer | events ≤ snapshot | Sessions (~15% missing for outbound) | +| 15 | `sales_activities` | integer | events ≤ snapshot | Sales activities count | +| 16 | `days_since_last_touch` | float | events ≤ snapshot | Natural NaN when no touches | +| 17 | `total_touches_all` | integer | **ALL events** | Leakage trap — full 90-day window | +| 18 | `converted` | binary 0/1 | target | Converted within 90 days | + +Total: 17 features + 1 target = 18 columns. + +--- + +## Snapshot contract + +- **Snapshot day:** 21 (configurable). +- **Observation window:** Days 0–21 inclusive. +- **Prediction horizon:** Days 22–90. +- **Temporal guarantee:** No feature except `total_touches_all` uses post-snapshot data. 
+ +| Data source | Temporal gate | +|---|---| +| Account/contact/lead attributes | Static — always valid | +| Touch/session/activity events | `timestamp ≤ lead_created_at + snapshot_day` | +| Opportunity records | `opportunity.created_at ≤ lead_created_at + snapshot_day` | + +### Target definition + +``` +converted = 1 if lead reached closed_won within 90 days of lead_created_at +converted = 0 otherwise (including closed_lost, still in funnel, churned) +``` + +Derived from simulated events, never directly sampled. + +### Subsampling + +- Source bundle: 5,000 leads, `b2b_saas_procurement_v1`, seed 42, difficulty intro. +- Stratified subsampling to 1,000 rows at ~30% conversion. +- Subsampling uses `np.random.RandomState(seed)` for reproducibility. + +--- + +## Engine changes + +### Change 1: Stronger category signal via population-level correlation + +#### Problem + +Observable categories (seniority, revenue band, lead source) are drawn **independently** from latent traits in `population.py`. The conversion hazard uses only latent traits (via `LatentScore`). Therefore category → conversion correlation is near-zero by construction. + +The v3 dataset confirms this: category spreads are 2–11% and LR AUC is 0.62. + +Note: `CategoricalInfluence` exists in `mechanisms/categorical.py` but is **never wired** into `assign_mechanisms()` or the simulation loop. The `MechanismContext` only passes latent traits, stage, time, and dwell days — not observable categories. + +#### Solution + +Correlate observable categories with latent traits during population generation. This is a population-layer change — no simulation loop modifications needed. 
+ +Add a `category_latent_correlations` mapping to the difficulty profile, applied in `build_population()` after initial latent sampling: + +| Observable | Latent trait | Boost per value | +|---|---|---| +| `seniority` | `latent_contact_authority` | individual_contributor: −0.27, manager: −0.09, director: +0.09, vp: +0.22, c_suite: +0.36 | +| `estimated_revenue_band` | `latent_account_fit` | $1M–$10M: −0.18, $10M–$50M: 0.0, $50M–$200M: +0.18, $200M+: +0.32 | +| `lead_source` | `latent_engagement_propensity` | sdr_outbound: −0.14, inbound_marketing: +0.09, partner_referral: +0.22 | + +These are the scale=1.8 boosts from the spike experiment (`scripts/spike_category_signal.py`). + +#### Spike experiment results (seed 42, 5000 leads, fit_dominant motif) + +| Setting | AUC | seniority spread | revenue spread | role_function spread | +|---|---|---|---|---| +| Baseline (no correlation) | 0.663 | 9.5% | 10.8% | 11.0% | +| Scale 1.0 | 0.650 | 5.5% | 10.3% | 3.7% | +| **Scale 1.8** | **0.694** | **22.1%** | **15.2%** | 1.7% | +| Scale 2.5 | 0.701 | 11.9% | 15.1% | 3.2% | + +**Observations:** +- Scale 1.8 gives AUC 0.694, within the [0.65, 0.90] target. +- `seniority` and `estimated_revenue_band` exceed the 15% spread target. +- `role_function` gets **no boost** from this approach because there is no natural latent trait to correlate it with. The spike shows role_function spread is driven entirely by noise and varies widely across runs (1.7%–11%). +- The `fit_dominant` motif gives zero weight to `latent_contact_authority`, so the seniority boost only works through its indirect correlation with other traits at population level. Different motif families will produce different spread profiles. +- At scale 2.5, seniority spread *decreases* (11.9%) due to [0, 1] clamp saturation. + +#### Important caveats + +1. The spike tested only `fit_dominant` motif (seed 42). Other motifs weight different latent traits, so the same boosts will produce different category spreads. 
The implementation should test across all 5 motif families. +2. `role_function` signal remains weak. If role_function spread ≥15% is required, a separate mechanism is needed (either role-specific latent biases in `population.py` or wiring `CategoricalInfluence` into the conversion score, which would require sim loop changes). For v4, we accept that not all categories will have strong signal. +3. The boost values are empirical, not principled. They should be treated as starting points, not final values. + +#### Files affected + +- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_latent_correlations` +- `leadforge/simulation/population.py` — apply correlations after initial latent sampling +- `tests/simulation/test_population.py` — test correlation application, backward compat + +### Change 2: Windowed snapshot (`snapshot_day` parameter) + +Add `snapshot_day: int | None` to `build_snapshot()`. When set, all event aggregations filter to `timestamp ≤ lead_created_at + snapshot_day`. Default `None` preserves current behavior (full horizon). + +New features computed in the snapshot: +- `touches_week_1` — count touches where `days_after_creation ≤ 7` +- `days_since_first_touch` — `snapshot_day - first_touch_day` (NaN if no touches) +- `expected_acv` — see R1 derivation table above +- `total_touches_all` — full-horizon touch count (ignoring snapshot gate) + +#### Files affected + +- `leadforge/render/snapshots.py` — add `snapshot_day`, windowed filtering, new features +- `leadforge/schema/features.py` — add `FeatureSpec` entries +- `tests/render/test_snapshots.py` — test windowed aggregation + +### Change 3: Structured missingness + leakage trap (build script only) + +These are NOT engine changes. The build script (`scripts/build_v4_snapshot.py`) applies missingness injection and computes the leakage trap after snapshot construction. 
+ +#### Files affected + +- `scripts/build_v4_snapshot.py` (new) +- No changes to `leadforge/` core for missingness + +--- + +## Known limitations and workarounds + +### `is_sql=False → never converts` (engine invariant) + +The simulation engine requires leads to pass through the SQL stage before converting. This creates a deterministic group: `is_sql=False` → 0% conversion. v4 works around this by: + +- **Excluding** `is_sql` and `reached_sql` from the column set entirely +- Using `opportunity_created` and `demo_completed` as non-deterministic binary proxies instead + +A proper fix would modify the conversion hazard in `engine.py` to allow rare direct conversions. This is tracked as a deferred item; it would benefit v5+ but is out of scope for v4. + +### `role_function` lacks signal + +The population-level correlation approach provides no mechanism for `role_function` to influence conversion (there is no natural latent trait to map it to). `role_function` spread in the v4 dataset will be noise-driven (1.7–11% in the spike results). This is acceptable for an intro course but should be addressed in a future engine revision. 
+ +--- + +## Tuning protocol + +If validation checks fail during implementation, use these adjustments: + +| Failure | Adjustment | +|---|---| +| AUC < 0.65 | Increase boost scale (try 2.0, 2.5, 3.0) | +| AUC > 0.90 | Decrease boost scale or add noise to latent correlations | +| Leakage trap boost < 0.03 | Widen snapshot gap (try day 14 instead of 21) to increase information delta | +| Subsampling destroys signal | Increase `n_leads` from 5000 to 10000 before subsampling | +| Category spread < 15% for seniority/revenue | Increase individual boost magnitudes for that feature | +| Deterministic group detected | Check which feature/value, adjust boost or drop the feature | + +--- + +## Implementation plan + +### Milestone structure + +v4 work is split into **two implementation milestones** plus the planning PR: + +``` +v4-M0 (this PR — planning + spike) + └── v4-M1: engine + build pipeline (single PR) + └── v4-M2: dataset generation + release docs +``` + +v4-M1 merges the engine changes and build pipeline into one milestone because: +- The engine change (population-level correlations) cannot be validated without the build script +- The build script depends on the snapshot_day parameter from the engine change +- A single PR with both, validated end-to-end, is more reviewable + +### v4-M1: Engine + build pipeline + +**Deliverables:** +1. `difficulty_profiles.yaml` — `category_latent_correlations` for intro profile +2. `simulation/population.py` — apply correlations during population generation +3. `render/snapshots.py` — `snapshot_day` parameter, windowed aggregation, new features +4. `schema/features.py` — new `FeatureSpec` entries +5. `scripts/build_v4_snapshot.py` — day-21 snapshot + missingness + leakage trap + subsampling +6. `scripts/validate_v4_dataset.py` — validation per `validation_spec.md` +7. 
Tests for all changes + +**Acceptance criteria:** +- [ ] No correlation (`category_latent_correlations` absent or empty) → identical output to current engine +- [ ] Scale 1.8 correlations → seniority and revenue_band spread ≥15% +- [ ] `snapshot_day=21` correctly filters events +- [ ] `touches_week_1` counts only days 0–7 +- [ ] `expected_acv` uses ACV derivation table +- [ ] Build script produces 1000 rows × 18 columns at 30% conversion +- [ ] Validation script passes all mandatory checks +- [ ] LR AUC in [0.65, 0.90] without trap; ≥0.03 boost with trap +- [ ] All existing tests pass +- [ ] Reproducible with seed 42 + +### v4-M2: Documentation + release + +**Deliverables (in leadforge-datasets-private):** +1. `lead_scoring_intro/lead_scoring_intro_v4.csv` +2. `lead_scoring_intro/RELEASE_v4.md` +3. Updated README + +**Deliverables (in leadforge):** +1. Updated `.agent-plan.md` + +**Acceptance criteria:** +- [ ] CSV passes all validation checks +- [ ] RELEASE_v4.md documents snapshot day, target definition, changes from v3, leakage trap +- [ ] Previous versions marked as superseded + +### Relationship to existing roadmap + +| Existing milestone | v4 interaction | +|---|---| +| M0–M11 | Complete, no changes | +| M12 (CLI polish) | **Deferred** — low priority vs v4 | +| M14 (Sample datasets + notebooks) | **Absorbed** — v4 dataset IS the sample | +| M15 (Docs polish + v1.0 RC) | **Deferred** — do after v4 | + +Discarded: M14 notebooks 3–4 (no current audience). + +--- + +## Non-goals + +- v4 does NOT modify the simulation loop (`engine.py` daily step logic). +- v4 does NOT change the relational bundle format or task splits. +- v4 does NOT add new recipes. +- v4 does NOT change exposure modes. +- v4 does NOT fix the `is_sql=False → never converts` invariant (deferred). 
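A minimal sketch of how the `category_latent_correlations` table from `docs/v4/design.md` might be applied to one lead, assuming an additive boost followed by a clamp to [0, 1]. The boost values are copied from the design table (scale 1.8); the function name `apply_category_boosts`, the dict-based lead representation, and the additive form are assumptions for illustration and are not taken from `simulation/population.py`.

```python
from typing import Dict, Mapping, Tuple

# observable category -> (latent trait, boost per category value).
# Values copied from the "Boost per value" table in docs/v4/design.md.
CATEGORY_LATENT_CORRELATIONS: Dict[str, Tuple[str, Dict[str, float]]] = {
    "seniority": ("latent_contact_authority", {
        "individual_contributor": -0.27, "manager": -0.09,
        "director": 0.09, "vp": 0.22, "c_suite": 0.36,
    }),
    "estimated_revenue_band": ("latent_account_fit", {
        "$1M–$10M": -0.18, "$10M–$50M": 0.0,
        "$50M–$200M": 0.18, "$200M+": 0.32,
    }),
    "lead_source": ("latent_engagement_propensity", {
        "sdr_outbound": -0.14, "inbound_marketing": 0.09,
        "partner_referral": 0.22,
    }),
}


def apply_category_boosts(
    lead: Mapping[str, object],
    correlations: Mapping[str, Tuple[str, Mapping[str, float]]] = CATEGORY_LATENT_CORRELATIONS,
) -> Dict[str, object]:
    """Return a copy of `lead` with latent traits shifted by category boosts.

    Hypothetical helper: assumes latent traits live in [0, 1] and that the
    boost is added, then clamped. Unknown or missing category values get
    a boost of 0.0.
    """
    boosted = dict(lead)
    for observable, (trait, boosts) in correlations.items():
        boost = boosts.get(lead.get(observable), 0.0)
        boosted[trait] = min(1.0, max(0.0, float(lead[trait]) + boost))
    return boosted


example_lead = {
    "seniority": "c_suite",
    "estimated_revenue_band": "$10M–$50M",
    "lead_source": "sdr_outbound",
    "latent_contact_authority": 0.5,
    "latent_account_fit": 0.4,
    "latent_engagement_propensity": 0.1,
}
boosted_lead = apply_category_boosts(example_lead)
unchanged_lead = apply_category_boosts(example_lead, correlations={})
```

Passing an empty mapping leaves the lead untouched, which mirrors the backward-compatibility acceptance criterion (absent or empty `category_latent_correlations` yields current behavior). The clamp is also where the saturation noted for scale 2.5 would show up: once a trait hits 0 or 1, larger boosts add no further separation.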
diff --git a/docs/v4/engine_changes_spec.md b/docs/v4/engine_changes_spec.md deleted file mode 100644 index d3712ca..0000000 --- a/docs/v4/engine_changes_spec.md +++ /dev/null @@ -1,170 +0,0 @@ -# v4 Engine Changes Specification - -## Overview - -v4 requires **two categories** of changes to the leadforge codebase: -1. **Mechanism / difficulty tuning** — make intro difficulty produce stronger category-level signal. -2. **Snapshot builder enhancements** — compute windowed aggregates, ACV derivation, structured missingness, and the leakage trap feature. - -Neither category requires changes to the simulation loop itself (`engine.py`'s daily step logic). The simulation produces the same event stream; we change how features are derived from it. - ---- - -## Change 1: Stronger category signal at intro difficulty - -### Problem - -The current mechanism policy (`mechanisms/policies.py`) produces conversion rates that are nearly uniform across categories at intro difficulty. For example, `contact_role` spreads only 11% (25.6%–36.7% after subsampling). This yields LR AUC ~0.62, which is too low for a useful teaching dataset. - -### Root cause - -The `assign_mechanisms()` function builds a `LatentScore` with weights that are quite flat across categories. The intro difficulty profile specifies `signal_strength: 0.90` but this controls noise scale, not the magnitude of category effects. - -### Solution - -Add **category effect multipliers** to the difficulty profile YAML: - -```yaml -intro: - # ... existing fields ... - category_effect_scale: 1.8 # amplify category → latent score effects -``` - -In `mechanisms/policies.py`, scale the `CategoricalInfluence` weights by `category_effect_scale` when building the `LatentScore`. This widens the gap between, say, `vp_finance` and `it_director` conversion rates without changing the overall noise structure. 
- -### Target outcome - -After this change + subsampling to 30%, category spreads should be: -- `contact_role`: ≥15% spread -- `company_revenue`: ≥12% spread -- `seniority`: ≥10% spread -- Baseline LR AUC: 0.70–0.85 - -### Files affected - -- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_effect_scale` -- `leadforge/mechanisms/policies.py` — use `category_effect_scale` when building categorical influences -- `tests/mechanisms/test_policies.py` — test that different scales produce different spread - -### Risk - -Low. The change is additive (new config field with a default of 1.0 for backward compatibility). Existing tests continue to pass at `category_effect_scale=1.0`. - ---- - -## Change 2: Snapshot builder — windowed aggregates and new features - -### Problem - -The current `render/snapshots.py` computes all aggregates over the full simulation horizon. v4 needs aggregates gated by a configurable snapshot day, plus new derived features. - -### Solution - -Add a new function or extend `build_snapshot()` to accept a `snapshot_day` parameter: - -```python -def build_snapshot( - result: SimulationResult, - population: PopulationResult, - horizon_days: int = 90, - snapshot_day: int | None = None, # NEW — default None means use horizon_days -) -> pd.DataFrame: -``` - -When `snapshot_day` is set, all event aggregations filter to events within `[lead_created_at, lead_created_at + snapshot_day]`. 
- -### New features to compute - -| Feature | Computation | Notes | -|---|---|---| -| `touches_week_1` | Count touches where `days_after_creation ≤ 7` | Momentum signal | -| `days_since_first_touch` | `snapshot_day - first_touch_day` (NaN if no touches) | Lead age signal | -| `expected_acv` | Opportunity ACV if opp created by snapshot; else employee_band midpoint | Value feature | -| `total_touches_all` | Count of ALL touches over full horizon (ignoring snapshot gate) | Leakage trap | - -### Files affected - -- `leadforge/render/snapshots.py` — add `snapshot_day` parameter, windowed filtering, new feature computations -- `leadforge/schema/features.py` — add new `FeatureSpec` entries for the new columns -- `tests/render/test_snapshots.py` — test windowed aggregation correctness - -### Risk - -Medium. The snapshot builder is well-tested but core to correctness. The `snapshot_day` parameter should be additive (default `None` preserves existing behavior). New features are computed alongside existing ones. - ---- - -## Change 3: Structured missingness injection - -### Problem - -Current missingness is MCAR (random injection). v4 needs conditional missingness. - -### Solution - -Add a missingness injection step to the v4 build script (NOT to the engine's `build_snapshot`). This keeps the engine's output clean and makes missingness a dataset-packaging concern. - -The build script (`scripts/build_v4_snapshot.py`) applies missingness after snapshot construction: - -```python -def inject_missingness(df: pd.DataFrame, rng: np.random.RandomState) -> pd.DataFrame: - # 1. web_sessions: 15% missing for sdr_outbound, 2% inbound, 5% partner - for source, rate in [("sdr_outbound", 0.15), ("inbound_marketing", 0.02), ("partner_referral", 0.05)]: - mask = (df["lead_source"] == source) & (rng.random(len(df)) < rate) - df.loc[mask, "web_sessions"] = np.nan - - # 2. 
seniority: 8% missing for partner_referral, 1% for others - partner_mask = (df["lead_source"] == "partner_referral") & (rng.random(len(df)) < 0.08) - other_mask = (df["lead_source"] != "partner_referral") & (rng.random(len(df)) < 0.01) - df.loc[partner_mask | other_mask, "seniority"] = np.nan - - # 3. days_since_last_touch: additional 3% MCAR on top of structural NaN - dslt_mask = rng.random(len(df)) < 0.03 - df.loc[dslt_mask, "days_since_last_touch"] = np.nan - - return df -``` - -### Files affected - -- `scripts/build_v4_snapshot.py` (new) — missingness injection -- No changes to `leadforge/` core modules for missingness - -### Risk - -Low. Missingness is applied post-generation, outside the engine. - ---- - -## Change 4: Leakage trap feature - -### Problem - -Students need a feature that looks valid but violates temporal boundaries. - -### Solution - -The v4 build script computes `total_touches_all` by counting ALL touches in the full 90-day window (not gated by snapshot day). This is computed alongside the snapshot but uses different temporal filtering. - -### Files affected - -- `scripts/build_v4_snapshot.py` — compute `total_touches_all` from full event stream -- Feature dictionary and release notes — mark as leakage trap - -### Risk - -None to the engine. The trap is a build-script concern. - ---- - -## Summary of engine vs. script changes - -| Change | Where | Risk | -|---|---|---| -| Category effect scaling | `leadforge/` core (mechanisms, difficulty profiles) | Low | -| Snapshot `snapshot_day` parameter | `leadforge/` core (render/snapshots) | Medium | -| New features (ACV, momentum, first_touch) | `leadforge/` core (render/snapshots, schema/features) | Medium | -| Structured missingness | `scripts/` (build script only) | Low | -| Leakage trap | `scripts/` (build script only) | None | - -Total engine-side changes: ~200–400 lines across 4–5 files. Build script: ~250 lines new. 
diff --git a/docs/v4/implementation_plan.md b/docs/v4/implementation_plan.md deleted file mode 100644 index 89b8238..0000000 --- a/docs/v4/implementation_plan.md +++ /dev/null @@ -1,172 +0,0 @@ -# v4 Implementation Plan - -## Overview - -This plan implements the v4 lead scoring dataset in 4 milestones across 4–6 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria. - -## Relationship to existing roadmap - -v4 work slots into the existing leadforge roadmap as follows: - -| Existing milestone | Status | v4 interaction | -|---|---|---| -| M0–M11 | ✅ Complete | No changes needed | -| M12 (CLI polish) | ⬜ Planned | **Deferred** — low priority vs v4 dataset needs. Integrate after v4. | -| M13 (Validation harness) | ✅ Implemented as M11 | v4 extends with dataset-level validation | -| M14 (Sample datasets + notebooks) | ⬜ Planned | **Absorbed into v4-M3** — v4 dataset IS the sample dataset | -| M15 (Docs polish + v1.0 RC) | ⬜ Planned | **Deferred** — do after v4 ships | - -### Explicitly discarded items - -| Item | Rationale | -|---|---| -| M12 `--json` flag for inspect/validate | Nice-to-have; no dataset consumer needs it yet. Can add later. | -| M12 `--strict` flag for validate | Validation strictness is better controlled per-check, not globally. | -| M14 Notebook 3 (public vs instructor comparison) | No current audience for this; instructor mode is not used in the course. | -| M14 Notebook 4 (recipe customization walkthrough) | Premature — recipe system is stable but not user-facing yet. 
| - -### Explicitly kept / integrated items - -| Item | How it maps to v4 | -|---|---| -| M14 Sample bundle generation | v4-M2 generates the source bundle | -| M14 Lead-scoring baseline notebook | v4-M3 includes a validation notebook or script | -| M15 Docs audit | v4-M0 updates CLAUDE.md and AGENTS.md; v4-M3 produces RELEASE_v4.md | - ---- - -## v4 Milestones - -### v4-M0: Requirements, contract, and agent instructions - -**Goal:** Establish the v4 dataset contract and update repo documentation so implementation can begin immediately. - -**Deliverables:** -- `docs/v4/lead_scoring_v4_requirements.md` — full requirements -- `docs/v4/dataset_contract.md` — schema contract, temporal gates, missingness -- `docs/v4/validation_spec.md` — automated check specifications -- `docs/v4/engine_changes_spec.md` — what changes where and why -- `docs/v4/implementation_plan.md` — this file -- Updated `CLAUDE.md` — repository map, generation/validation commands -- Updated `AGENTS.md` — implementation conventions for v4 work -- Updated `.agent-plan.md` — reflects v4 as next work - -**Acceptance criteria:** -- [ ] All docs are internally consistent -- [ ] CLAUDE.md contains repo map and commands -- [ ] .agent-plan.md points to v4 milestones -- [ ] No contradictions with existing architecture docs - -**PR:** This PR (the planning PR). - ---- - -### v4-M1: Engine — category signal tuning + snapshot enhancements - -**Goal:** Make the engine produce datasets with stronger category signal and support windowed snapshot computation. - -**Deliverables:** -1. `difficulty_profiles.yaml` — add `category_effect_scale: 1.8` to intro profile -2. `mechanisms/policies.py` — apply `category_effect_scale` to categorical influence weights -3. `render/snapshots.py` — add optional `snapshot_day` parameter for windowed aggregation -4. `schema/features.py` — add `FeatureSpec` entries for new columns (`touches_week_1`, `days_since_first_touch`, `expected_acv`) -5. 
Tests for all changes - -**Acceptance criteria:** -- [ ] `category_effect_scale=1.0` produces identical output to current engine (backward compat) -- [ ] `category_effect_scale=1.8` produces category spreads ≥15% for `contact_role` -- [ ] `snapshot_day=21` correctly filters events to first 21 days -- [ ] `touches_week_1` counts only days 0–7 touches -- [ ] `expected_acv` uses opportunity ACV when available, else band midpoint -- [ ] All existing tests pass -- [ ] New tests cover the new parameters - -**Estimated size:** ~400 lines diff across 5 files + tests. - -**PR:** Single PR: `feat: v4 engine — category signal tuning + windowed snapshots` - ---- - -### v4-M2: Build pipeline — v4 snapshot builder + structured missingness - -**Goal:** Create the v4 build script that transforms a generated bundle into the final CSV. - -**Deliverables:** -1. `scripts/build_v4_snapshot.py` — snapshot builder with: - - Day-21 windowed features - - Leakage trap feature (`total_touches_all`) - - Structured missingness injection - - Stratified subsampling to 1,000 rows / 30% conversion - - Column selection and renaming -2. `scripts/validate_v4_dataset.py` — validation script per validation spec -3. Generated `lead_scoring_intro_v4.csv` (in datasets repo, not leadforge) - -**Acceptance criteria:** -- [ ] Build script produces 1,000 rows × 18 columns -- [ ] Conversion rate is 30% (±1%) -- [ ] `total_touches_all` uses full 90-day data (leakage trap) -- [ ] `web_sessions` missing rate for outbound > 3× inbound rate -- [ ] `seniority` missing rate for partner_referral > 3× others -- [ ] `days_since_last_touch` has structural + injected NaNs -- [ ] Validation script passes all mandatory checks -- [ ] Baseline LR AUC (without trap) in [0.65, 0.90] -- [ ] LR AUC boost with trap ≥ 0.03 -- [ ] No deterministic groups (n≥50 at 0% or 100%) -- [ ] Reproducible with seed 42 - -**Estimated size:** ~350 lines (build script) + ~200 lines (validator). 
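The "no deterministic groups" criterion above could be checked along these lines. This is a sketch under assumptions: rows are plain dicts, the target column is named `converted`, and `find_deterministic_groups` is a hypothetical helper, not part of `scripts/validate_v4_dataset.py`.

```python
from collections import defaultdict


def find_deterministic_groups(rows, feature_cols, target="converted", min_n=50):
    """Return (feature, value, n) for groups with n >= min_n that convert at 0% or 100%."""
    stats = defaultdict(lambda: [0, 0])  # (feature, value) -> [row count, positives]
    for row in rows:
        for col in feature_cols:
            value = row.get(col)
            if value is None:
                continue  # missing values do not form a group
            entry = stats[(col, value)]
            entry[0] += 1
            entry[1] += int(row[target])
    flagged = [
        (col, value, n)
        for (col, value), (n, positives) in stats.items()
        if n >= min_n and positives in (0, n)
    ]
    return sorted(flagged, key=lambda item: (item[0], str(item[1])))
```

An empty return value means the check passes; any returned tuple names a feature value that fully determines the target at the stated sample size.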
- -**PR:** Single PR: `feat: v4 build pipeline + validation` - ---- - -### v4-M3: Documentation + release - -**Goal:** Produce the final dataset files and release documentation. - -**Deliverables (in leadforge-datasets-private repo):** -1. `lead_scoring_intro/lead_scoring_intro_v4.csv` -2. `lead_scoring_intro/RELEASE_v4.md` -3. Updated `lead_scoring_intro/BACKGROUND.md` (if needed for v4 framing) -4. Updated `README.md` (dataset index) - -**Deliverables (in leadforge repo):** -1. Updated `.agent-plan.md` reflecting completion - -**Acceptance criteria:** -- [ ] CSV passes all validation checks -- [ ] RELEASE_v4.md documents snapshot day, target definition, changes from v3, leakage trap -- [ ] README in datasets repo marks v4 as recommended -- [ ] Previous versions marked as superseded - -**PR:** Two PRs (one per repo). - ---- - -## Dependency graph - -``` -v4-M0 (this PR) - └── v4-M1 (engine changes) - └── v4-M2 (build pipeline + validation) - └── v4-M3 (docs + release) -``` - -Strictly sequential — each milestone depends on the previous. - ---- - -## Timeline estimate - -Not providing time estimates per project convention. The work is 4 PRs of moderate size (~300–500 lines each). 
- ---- - -## What this plan does NOT do - -- Does not change the simulation loop (`engine.py` daily step logic) -- Does not change the relational bundle format -- Does not change exposure modes -- Does not add new recipes -- Does not implement M12 (CLI polish) — deferred -- Does not implement the engine fix for `is_sql=False → never converts` (deferred to a separate issue; v4 avoids `is_sql` entirely) diff --git a/docs/v4/lead_scoring_v4_requirements.md b/docs/v4/lead_scoring_v4_requirements.md deleted file mode 100644 index ff2e146..0000000 --- a/docs/v4/lead_scoring_v4_requirements.md +++ /dev/null @@ -1,131 +0,0 @@ -# Lead Scoring Dataset v4 — Requirements - -## Purpose - -This document defines the requirements for the **v4 lead scoring intro dataset**, the primary pedagogical output of leadforge for a BA-level intro ML course. It is informed by three prior dataset iterations (v1–v3) and the lessons learned from each. - -## Prior version history and lessons - -| Version | Key issue | What we learned | -|---|---|---| -| v1 | `funnel_stage` contained `closed_won`/`closed_lost` — perfect leakage | Must validate that no single feature determines the target | -| v2 | Snapshot at day 90 with 90-day target — post-mortem, not prediction | Snapshot must be strictly earlier than outcome horizon | -| v2 | `reached_sql=0` → 0% conversion (n=127); `has_opportunity=1` → 0% (n=235) | Binary proxies from engine invariants create deterministic groups | -| v3 | Day-21 snapshot + non-deterministic proxies — clean but AUC only 0.62 | Engine's intro difficulty produces flat category effects; early features lack signal | - -## v4 requirements - -### R1 — Operational decision framing (capacity + value) - -**Problem:** v1–v3 frame lead scoring as pure classification. Real lead scoring is a **decision tool** — ranking leads by expected value, not just probability. 
- -**Requirement:** -- Include an `expected_acv` numeric feature (estimated annual contract value) available at snapshot time. -- The feature must be derived from the opportunity table (for leads with an opportunity by snapshot) or from account-level heuristics (employee band → ACV range midpoint) for leads without one. -- This enables students to compute `expected_value = P(conversion) × expected_acv` and practice ranking/top-K selection. - -**Engine change needed:** The snapshot builder must join opportunity ACV data gated by snapshot day, with a fallback to account-band heuristic ACV. - -### R2 — Safe temporal / momentum features - -**Problem:** v1–v3 engagement features are cumulative counts with no temporal shape. Real lead scoring uses recency and momentum signals. - -**Requirement:** -- Include exactly one momentum feature: `touches_week_1` (touches in days 0–7 after lead creation). -- This is strictly pre-snapshot (snapshot is at day 21+) and gives students a "first-week intensity" signal to compare against total touches. -- Additionally, `days_since_first_touch` (snapshot_day minus day of first touch) provides a lead-age signal. - -**Engine change needed:** The snapshot builder must compute windowed aggregates from event timestamps. - -### R3 — Structured missingness (not only MCAR) - -**Problem:** v1–v3 inject missingness randomly (MCAR). Real CRM data has structured gaps. - -**Requirement:** Implement three missingness patterns: -1. **Natural (structural):** `days_since_last_touch` is NaN when `total_touches == 0` (no touches recorded). Already exists but must be preserved. -2. **Conditional on source:** `web_sessions` is missing for ~15% of `sdr_outbound` leads (CRM tracking often not set up for outbound-sourced leads) but only ~2% of `inbound_marketing` leads. -3. **Role data gap:** `seniority` is missing for ~8% of `partner_referral` leads (referral partners don't always provide full contact details). 
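The source-conditional pattern (2) and the matching "outbound > 3× inbound" validation check can be illustrated with the standard library alone. This is a sketch, assuming rows are plain dicts and using the rates quoted in the v4 docs; `inject_web_sessions_missingness` and `missing_rate` are hypothetical helper names.

```python
import random

# Rates quoted in the v4 docs; treat them as tunable, not ground truth.
WEB_SESSIONS_MISSING_RATE = {
    "sdr_outbound": 0.15,
    "inbound_marketing": 0.02,
    "partner_referral": 0.05,
}


def inject_web_sessions_missingness(rows, rng):
    """Return copies of rows with web_sessions set to None at a source-dependent rate."""
    out = []
    for row in rows:
        rate = WEB_SESSIONS_MISSING_RATE.get(row["lead_source"], 0.0)
        new_row = dict(row)  # copy so the caller's rows are left untouched
        if rng.random() < rate:
            new_row["web_sessions"] = None
        out.append(new_row)
    return out


def missing_rate(rows, source):
    """Fraction of rows from one lead source whose web_sessions is missing."""
    group = [r for r in rows if r["lead_source"] == source]
    if not group:
        return 0.0
    return sum(r["web_sessions"] is None for r in group) / len(group)
```

With a seeded `random.Random`, the same input always yields the same missingness mask, which is what the reproducibility check needs.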
- -**Engine change needed:** Missingness injection in the snapshot builder, conditioned on feature values. - -### R4 — Deliberate leakage trap - -**Problem:** Students need to practice identifying leakage, but v1–v3 either have accidental leakage (bad) or none at all (missed teaching opportunity). - -**Requirement:** -- Include one feature `total_touches_all` that counts **all** touches over the full 90-day window, not just up to snapshot. -- This feature is strongly predictive (uses future data) but not perfectly deterministic (it correlates with but doesn't fully determine conversion). -- The feature MUST be clearly labeled as "intentionally invalid — included for leakage discussion" in `RELEASE_v4.md` and the feature dictionary. -- The validation script must flag it, but the v4 build script intentionally includes it. -- The `BACKGROUND.md` / student instructions must NOT reveal the trap — students should discover it through EDA. - -**Engine change needed:** The snapshot builder computes a second touch count using the full horizon. - -### R5 — Reduce redundancy - -**Problem:** `total_touches = inbound_touches + outbound_touches` is a perfect linear dependency. Students may be confused by it, or models waste a degree of freedom. - -**Requirement:** -- Drop `total_touches` from v4. Keep `inbound_touches` and `outbound_touches` as the touch breakdown. -- Note: `total_touches_all` (the leakage trap from R4) is a different feature and is kept. -- Document this as a teaching point: "you can derive total from inbound + outbound." - -### R6 — Stronger category signal - -**Problem:** At intro difficulty, category conversion rates span only 2–11%. This makes the dataset nearly impossible to model well (AUC ~0.62). - -**Requirement:** -- The engine must produce category-level conversion rate spreads of at least 15–25% for key features (`contact_role`, `company_revenue`, `seniority`). -- Target baseline LR AUC: **0.70–0.85** (after snapshot + subsampling). 
-- This requires engine changes to the difficulty profile or mechanism weights, not just post-hoc manipulation. - -**Engine change needed:** Adjust intro difficulty profile or mechanism policy to produce wider category effects. - -### R7 — Robust automated validation - -**Requirement:** The v4 dataset must pass all of the following automated checks: - -| Check | Criterion | -|---|---| -| No banned columns | No `current_stage`, `funnel_stage`, `conversion_timestamp`, `is_sql` | -| No deterministic groups | For every feature value with n≥50: conversion rate in [2%, 98%] | -| Conversion rate | In [15%, 40%] | -| Baseline LR AUC | In [0.65, 0.90] (all features except leakage trap) | -| Leakage trap AUC boost | AUC with trap > AUC without trap by ≥0.03 | -| Missingness per column | Each column with nulls: 1–15% missing | -| Missingness structure | `web_sessions` missing rate for `sdr_outbound` > 3× rate for `inbound_marketing` | -| Row count | Exactly 1,000 | -| Column count | 16–18 (features + target) | -| Reproducibility | Same seed → identical output | - -## v4 target column set - -| # | Column | Type | Source | Notes | -|---|---|---|---|---| -| 1 | `industry` | categorical | account | 4 values | -| 2 | `region` | categorical | account | US, UK | -| 3 | `company_size` | categorical | account | 4 bands | -| 4 | `company_revenue` | categorical | account | 4 bands | -| 5 | `contact_role` | categorical | contact | 4 roles | -| 6 | `seniority` | categorical | contact | 5 levels (~8% missing for partner_referral) | -| 7 | `lead_source` | categorical | lead | 3 channels | -| 8 | `opportunity_created` | binary 0/1 | derived | Opp opened by snapshot day | -| 9 | `demo_completed` | binary 0/1 | derived | Demo done by snapshot day | -| 10 | `expected_acv` | numeric | derived | Opp ACV if available, else band midpoint (R1) | -| 11 | `inbound_touches` | integer | events ≤ snapshot | Inbound touchpoints | -| 12 | `outbound_touches` | integer | events ≤ snapshot | Outbound 
touchpoints | -| 13 | `touches_week_1` | integer | events ≤ day 7 | First-week touch intensity (R2) | -| 14 | `web_sessions` | integer | events ≤ snapshot | Sessions (~15% missing for outbound, ~2% inbound) | -| 15 | `sales_activities` | integer | events ≤ snapshot | Sales activities count | -| 16 | `days_since_last_touch` | float | events ≤ snapshot | Natural NaN when no touches | -| 17 | `total_touches_all` | integer | **ALL events** | ⚠️ LEAKAGE TRAP — uses full 90-day window | -| 18 | `converted` | binary 0/1 | target | Converted within 90 days | - -Total: 17 features + 1 target = 18 columns. - -## Non-goals for v4 - -- v4 does NOT require engine changes to the simulation loop itself (stage transitions, churn, conversion hazard). -- v4 does NOT change the relational bundle format or task splits. -- v4 does NOT require a new recipe — it uses `b2b_saas_procurement_v1` with adjusted difficulty tuning. -- v4 does NOT need to change the `student_public` / `research_instructor` exposure modes. diff --git a/docs/v4/planning_pr_review.md b/docs/v4/planning_pr_review.md new file mode 100644 index 0000000..3290d34 --- /dev/null +++ b/docs/v4/planning_pr_review.md @@ -0,0 +1,67 @@ +# PR #19 Self-Review — Critical Assessment + +## Point 1: Unvalidated numbers (`category_effect_scale`, AUC target) + +**Issue:** `category_effect_scale: 1.8` appears in three docs as a known-good value, but has zero empirical backing. Similarly, `snapshot_day=21` gave AUC 0.622 in v3 — below the 0.65 floor mandated here. The plan has a circular dependency: AUC target assumes the engine change works, engine change spec assumes AUC target is reachable. + +**Treatment:** Run a spike experiment before merging. Patch `category_effect_scale` into mechanism policy, generate a 5000-lead bundle at day-21 snapshot, measure category spread + LR AUC. Record results in engine_changes_spec. Adjust the number before it's enshrined across multiple documents. 
+ +--- + +## Point 2: Five overlapping spec documents + +**Issue:** Five docs (~500 lines) to describe ~550 lines of implementation. Significant content overlap — "18 columns" appears in requirements, contract, AND validation spec. Missingness rates appear in three places. Changing one number means updating 3 files. + +**Treatment:** Consolidate into two docs: one design doc (`docs/v4/design.md`) covering requirements, contract, and engine changes; one validation spec (`docs/v4/validation_spec.md`). Use a single source of truth for shared constants (column list, missingness rates, AUC bounds). + +--- + +## Point 3: "No sim loop changes" masks a known bug + +**Issue:** Repeatedly emphasizing "no changes to engine.py" as a feature, when the actual problem — `is_sql=False → 0% conversion` creating deterministic groups — lives there. The plan designs around a bug it refuses to fix, but doesn't state that clearly. + +**Treatment:** Add an explicit "Known Limitations & Workarounds" section to the design doc. State plainly: v4's column set excludes `reached_sql` and `has_opportunity` because `is_sql=False → 0% conversion` is a simulation invariant we're choosing not to fix here. Link to a tracked issue for the future engine fix. Don't hide it in a deferred-items table. + +--- + +## Point 4: Arbitrary missingness rates + +**Issue:** Missingness rates (15%/2%/5% for web_sessions, 8%/1% for seniority) are unjustified. The validation check ("outbound > 3× inbound") is a tautology given the hardcoded rates. + +**Treatment:** State the pedagogical rationale: rates must be detectable at n=1000 with a chi-squared test at p<0.01, but not so extreme students can't impute. Validate this claim in the spike experiment. Acknowledge these are tunable, not ground truth. + +--- + +## Point 5: No failure mode handling + +**Issue:** The plan assumes every parameter works on the first try. 
No guidance for what to do when AUC is too low/high, leakage trap doesn't boost, or subsampling destroys signal. For a dataset on v4 because v1–v3 had unforeseen problems, this is remarkably optimistic. + +**Treatment:** Add a "Tuning Protocol" decision tree: +- AUC < 0.65 → increase `category_effect_scale` (try 2.0/2.5/3.0) +- AUC > 0.90 → decrease scale or add noise +- Leakage trap boost < 0.03 → widen snapshot window gap (day 14 instead of 21) +- Subsampling destroys signal → increase n_leads from 5000 to 10000 + +--- + +## Point 6: AGENTS.md will rot + +**Issue:** v4-specific content (file-per-milestone lists, validation checklist, testing commands) hardcoded into a permanent repo doc. Becomes stale noise after v4 ships. + +**Treatment:** Move v4-specific content into `docs/v4/` only. AGENTS.md keeps durable conventions plus a single pointer: "For v4 implementation details, see `docs/v4/`." Delete v4 content from AGENTS.md after v4 ships — or don't put it there in the first place. + +--- + +## Point 7: `expected_acv` underspecified + +**Issue:** "Opportunity ACV if opp created by snapshot; else band midpoint" — but what's the midpoint of "$100M+"? What if band is null? One table row in the longest spec doc. + +**Treatment:** Define band→midpoint mapping as an explicit lookup table in the design doc. Specify null-band behavior (population median or NaN). + +--- + +## Point 8: M1/M2 coupling is too rigid + +**Issue:** M1 (engine knob) and M2 (build pipeline) are tightly coupled — M1 can't be validated without M2's build script. "Strictly sequential" milestones pretend otherwise. In practice, both will be developed with feedback loops. + +**Treatment:** Merge M1 and M2 into a single milestone with two deliverables. One PR with both the engine knob and the build script, validated end-to-end, is more honest and more reviewable. 
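The Tuning Protocol from Point 5 could be encoded as a small pure function so that the build script reports the next action rather than only failing. This is a sketch: `next_tuning_action` is a hypothetical helper, the thresholds are the ones stated in the plan, and the "subsampling destroys signal" branch is left out because the review gives no measurable criterion for it.

```python
def next_tuning_action(
    auc_without_trap,
    auc_with_trap,
    auc_floor=0.65,
    auc_ceiling=0.90,
    min_trap_boost=0.03,
):
    """Map measured AUCs to the next step of the Point 5 tuning protocol."""
    if auc_without_trap < auc_floor:
        return "increase category_effect_scale (try 2.0/2.5/3.0)"
    if auc_without_trap > auc_ceiling:
        return "decrease scale or add noise"
    if auc_with_trap - auc_without_trap < min_trap_boost:
        return "widen snapshot window gap (day 14 instead of 21)"
    return "accept"
```

Checking the baseline AUC before the trap boost keeps the order of the decision tree: a leakage boost is only meaningful once the baseline sits inside the target band.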
diff --git a/docs/v4/validation_spec.md b/docs/v4/validation_spec.md index 4c79c62..33158b0 100644 --- a/docs/v4/validation_spec.md +++ b/docs/v4/validation_spec.md @@ -1,5 +1,7 @@ # v4 Validation Specification +> Companion to `design.md`. Column set, missingness rates, and AUC targets are defined there — this doc specifies the automated checks. + ## Overview v4 validation operates at two levels: diff --git a/scripts/spike_category_signal.py b/scripts/spike_category_signal.py new file mode 100644 index 0000000..27ba972 --- /dev/null +++ b/scripts/spike_category_signal.py @@ -0,0 +1,259 @@ +#!/usr/bin/env python3 +"""Spike experiment: measure category → conversion signal under different settings. + +Tests: +1. Baseline (current engine) — expect near-zero category signal +2. Correlated observables (1x boost) — seniority/revenue/source → latent traits +3. Correlated observables (1.8x and 2.5x boosts) — stronger correlation + +Reports category spread (max - min conversion rate) per categorical feature +and logistic regression AUC at day-21 snapshot. +""" + +from __future__ import annotations + +import sys +from pathlib import Path + +import numpy as np +import pandas as pd +from sklearn.linear_model import LogisticRegression +from sklearn.metrics import roc_auc_score +from sklearn.preprocessing import LabelEncoder + +# Ensure the package is importable.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) + +from leadforge.api.generator import Generator +from leadforge.render.snapshots import build_snapshot +from leadforge.simulation.engine import simulate_world +from leadforge.simulation.population import PopulationResult, build_population +from leadforge.structure.sampler import sample_hidden_graph + +SEED = 42 +N_LEADS = 5000 +SUBSAMPLE_N = 1000 +TARGET_RATE = 0.30 +CAT_FEATURES = [ + "industry", + "region", + "estimated_revenue_band", + "role_function", + "seniority", + "lead_source", +] + +# Base boosts (scale=1.0) +SENIORITY_BOOST = { + "individual_contributor": -0.15, + "manager": -0.05, + "director": 0.05, + "vp": 0.12, + "c_suite": 0.20, +} +REVENUE_BOOST = { + "$1M-$10M": -0.10, + "$10M-$50M": 0.0, + "$50M-$200M": 0.10, + "$200M+": 0.18, +} +SOURCE_BOOST = { + "partner_referral": 0.12, + "inbound_marketing": 0.05, + "sdr_outbound": -0.08, +} + + +def subsample(df: pd.DataFrame, rng: np.random.RandomState) -> pd.DataFrame: + """Stratified subsample to SUBSAMPLE_N rows at TARGET_RATE conversion.""" + positives = df[df["converted_within_90_days"]] + negatives = df[~df["converted_within_90_days"]] + n_pos = int(SUBSAMPLE_N * TARGET_RATE) + n_neg = SUBSAMPLE_N - n_pos + + if len(positives) < n_pos: + print(f" WARNING: only {len(positives)} positives, need {n_pos}") + n_pos = len(positives) + if len(negatives) < n_neg: + print(f" WARNING: only {len(negatives)} negatives, need {n_neg}") + n_neg = len(negatives) + + pos_sample = positives.sample(n=n_pos, random_state=rng) + neg_sample = negatives.sample(n=n_neg, random_state=rng) + return pd.concat([pos_sample, neg_sample]).sample(frac=1, random_state=rng) + + +def measure_category_spread(df: pd.DataFrame) -> dict[str, dict]: + """Conversion rate spread for groups with n >= 50, plus per-value detail.""" + results = {} + for col in CAT_FEATURES: + if col not in df.columns: + continue + stats = df.groupby(col)["converted_within_90_days"].agg(["mean", 
"count"]) + large = stats[stats["count"] >= 50] + spread = float(large["mean"].max() - large["mean"].min()) if len(large) >= 2 else 0.0 + # Show per-value rates for groups with n >= 30 + detail = stats[stats["count"] >= 30].sort_values("mean", ascending=False) + results[col] = { + "spread": spread, + "detail": {str(v): (f"{r['mean']:.1%}", int(r["count"])) for v, r in detail.iterrows()}, + } + return results + + +def measure_auc(df: pd.DataFrame) -> float: + """Logistic regression AUC using all snapshot features.""" + feature_cols = [c for c in df.columns if c != "converted_within_90_days"] + x_df = df[feature_cols].copy() + y = df["converted_within_90_days"].astype(int) + + for col in x_df.select_dtypes(include=["object", "category"]).columns: + le = LabelEncoder() + x_df[col] = le.fit_transform(x_df[col].astype(str)) + + x_df = x_df.select_dtypes(include=[np.number]) + x_df = x_df.fillna(x_df.median()) + + lr = LogisticRegression(max_iter=1000, random_state=42) + lr.fit(x_df, y) + probs = lr.predict_proba(x_df)[:, 1] + return float(roc_auc_score(y, probs)) + + +def patch_population(pop: PopulationResult, scale: float = 1.0) -> None: + """Correlate observable categories with latent traits.""" + # Seniority → latent_contact_authority + for contact in pop.contacts: + cid = contact.contact_id + if cid in pop.latent_state.contact_latents: + boost = SENIORITY_BOOST.get(contact.seniority, 0.0) * scale + traits = pop.latent_state.contact_latents[cid] + traits["latent_contact_authority"] = max( + 0.0, min(1.0, traits["latent_contact_authority"] + boost) + ) + + # Revenue band → latent_account_fit + for account in pop.accounts: + aid = account.account_id + if aid in pop.latent_state.account_latents: + boost = REVENUE_BOOST.get(account.estimated_revenue_band, 0.0) * scale + traits = pop.latent_state.account_latents[aid] + traits["latent_account_fit"] = max(0.0, min(1.0, traits["latent_account_fit"] + boost)) + + # Lead source → latent_engagement_propensity + for lead in 
pop.leads: + cid = lead.contact_id + if cid in pop.latent_state.contact_latents: + boost = SOURCE_BOOST.get(lead.lead_source, 0.0) * scale + traits = pop.latent_state.contact_latents[cid] + traits["latent_engagement_propensity"] = max( + 0.0, min(1.0, traits["latent_engagement_propensity"] + boost) + ) + + +def run_pipeline(label: str, gen: Generator, scale: float | None = None) -> float: + """Generate, optionally patch, simulate, snapshot, subsample, measure; return train AUC.""" + print(f"\n{'=' * 60}") + print(f" {label}") + print(f"{'=' * 60}") + + config = gen._world_spec.config + narrative = gen._world_spec.narrative + if narrative is None: + raise RuntimeError("No narrative loaded") + + world_graph = sample_hidden_graph(config.seed) + print(f" Motif family: {world_graph.motif_family}") + + pop = build_population(config, narrative, world_graph) + + if scale is not None: + patch_population(pop, scale=scale) + + sim = simulate_world(config, pop, world_graph) + snapshot = build_snapshot(sim, pop) + + raw_rate = snapshot["converted_within_90_days"].mean() + print(f" Raw conversion rate: {raw_rate:.1%} (n={len(snapshot)})") + + rng = np.random.RandomState(SEED) + df = subsample(snapshot, rng) + actual_rate = df["converted_within_90_days"].mean() + print(f" Subsampled: n={len(df)}, conversion={actual_rate:.1%}") + + results = measure_category_spread(df) + print("\n Category spreads (groups n>=50):") + for feat in CAT_FEATURES: + if feat not in results: + continue + info = results[feat] + print(f"\n {feat}: spread={info['spread']:.1%}") + for val, (rate, n) in info["detail"].items(): + marker = "*" if n >= 50 else " " + print(f" {marker} {val:30s} rate={rate} n={n}") + + auc = measure_auc(df) + print(f"\n Logistic regression AUC (train): {auc:.3f}") + return auc + + +def main() -> None: + results: dict[str, float] = {} + + # Experiment 1: Baseline + gen = Generator.from_recipe( + "b2b_saas_procurement_v1", + seed=SEED, + exposure_mode="research_instructor", + n_leads=N_LEADS, + difficulty="intro", + )
+ results["baseline"] = run_pipeline("BASELINE (current engine)", gen, scale=None) + + # Experiment 2: Scale 1.0 + gen2 = Generator.from_recipe( + "b2b_saas_procurement_v1", + seed=SEED, + exposure_mode="research_instructor", + n_leads=N_LEADS, + difficulty="intro", + ) + results["scale_1.0"] = run_pipeline("PATCHED scale=1.0", gen2, scale=1.0) + + # Experiment 3: Scale 1.8 + gen3 = Generator.from_recipe( + "b2b_saas_procurement_v1", + seed=SEED, + exposure_mode="research_instructor", + n_leads=N_LEADS, + difficulty="intro", + ) + results["scale_1.8"] = run_pipeline("PATCHED scale=1.8", gen3, scale=1.8) + + # Experiment 4: Scale 2.5 + gen4 = Generator.from_recipe( + "b2b_saas_procurement_v1", + seed=SEED, + exposure_mode="research_instructor", + n_leads=N_LEADS, + difficulty="intro", + ) + results["scale_2.5"] = run_pipeline("PATCHED scale=2.5", gen4, scale=2.5) + + # Summary + print(f"\n{'=' * 60}") + print(" SUMMARY") + print(f"{'=' * 60}") + for label, auc in results.items(): + print(f" {label:<30s} AUC={auc:.3f}") + + print() + print(" KEY FINDING: The spec's approach of scaling CategoricalInfluence") + print(" weights in LatentScore is incorrect — CategoricalInfluence is") + print(" not used in the conversion score. The correct approach is to") + print(" correlate observable categories with latent traits during") + print(" population generation.") + + +if __name__ == "__main__": + main()