leadforge-dev · shaypal5 · Apr 29, 2026
diff --git a/.agent-plan.md b/.agent-plan.md
@@ -12,39 +12,35 @@
 
 ## Next Up — v4 Lead Scoring Dataset
 
-The primary focus is producing a v4 lead scoring dataset that fixes the issues found in v1–v3 datasets. This requires targeted engine changes followed by dataset build scripts.
+The primary focus is producing a v4 lead scoring dataset that fixes the issues found in v1–v3 datasets. This requires targeted engine changes + a build pipeline, followed by dataset release.
 
-See `docs/v4/implementation_plan.md` for full details.
+See `docs/v4/design.md` for full details.
 
-### v4-M0: Requirements + planning ⬜ (this PR)
+### v4-M0: Planning + spike ⬜ (this PR)
 
-- [x] `docs/v4/lead_scoring_v4_requirements.md`
-- [x] `docs/v4/dataset_contract.md`
-- [x] `docs/v4/validation_spec.md`
-- [x] `docs/v4/engine_changes_spec.md`
-- [x] `docs/v4/implementation_plan.md`
-- [x] Updated `CLAUDE.md` with repo map + generation commands
-- [x] Updated `AGENTS.md` with v4 implementation guide
-- [x] Updated `.agent-plan.md` (this file)
+- [x] `docs/v4/design.md` — consolidated requirements, contract, engine changes, implementation plan
+- [x] `docs/v4/validation_spec.md` — automated validation checks
+- [x] `docs/v4/planning_pr_review.md` — self-review and treatment plan
+- [x] `scripts/spike_category_signal.py` — spike experiment validating category signal approach
+- [x] Updated `CLAUDE.md`, `AGENTS.md`, `.agent-plan.md`
 
-### v4-M1: Engine — category signal + windowed snapshots ⬜
+### v4-M1: Engine + build pipeline ⬜
 
-- [ ] Add `category_effect_scale` to difficulty profiles
-- [ ] Apply scale in `mechanisms/policies.py`
-- [ ] Add `snapshot_day` parameter to `render/snapshots.py`
-- [ ] Add new features: `touches_week_1`, `days_since_first_touch`, `expected_acv`
-- [ ] Add new `FeatureSpec` entries to `schema/features.py`
-- [ ] Tests for all changes
-- [ ] Verify category spread ≥15% for key features at intro difficulty
+Engine changes:
+- [ ] Add `category_latent_correlations` to `difficulty_profiles.yaml` (intro profile)
+- [ ] Apply correlations in `simulation/population.py` after initial latent sampling
+- [ ] Add `snapshot_day` parameter to `render/snapshots.py` with windowed aggregation
+- [ ] Add new features: `touches_week_1`, `days_since_first_touch`, `expected_acv`, `total_touches_all`
+- [ ] Add `FeatureSpec` entries to `schema/features.py`
+- [ ] Tests for all engine changes (backward compat + v4 mode)
 
-### v4-M2: Build pipeline + validation ⬜
-
-- [ ] `scripts/build_v4_snapshot.py` — day-21 snapshot + leakage trap + structured missingness
+Build pipeline:
+- [ ] `scripts/build_v4_snapshot.py` — day-21 snapshot + leakage trap + structured missingness + subsampling
 - [ ] `scripts/validate_v4_dataset.py` — full validation per `docs/v4/validation_spec.md`
-- [ ] Generate test dataset and verify all checks pass
+- [ ] End-to-end: generate bundle → build CSV → validate → all checks pass
 - [ ] LR AUC 0.65–0.90 (without trap); ≥0.03 boost with trap
 
-### v4-M3: Documentation + release ⬜
+### v4-M2: Documentation + release ⬜
 
 - [ ] Generate `lead_scoring_intro_v4.csv` (in datasets-private repo)
 - [ ] Write `RELEASE_v4.md`
@@ -62,7 +58,7 @@ See `docs/v4/implementation_plan.md` for full details.
 | M12: CLI `--json` flag | Deferred | No consumer needs it yet; add post-v4 |
 | M12: CLI `--strict` flag | Deferred | Per-check control is better than global flag |
 | M12: CLI help text polish | Deferred | Low priority vs dataset |
-| M14: Sample bundle commit | Absorbed into v4-M3 | v4 dataset IS the sample |
+| M14: Sample bundle commit | Absorbed into v4-M2 | v4 dataset IS the sample |
 | M14: Notebook 1 (inspecting world) | Deferred | Do after v4 ships |
 | M14: Notebook 2 (lead scoring baseline) | Deferred | v4 validation script covers this |
 | M14: Notebook 3 (public vs instructor) | Discarded | No current audience |
@@ -83,11 +79,10 @@ See `docs/v4/implementation_plan.md` for full details.
 
 ## Context Pointers
 
-- v4 requirements: `docs/v4/lead_scoring_v4_requirements.md`
-- v4 dataset contract: `docs/v4/dataset_contract.md`
-- v4 engine changes: `docs/v4/engine_changes_spec.md`
+- v4 design (requirements, contract, engine changes, plan): `docs/v4/design.md`
 - v4 validation spec: `docs/v4/validation_spec.md`
-- v4 implementation plan: `docs/v4/implementation_plan.md`
+- v4 self-review: `docs/v4/planning_pr_review.md`
+- Spike experiment: `scripts/spike_category_signal.py`
 - Existing roadmap: `docs/leadforge_implementation_plan.md`
 - CLI commands: `leadforge/cli/commands/`
 - Validation modules: `leadforge/validation/`
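
Note on the AUC gate in the plan above ("LR AUC 0.65–0.90 (without trap); ≥0.03 boost with trap"): the check could be sketched as below. This is a hedged illustration only. The names `pairwise_auc` and `auc_gates` are hypothetical and do not come from `scripts/validate_v4_dataset.py`, and the sketch assumes the logistic regression scores have already been produced elsewhere. AUC is computed here with the plain pairwise (Mann-Whitney) definition so the example needs nothing beyond the standard library.

```python
from typing import Sequence, Tuple


def pairwise_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of rows,
    which is acceptable for a dataset of roughly 1,000 leads.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative row")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_gates(auc_without_trap: float, auc_with_trap: float) -> Tuple[bool, bool]:
    """Return (in_range, boosted) for the two thresholds quoted in the plan."""
    in_range = 0.65 <= auc_without_trap <= 0.90
    boosted = (auc_with_trap - auc_without_trap) >= 0.03
    return in_range, boosted


if __name__ == "__main__":
    # Toy scores, not real leadforge output.
    auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
    print(auc)                   # 0.75
    print(auc_gates(auc, 0.80))  # (True, True)
```

Returning the two gates separately keeps a failure message specific: a dataset can be in range yet show no boost from the leakage trap, or the reverse.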

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -1,6 +1,6 @@
 repos:
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.4.5
+    rev: v0.11.13
     hooks:
       - id: ruff
         args: [--fix]

diff --git a/AGENTS.md b/AGENTS.md
@@ -28,75 +28,6 @@ See CLAUDE.md for the full mandatory branch/PR workflow (branch → commit → u
 
 ---
 
-## v4 Implementation Guide
+## v4 Implementation
 
-### What is v4?
-
-A pedagogically improved lead scoring dataset (single CSV) for an intro ML course. The engine changes are small and targeted. See `docs/v4/` for full specs.
-
-### Implementation order
-
-```
-v4-M0 (planning PR — already done)
-  └── v4-M1: engine changes (category signal + windowed snapshots)
-        └── v4-M2: build pipeline + validation scripts
-              └── v4-M3: dataset generation + release docs
-```
-
-### Key files to modify per milestone
-
-**v4-M1 (engine):**
-- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_effect_scale`
-- `leadforge/mechanisms/policies.py` — apply scale to categorical influences
-- `leadforge/render/snapshots.py` — add `snapshot_day` param, windowed aggregation
-- `leadforge/schema/features.py` — add new FeatureSpec entries
-- Tests in `tests/mechanisms/` and `tests/render/`
-
-**v4-M2 (build pipeline):**
-- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
-- `scripts/validate_v4_dataset.py` (new) — dataset-level validation
-- These live in the leadforge repo (not datasets-private)
-
-**v4-M3 (release):**
-- Work in `leadforge-datasets-private` repo
-- `lead_scoring_intro/lead_scoring_intro_v4.csv`
-- `lead_scoring_intro/RELEASE_v4.md`
-
-### Coding conventions for v4
-
-1. **Backward compatibility:** All engine changes must default to current behavior. New parameters must have defaults that produce identical output when unset.
-2. **No simulation loop changes:** Do not modify the daily step logic in `engine.py`. v4 changes are in mechanism weights and snapshot rendering only.
-3. **Temporal correctness:** Every feature computation must be explicitly gated by snapshot day. Use `event_timestamp <= lead_created_at + snapshot_day` — never `<`.
-4. **Test coverage:** Every new parameter and feature must have unit tests. Test both `snapshot_day=None` (backward compat) and `snapshot_day=21` (v4 mode).
-5. **Determinism:** All new stochastic operations must use seeded RNG. Verify with a determinism test (same seed → identical output).
-
-### Validation checklist for v4 dataset
-
-Before declaring v4-M2 complete, the dataset must pass:
-
-- [ ] 1,000 rows, 18 columns
-- [ ] 30% conversion rate (±1%)
-- [ ] No deterministic groups (n≥50 at 0% or 100% conversion)
-- [ ] LR AUC 0.65–0.90 (without leakage trap)
-- [ ] LR AUC boost ≥0.03 when leakage trap included
-- [ ] `web_sessions` missingness: outbound rate > 3× inbound rate
-- [ ] `seniority` missingness: partner_referral rate > 3× others
-- [ ] Reproducible with seed 42
-- [ ] `total_touches_all` uses full 90-day data (confirmed by AUC boost)
-
-### How to test engine changes locally
-
-```bash
-# Quick smoke test: generate a small bundle and inspect
-leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 --difficulty intro --n-leads 1000 --out /tmp/test_bundle
-leadforge validate /tmp/test_bundle
-
-# Check category signal spread
-python -c "
-import pandas as pd
-df = pd.read_parquet('/tmp/test_bundle/tasks/converted_within_90_days/train.parquet')
-for col in ['role_function', 'seniority', 'estimated_revenue_band']:
-    rates = df.groupby(col)['converted_within_90_days'].mean()
-    print(f'{col}: spread={rates.max()-rates.min():.1%}')
-"
-```
+For v4 dataset design, engine changes, validation spec, and implementation plan, see `docs/v4/design.md` and `docs/v4/validation_spec.md`.
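
Note on the temporal-correctness and test-coverage conventions in the guide removed above: the inclusive snapshot gate (`event_timestamp <= lead_created_at + snapshot_day`) and the `snapshot_day=None` fallback could look roughly like the sketch below. The function `count_touches` and its arguments are hypothetical and are not taken from `leadforge/render/snapshots.py`; the sketch assumes `snapshot_day` is a whole number of days and that touches arrive as plain `datetime` values.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def count_touches(
    lead_created_at: datetime,
    touch_timestamps: Iterable[datetime],
    snapshot_day: Optional[int] = None,
) -> int:
    """Count touches visible at the snapshot.

    snapshot_day=None keeps the unwindowed behavior (every touch counts).
    Otherwise only touches at or before lead_created_at + snapshot_day days
    are counted; the comparison is inclusive (<=), never strict (<).
    """
    touches = list(touch_timestamps)
    if snapshot_day is None:
        return len(touches)
    cutoff = lead_created_at + timedelta(days=snapshot_day)
    return sum(1 for ts in touches if ts <= cutoff)


if __name__ == "__main__":
    created = datetime(2026, 1, 1)
    touches = [
        created + timedelta(days=3),
        created + timedelta(days=21),  # exactly on the cutoff, so it is kept
        created + timedelta(days=40),  # after the window, so it is dropped
    ]
    print(count_touches(created, touches, snapshot_day=21))  # 2
    print(count_touches(created, touches))                   # 3
```

The two calls at the bottom mirror the two modes the guide asks tests to cover: `snapshot_day=21` for v4 and `snapshot_day=None` for backward compatibility.
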
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -349,16 +349,14 @@ Exception: deliberately included leakage traps (e.g., `total_touches_all` in v4)
 ## v4 Dataset Plan
 
 The current focus is producing a v4 lead scoring intro dataset. See `docs/v4/` for:
-- `lead_scoring_v4_requirements.md` — what v4 must achieve
-- `dataset_contract.md` — schema contract and temporal gates
-- `engine_changes_spec.md` — what changes in the engine
+- `design.md` — requirements, contract, engine changes, implementation plan (single source of truth)
 - `validation_spec.md` — automated validation checks
-- `implementation_plan.md` — milestone breakdown
+- `planning_pr_review.md` — self-review of the planning PR and treatment plan
 
 ---
 
 ## Reference Docs
 - Design decisions: `docs/leadforge_design_doc.md`
 - Architecture/spec: `docs/leadforge_architecture_spec.md`
 - Implementation roadmap: `docs/leadforge_implementation_plan.md`
-- v4 dataset plan: `docs/v4/implementation_plan.md`
+- v4 dataset plan: `docs/v4/design.md`
diff --git a/docs/v4/dataset_contract.md b/docs/v4/dataset_contract.md