diff --git a/.agent-plan.md b/.agent-plan.md
index fd215c5..9226797 100644
--- a/.agent-plan.md
+++ b/.agent-plan.md
@@ -12,39 +12,35 @@
 
 ## Next Up — v4 Lead Scoring Dataset
 
-The primary focus is producing a v4 lead scoring dataset that fixes the issues found in v1–v3 datasets. This requires targeted engine changes followed by dataset build scripts.
+The primary focus is producing a v4 lead scoring dataset that fixes the issues found in the v1–v3 datasets. This requires targeted engine changes and a build pipeline, followed by the dataset release.
 
-See `docs/v4/implementation_plan.md` for full details.
+See `docs/v4/design.md` for full details.
 
-### v4-M0: Requirements + planning ⬜ (this PR)
+### v4-M0: Planning + spike ⬜ (this PR)
 
-- [x] `docs/v4/lead_scoring_v4_requirements.md`
-- [x] `docs/v4/dataset_contract.md`
-- [x] `docs/v4/validation_spec.md`
-- [x] `docs/v4/engine_changes_spec.md`
-- [x] `docs/v4/implementation_plan.md`
-- [x] Updated `CLAUDE.md` with repo map + generation commands
-- [x] Updated `AGENTS.md` with v4 implementation guide
-- [x] Updated `.agent-plan.md` (this file)
+- [x] `docs/v4/design.md` — consolidated requirements, contract, engine changes, implementation plan
+- [x] `docs/v4/validation_spec.md` — automated validation checks
+- [x] `docs/v4/planning_pr_review.md` — self-review and treatment plan
+- [x] `scripts/spike_category_signal.py` — spike experiment validating category signal approach
+- [x] Updated `CLAUDE.md`, `AGENTS.md`, `.agent-plan.md`
 
-### v4-M1: Engine — category signal + windowed snapshots ⬜
+### v4-M1: Engine + build pipeline ⬜
 
-- [ ] Add `category_effect_scale` to difficulty profiles
-- [ ] Apply scale in `mechanisms/policies.py`
-- [ ] Add `snapshot_day` parameter to `render/snapshots.py`
-- [ ] Add new features: `touches_week_1`, `days_since_first_touch`, `expected_acv`
-- [ ] Add new `FeatureSpec` entries to `schema/features.py`
-- [ ] Tests for all changes
-- [ ] Verify category spread ≥15% for key features at intro difficulty
+Engine changes:
+- [ ] Add `category_latent_correlations` to `difficulty_profiles.yaml` (intro profile)
+- [ ] Apply correlations in `simulation/population.py` after initial latent sampling
+- [ ] Add `snapshot_day` parameter to `render/snapshots.py` with windowed aggregation
+- [ ] Add new features: `touches_week_1`, `days_since_first_touch`, `expected_acv`, `total_touches_all`
+- [ ] Add `FeatureSpec` entries to `schema/features.py`
+- [ ] Tests for all engine changes (backward compat + v4 mode)
 
-### v4-M2: Build pipeline + validation ⬜
-
-- [ ] `scripts/build_v4_snapshot.py` — day-21 snapshot + leakage trap + structured missingness
+Build pipeline:
+- [ ] `scripts/build_v4_snapshot.py` — day-21 snapshot + leakage trap + structured missingness + subsampling
 - [ ] `scripts/validate_v4_dataset.py` — full validation per `docs/v4/validation_spec.md`
-- [ ] Generate test dataset and verify all checks pass
+- [ ] End-to-end: generate bundle → build CSV → validate → all checks pass
 - [ ] LR AUC 0.65–0.90 (without trap); ≥0.03 boost with trap
 
-### v4-M3: Documentation + release ⬜
+### v4-M2: Documentation + release ⬜
 
 - [ ] Generate `lead_scoring_intro_v4.csv` (in datasets-private repo)
 - [ ] Write `RELEASE_v4.md`
@@ -62,7 +58,7 @@ See `docs/v4/implementation_plan.md` for full details.
 | M12: CLI `--json` flag | Deferred | No consumer needs it yet; add post-v4 |
 | M12: CLI `--strict` flag | Deferred | Per-check control is better than global flag |
 | M12: CLI help text polish | Deferred | Low priority vs dataset |
-| M14: Sample bundle commit | Absorbed into v4-M3 | v4 dataset IS the sample |
+| M14: Sample bundle commit | Absorbed into v4-M2 | v4 dataset IS the sample |
 | M14: Notebook 1 (inspecting world) | Deferred | Do after v4 ships |
 | M14: Notebook 2 (lead scoring baseline) | Deferred | v4 validation script covers this |
 | M14: Notebook 3 (public vs instructor) | Discarded | No current audience |
@@ -83,11 +79,10 @@ See `docs/v4/implementation_plan.md` for full details.
 
 ## Context Pointers
 
-- v4 requirements: `docs/v4/lead_scoring_v4_requirements.md`
-- v4 dataset contract: `docs/v4/dataset_contract.md`
-- v4 engine changes: `docs/v4/engine_changes_spec.md`
+- v4 design (requirements, contract, engine changes, plan): `docs/v4/design.md`
 - v4 validation spec: `docs/v4/validation_spec.md`
-- v4 implementation plan: `docs/v4/implementation_plan.md`
+- v4 self-review: `docs/v4/planning_pr_review.md`
+- Spike experiment: `scripts/spike_category_signal.py`
 - Existing roadmap: `docs/leadforge_implementation_plan.md`
 - CLI commands: `leadforge/cli/commands/`
 - Validation modules: `leadforge/validation/`
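Reviewer sketch for the `scripts/build_v4_snapshot.py` subsampling item in the plan above. This is a minimal illustration, not the planned script: it assumes a pandas DataFrame with a binary `converted` column, and the function name `subsample_to_rate` is hypothetical. It draws fixed class counts with a seeded `np.random.RandomState`, as the design notes require, so the same seed gives the same rows.

```python
import numpy as np
import pandas as pd


def subsample_to_rate(df, target_col="converted", n_rows=1000, pos_rate=0.30, seed=42):
    """Draw n_rows rows with an exact positive share of pos_rate (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    n_pos = int(round(n_rows * pos_rate))
    n_neg = n_rows - n_pos

    pos_idx = df.index[df[target_col] == 1].to_numpy()
    neg_idx = df.index[df[target_col] == 0].to_numpy()
    if len(pos_idx) < n_pos or len(neg_idx) < n_neg:
        raise ValueError("source bundle too small for the requested class counts")

    # Sample each class without replacement, then shuffle so classes are interleaved.
    keep = np.concatenate([
        rng.choice(pos_idx, size=n_pos, replace=False),
        rng.choice(neg_idx, size=n_neg, replace=False),
    ])
    keep = rng.permutation(keep)
    return df.loc[keep].reset_index(drop=True)
```

If the source bundle has too few rows in either class, the helper raises instead of silently changing the rate, which matches the tuning advice to regenerate with a larger `n_leads`.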
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 37495e0..56d794b 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -1,6 +1,6 @@
 repos:
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.4.5
+    rev: v0.11.13
     hooks:
       - id: ruff
         args: [--fix]
diff --git a/AGENTS.md b/AGENTS.md
index d67ced5..c72daa2 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -28,75 +28,6 @@ See CLAUDE.md for the full mandatory branch/PR workflow (branch → commit → u
 
 ---
 
-## v4 Implementation Guide
+## v4 Implementation
 
-### What is v4?
-
-A pedagogically improved lead scoring dataset (single CSV) for an intro ML course. The engine changes are small and targeted. See `docs/v4/` for full specs.
-
-### Implementation order
-
-```
-v4-M0 (planning PR — already done)
-  └── v4-M1: engine changes (category signal + windowed snapshots)
-        └── v4-M2: build pipeline + validation scripts
-              └── v4-M3: dataset generation + release docs
-```
-
-### Key files to modify per milestone
-
-**v4-M1 (engine):**
-- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_effect_scale`
-- `leadforge/mechanisms/policies.py` — apply scale to categorical influences
-- `leadforge/render/snapshots.py` — add `snapshot_day` param, windowed aggregation
-- `leadforge/schema/features.py` — add new FeatureSpec entries
-- Tests in `tests/mechanisms/` and `tests/render/`
-
-**v4-M2 (build pipeline):**
-- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
-- `scripts/validate_v4_dataset.py` (new) — dataset-level validation
-- These live in the leadforge repo (not datasets-private)
-
-**v4-M3 (release):**
-- Work in `leadforge-datasets-private` repo
-- `lead_scoring_intro/lead_scoring_intro_v4.csv`
-- `lead_scoring_intro/RELEASE_v4.md`
-
-### Coding conventions for v4
-
-1. **Backward compatibility:** All engine changes must default to current behavior. New parameters must have defaults that produce identical output when unset.
-2. **No simulation loop changes:** Do not modify the daily step logic in `engine.py`. v4 changes are in mechanism weights and snapshot rendering only.
-3. **Temporal correctness:** Every feature computation must be explicitly gated by snapshot day. Use `event_timestamp <= lead_created_at + snapshot_day` — never `<`.
-4. **Test coverage:** Every new parameter and feature must have unit tests. Test both `snapshot_day=None` (backward compat) and `snapshot_day=21` (v4 mode).
-5. **Determinism:** All new stochastic operations must use seeded RNG. Verify with a determinism test (same seed → identical output).
-
-### Validation checklist for v4 dataset
-
-Before declaring v4-M2 complete, the dataset must pass:
-
-- [ ] 1,000 rows, 18 columns
-- [ ] 30% conversion rate (±1%)
-- [ ] No deterministic groups (n≥50 at 0% or 100% conversion)
-- [ ] LR AUC 0.65–0.90 (without leakage trap)
-- [ ] LR AUC boost ≥0.03 when leakage trap included
-- [ ] `web_sessions` missingness: outbound rate > 3× inbound rate
-- [ ] `seniority` missingness: partner_referral rate > 3× others
-- [ ] Reproducible with seed 42
-- [ ] `total_touches_all` uses full 90-day data (confirmed by AUC boost)
-
-### How to test engine changes locally
-
-```bash
-# Quick smoke test: generate a small bundle and inspect
-leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 --difficulty intro --n-leads 1000 --out /tmp/test_bundle
-leadforge validate /tmp/test_bundle
-
-# Check category signal spread
-python -c "
-import pandas as pd
-df = pd.read_parquet('/tmp/test_bundle/tasks/converted_within_90_days/train.parquet')
-for col in ['role_function', 'seniority', 'estimated_revenue_band']:
-    rates = df.groupby(col)['converted_within_90_days'].mean()
-    print(f'{col}: spread={rates.max()-rates.min():.1%}')
-"
-```
+For v4 dataset design, engine changes, validation spec, and implementation plan, see `docs/v4/design.md` and `docs/v4/validation_spec.md`.
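Reviewer sketch for the source-conditional missingness checks that the removed checklist describes (`web_sessions` outbound rate > 3× inbound rate). It assumes a DataFrame with a `lead_source` column; the helper name `inject_source_conditional_nan` and its signature are hypothetical, and the rates are the approximate values quoted in the v4 docs rather than tuned constants.

```python
import numpy as np
import pandas as pd

# Approximate per-source missingness rates for web_sessions, taken from the v4 docs.
WEB_SESSIONS_RATES = {
    "sdr_outbound": 0.15,
    "inbound_marketing": 0.02,
    "partner_referral": 0.05,
}


def inject_source_conditional_nan(df, column, rates, default_rate, rng, source_col="lead_source"):
    """Blank out `column` with a probability that depends on the lead source."""
    out = df.copy()
    # Per-row probability of missingness; sources without an entry use default_rate.
    p = out[source_col].map(rates).astype(float).fillna(default_rate).to_numpy()
    mask = rng.random_sample(len(out)) < p
    # Series.where keeps values where the condition is True and sets NaN elsewhere.
    out[column] = out[column].where(~mask)
    return out
```

Passing the seeded `RandomState` in explicitly keeps the injection deterministic and lets the same helper be reused for `seniority` with a different rate table.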
diff --git a/CLAUDE.md b/CLAUDE.md
index 1e3ae7e..c45eb93 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -349,11 +349,9 @@ Exception: deliberately included leakage traps (e.g., `total_touches_all` in v4)
 ## v4 Dataset Plan
 
 The current focus is producing a v4 lead scoring intro dataset. See `docs/v4/` for:
-- `lead_scoring_v4_requirements.md` — what v4 must achieve
-- `dataset_contract.md` — schema contract and temporal gates
-- `engine_changes_spec.md` — what changes in the engine
+- `design.md` — requirements, contract, engine changes, implementation plan (single source of truth)
 - `validation_spec.md` — automated validation checks
-- `implementation_plan.md` — milestone breakdown
+- `planning_pr_review.md` — self-review of the planning PR and treatment plan
 
 ---
 
@@ -361,4 +359,4 @@ The current focus is producing a v4 lead scoring intro dataset. See `docs/v4/` f
 - Design decisions: `docs/leadforge_design_doc.md`
 - Architecture/spec: `docs/leadforge_architecture_spec.md`
 - Implementation roadmap: `docs/leadforge_implementation_plan.md`
-- v4 dataset plan: `docs/v4/implementation_plan.md`
+- v4 dataset plan: `docs/v4/design.md`
diff --git a/docs/v4/dataset_contract.md b/docs/v4/dataset_contract.md
deleted file mode 100644
index 4b7cb0a..0000000
--- a/docs/v4/dataset_contract.md
+++ /dev/null
@@ -1,72 +0,0 @@
-# v4 Dataset Contract
-
-## Snapshot definition
-
-- **Snapshot day:** Day 21 after `lead_created_at` (configurable, default 21).
-- **Observation window:** Days 0–21 inclusive. All features computed from events in this window only.
-- **Prediction horizon:** Days 22–90. The target `converted` reflects whether `closed_won` occurs in the full 90-day window.
-- **Temporal guarantee:** No feature (except the explicitly marked leakage trap) uses information from after the snapshot day.
-
-## What is pre-snapshot (valid for features)
-
-| Data source | Temporal gate |
-|---|---|
-| Account attributes | Static — always valid |
-| Contact attributes | Static — always valid |
-| Lead metadata (source, etc.) | Lead creation — always valid |
-| Touch events | `touch_timestamp ≤ lead_created_at + snapshot_day` |
-| Session events | `session_timestamp ≤ lead_created_at + snapshot_day` |
-| Sales activity events | `activity_timestamp ≤ lead_created_at + snapshot_day` |
-| Opportunity records | `opportunity.created_at ≤ lead_created_at + snapshot_day` |
-| ACV estimates | From opportunity if available by snapshot; else account heuristic |
-
-## What is post-snapshot (invalid for features)
-
-| Data | Why invalid |
-|---|---|
-| `current_stage` at day 90 | Contains `closed_won` / `closed_lost` — outcome data |
-| `is_sql` (final state flag) | Engine invariant: `is_sql=False` → never converts. Deterministic. |
-| `conversion_timestamp` | Direct outcome information |
-| Touch/session/activity events after snapshot day | Future data |
-| Opportunity close outcome | Post-outcome |
-| `total_touches_all` | ⚠️ Intentional leakage trap — counts full 90-day touches |
-
-## Leakage trap contract
-
-The feature `total_touches_all` deliberately violates the snapshot boundary:
-- It counts touches over the **full 90-day simulation**, not just up to snapshot.
-- It is included to teach students about temporal leakage detection.
-- It must be clearly marked in the feature dictionary and release notes.
-- The validation script must detect it and flag it (but not fail the build).
-- Removing this feature should drop AUC by ≥0.03.
-
-## Missingness contract
-
-| Column | Pattern | Rate | Condition |
-|---|---|---|---|
-| `days_since_last_touch` | Structural | Natural | NaN when `total touches == 0` by snapshot |
-| `web_sessions` | Source-conditional | ~15% for `sdr_outbound`, ~2% for `inbound_marketing`, ~5% for `partner_referral` | CRM tracking gaps |
-| `seniority` | Source-conditional | ~8% for `partner_referral`, ~1% for others | Referral partners omit contact details |
-| `days_since_last_touch` | Additional MCAR | ~3% | Random CRM logging gaps (on top of structural) |
-
-## Target definition
-
-```
-converted = 1  if lead reached closed_won within 90 days of lead_created_at
-converted = 0  otherwise (including closed_lost, still in funnel, churned)
-```
-
-The target is derived from simulated events, never directly sampled.
-
-## Subsampling contract
-
-- Source bundle: 5,000 leads generated with `b2b_saas_procurement_v1`, seed 42, difficulty intro.
-- Stratified subsampling to 1,000 rows at ~30% conversion rate.
-- All negatives retained (up to 700); positives downsampled.
-- Subsampling preserves within-class feature distributions.
-
-## Reproducibility
-
-- Seed: 42 (or documented if changed).
-- All stochastic operations use `np.random.RandomState(seed)` or derived substreams.
-- Same (seed, recipe, leadforge version) → byte-identical CSV output.
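Reviewer sketch of the temporal gate from the contract above (`timestamp ≤ lead_created_at + snapshot_day`, inclusive). It is not the `build_snapshot()` implementation: the function name `count_events_by_snapshot` and the column names `lead_id`, `timestamp`, and `lead_created_at` are assumptions made for illustration.

```python
import pandas as pd


def count_events_by_snapshot(leads, events, snapshot_day=None):
    """Count events per lead, optionally gated to the observation window."""
    merged = events.merge(leads[["lead_id", "lead_created_at"]], on="lead_id", how="inner")
    if snapshot_day is not None:
        # Inclusive gate: an event exactly on the snapshot day still counts.
        cutoff = merged["lead_created_at"] + pd.Timedelta(days=snapshot_day)
        merged = merged[merged["timestamp"] <= cutoff]
    counts = merged.groupby("lead_id").size()
    # Leads with no events in the window get 0 rather than NaN.
    return leads["lead_id"].map(counts).astype(float).fillna(0).astype(int).rename("n_events")
```

With `snapshot_day=None` the count covers the full horizon, which is how a leakage-trap column such as `total_touches_all` differs from the gated touch counts.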
diff --git a/docs/v4/design.md b/docs/v4/design.md
new file mode 100644
index 0000000..c6971ee
--- /dev/null
+++ b/docs/v4/design.md
@@ -0,0 +1,324 @@
+# v4 Lead Scoring Dataset — Design Document
+
+> Single source of truth for the v4 dataset: requirements, contract, engine changes, and implementation plan.
+> Validation checks are in the companion `validation_spec.md`.
+
+---
+
+## Prior versions and lessons
+
+| Version | Key issue | Lesson |
+|---|---|---|
+| v1 | `funnel_stage` contained `closed_won`/`closed_lost` — perfect leakage | Must validate that no single feature determines the target |
+| v2 | Snapshot at day 90 with 90-day target — post-mortem, not prediction | Snapshot must be strictly earlier than outcome horizon |
+| v2 | `reached_sql=0` → 0% conversion (n=127); `has_opportunity=1` → 0% (n=235) | Binary proxies from engine invariants create deterministic groups |
+| v3 | Day-21 snapshot + non-deterministic proxies — clean but AUC only 0.62 | Engine's intro difficulty produces flat category effects; early features lack signal |
+
+---
+
+## Requirements
+
+### R1 — Operational decision framing (expected ACV)
+
+Include an `expected_acv` numeric feature so students can compute `P(conversion) × expected_acv` and practice value-aware ranking.
+
+**ACV derivation (single source of truth):**
+
+| Condition | Value |
+|---|---|
+| Opportunity created by snapshot day | Opportunity's `estimated_acv` |
+| No opportunity, `estimated_revenue_band` known | Band midpoint (see table below) |
+| No opportunity, band unknown | NaN |
+
+**Revenue band → ACV midpoint mapping:**
+
+| Band | Midpoint ($k) |
+|---|---|
+| $1M–$10M | 25 |
+| $10M–$50M | 55 |
+| $50M–$200M | 85 |
+| $200M+ | 140 |
+
+These midpoints are derived from the engine's `_EMPLOYEE_ACV_RANGES` in `simulation/engine.py`, which maps employee bands to ACV ranges. Since the dataset exposes `estimated_revenue_band` (not employee band), the midpoints approximate the overlap between revenue bands and the engine's ACV sampling.
+
+### R2 — Safe temporal / momentum features
+
+- `touches_week_1`: touches in days 0–7 after lead creation. Strictly pre-snapshot.
+- `days_since_first_touch`: `snapshot_day - first_touch_day`. NaN if no touches.
+
+### R3 — Structured missingness (MAR, not only MCAR)
+
+Three patterns, each with a pedagogical rationale:
+
+| Column | Pattern | Rates | Rationale |
+|---|---|---|---|
+| `days_since_last_touch` | Structural | NaN when no touches by snapshot | Natural — no event to measure from |
+| `web_sessions` | Source-conditional | ~15% `sdr_outbound`, ~2% `inbound_marketing`, ~5% `partner_referral` | CRM web tracking often not configured for outbound leads |
+| `seniority` | Source-conditional | ~8% `partner_referral`, ~1% others | Referral partners don't always provide full contact details |
+| `days_since_last_touch` | Additional MCAR | ~3% on top of structural | Random CRM logging gaps |
+
+**Why these specific rates:** They are chosen to be detectable at n≈1000 with a chi-squared test at p<0.01 (the outbound/inbound ratio for `web_sessions` is ~7.5×, well above the 3× detection threshold), but not so extreme that imputation becomes trivial. These are tunable parameters, not ground truth — the validation spec checks the *ratio* (>3×), not the exact rates.
+
+### R4 — Deliberate leakage trap
+
+`total_touches_all` counts ALL touches over the full 90-day window, violating the snapshot boundary. It is strongly predictive but not deterministic. It must be labeled in the release notes and feature dictionary but NOT revealed in the student-facing `BACKGROUND.md`.
+
+### R5 — Reduce redundancy
+
+Drop `total_touches` (= `inbound_touches + outbound_touches`). Keep the breakdown.
+
+### R6 — Stronger category signal
+
+See "Engine change 1" below. Target: ≥15% spread for at least two category features; baseline LR AUC 0.65–0.90.
+
+### R7 — Robust automated validation
+
+See `validation_spec.md`.
+
+---
+
+## Target column set
+
+| # | Column | Type | Source | Notes |
+|---|---|---|---|---|
+| 1 | `industry` | categorical | account | 4 values |
+| 2 | `region` | categorical | account | US, UK |
+| 3 | `company_size` | categorical | account | 4 bands |
+| 4 | `company_revenue` | categorical | account | 4 bands |
+| 5 | `contact_role` | categorical | contact | 4 roles |
+| 6 | `seniority` | categorical | contact | 5 levels (~8% missing for partner_referral) |
+| 7 | `lead_source` | categorical | lead | 3 channels |
+| 8 | `opportunity_created` | binary 0/1 | derived | Opp opened by snapshot day |
+| 9 | `demo_completed` | binary 0/1 | derived | Demo done by snapshot day |
+| 10 | `expected_acv` | numeric | derived | See R1 ACV derivation table |
+| 11 | `inbound_touches` | integer | events ≤ snapshot | Inbound touchpoints |
+| 12 | `outbound_touches` | integer | events ≤ snapshot | Outbound touchpoints |
+| 13 | `touches_week_1` | integer | events ≤ day 7 | First-week touch intensity |
+| 14 | `web_sessions` | integer | events ≤ snapshot | Sessions (~15% missing for outbound) |
+| 15 | `sales_activities` | integer | events ≤ snapshot | Sales activities count |
+| 16 | `days_since_last_touch` | float | events ≤ snapshot | Natural NaN when no touches |
+| 17 | `total_touches_all` | integer | **ALL events** | Leakage trap — full 90-day window |
+| 18 | `converted` | binary 0/1 | target | Converted within 90 days |
+
+Total: 17 features + 1 target = 18 columns.
+
+---
+
+## Snapshot contract
+
+- **Snapshot day:** 21 (configurable).
+- **Observation window:** Days 0–21 inclusive.
+- **Prediction horizon:** Days 22–90.
+- **Temporal guarantee:** No feature except `total_touches_all` uses post-snapshot data.
+
+| Data source | Temporal gate |
+|---|---|
+| Account/contact/lead attributes | Static — always valid |
+| Touch/session/activity events | `timestamp ≤ lead_created_at + snapshot_day` |
+| Opportunity records | `opportunity.created_at ≤ lead_created_at + snapshot_day` |
+
+### Target definition
+
+```
+converted = 1  if lead reached closed_won within 90 days of lead_created_at
+converted = 0  otherwise (including closed_lost, still in funnel, churned)
+```
+
+Derived from simulated events, never directly sampled.
+
+### Subsampling
+
+- Source bundle: 5,000 leads, `b2b_saas_procurement_v1`, seed 42, difficulty intro.
+- Stratified subsampling to 1,000 rows at ~30% conversion.
+- Subsampling uses `np.random.RandomState(seed)` for reproducibility.
+
+---
+
+## Engine changes
+
+### Change 1: Stronger category signal via population-level correlation
+
+#### Problem
+
+Observable categories (seniority, revenue band, lead source) are drawn **independently** of latent traits in `population.py`. The conversion hazard uses only latent traits (via `LatentScore`). Therefore the category → conversion correlation is near-zero by construction.
+
+The v3 dataset confirms this: category spreads are 2–11% and LR AUC is 0.62.
+
+Note: `CategoricalInfluence` exists in `mechanisms/categorical.py` but is **never wired** into `assign_mechanisms()` or the simulation loop. The `MechanismContext` only passes latent traits, stage, time, and dwell days — not observable categories.
+
+#### Solution
+
+Correlate observable categories with latent traits during population generation. This is a population-layer change — no simulation loop modifications needed.
+
+Add a `category_latent_correlations` mapping to the difficulty profile, applied in `build_population()` after initial latent sampling:
+
+| Observable | Latent trait | Boost per value |
+|---|---|---|
+| `seniority` | `latent_contact_authority` | individual_contributor: −0.27, manager: −0.09, director: +0.09, vp: +0.22, c_suite: +0.36 |
+| `estimated_revenue_band` | `latent_account_fit` | $1M–$10M: −0.18, $10M–$50M: 0.0, $50M–$200M: +0.18, $200M+: +0.32 |
+| `lead_source` | `latent_engagement_propensity` | sdr_outbound: −0.14, inbound_marketing: +0.09, partner_referral: +0.22 |
+
+These are the scale=1.8 boosts from the spike experiment (`scripts/spike_category_signal.py`).
+
+#### Spike experiment results (seed 42, 5000 leads, fit_dominant motif)
+
+| Setting | AUC | seniority spread | revenue spread | role_function spread |
+|---|---|---|---|---|
+| Baseline (no correlation) | 0.663 | 9.5% | 10.8% | 11.0% |
+| Scale 1.0 | 0.650 | 5.5% | 10.3% | 3.7% |
+| **Scale 1.8** | **0.694** | **22.1%** | **15.2%** | 1.7% |
+| Scale 2.5 | 0.701 | 11.9% | 15.1% | 3.2% |
+
+**Observations:**
+- Scale 1.8 gives AUC 0.694, within the [0.65, 0.90] target.
+- `seniority` and `estimated_revenue_band` exceed the 15% spread target.
+- `role_function` gets **no boost** from this approach because there is no natural latent trait to correlate it with. The spike shows that `role_function` spread is driven entirely by noise and varies widely across scale settings (1.7%–11.0%).
+- The `fit_dominant` motif gives zero weight to `latent_contact_authority`, so the seniority boost only works through its indirect correlation with other traits at population level. Different motif families will produce different spread profiles.
+- At scale 2.5, seniority spread *decreases* (11.9%) due to [0, 1] clamp saturation.
+
+#### Important caveats
+
+1. The spike tested only `fit_dominant` motif (seed 42). Other motifs weight different latent traits, so the same boosts will produce different category spreads. The implementation should test across all 5 motif families.
+2. `role_function` signal remains weak. If a `role_function` spread ≥15% is required, a separate mechanism is needed: either role-specific latent biases in `population.py`, or wiring `CategoricalInfluence` into the conversion score, which would require simulation loop changes. For v4, we accept that not all categories will have strong signal.
+3. The boost values are empirical, not principled. They should be treated as starting points, not final values.
+
+#### Files affected
+
+- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_latent_correlations`
+- `leadforge/simulation/population.py` — apply correlations after initial latent sampling
+- `tests/simulation/test_population.py` — test correlation application, backward compat
+
+### Change 2: Windowed snapshot (`snapshot_day` parameter)
+
+Add `snapshot_day: int | None` to `build_snapshot()`. When set, all event aggregations filter to `timestamp ≤ lead_created_at + snapshot_day`. Default `None` preserves current behavior (full horizon).
+
+New features computed in the snapshot:
+- `touches_week_1` — count touches where `days_after_creation ≤ 7`
+- `days_since_first_touch` — `snapshot_day - first_touch_day` (NaN if no touches)
+- `expected_acv` — see R1 derivation table above
+- `total_touches_all` — full-horizon touch count (ignoring snapshot gate)
+
+#### Files affected
+
+- `leadforge/render/snapshots.py` — add `snapshot_day`, windowed filtering, new features
+- `leadforge/schema/features.py` — add `FeatureSpec` entries
+- `tests/render/test_snapshots.py` — test windowed aggregation
+
+### Change 3: Structured missingness + leakage trap (build script only)
+
+These are NOT engine changes. The build script (`scripts/build_v4_snapshot.py`) applies missingness injection and computes the leakage trap after snapshot construction.
+
+#### Files affected
+
+- `scripts/build_v4_snapshot.py` (new)
+- No changes to `leadforge/` core for missingness
+
+---
+
+## Known limitations and workarounds
+
+### `is_sql=False → never converts` (engine invariant)
+
+The simulation engine requires leads to pass through SQL stage before converting. This creates a deterministic group: `is_sql=False` → 0% conversion. v4 works around this by:
+
+- **Excluding** `is_sql` and `reached_sql` from the column set entirely
+- Using `opportunity_created` and `demo_completed` as non-deterministic binary proxies instead
+
+A proper fix would modify the conversion hazard in `engine.py` to allow rare direct conversions. This is tracked as a deferred item — it would benefit v5+ but is out of scope for v4.
+
+### `role_function` lacks signal
+
+The population-level correlation approach provides no mechanism for `role_function` to influence conversion (there is no natural latent trait to map it to). In the v4 dataset, `role_function` spread will be noise-driven (roughly 2–11% in the spike). This is acceptable for an intro course but should be addressed in a future engine revision.
+
+---
+
+## Tuning protocol
+
+If validation checks fail during implementation, use these adjustments:
+
+| Failure | Adjustment |
+|---|---|
+| AUC < 0.65 | Increase boost scale (try 2.0, 2.5, 3.0) |
+| AUC > 0.90 | Decrease boost scale or add noise to latent correlations |
+| Leakage trap boost < 0.03 | Widen snapshot gap (try day 14 instead of 21) to increase information delta |
+| Subsampling destroys signal | Increase `n_leads` from 5000 to 10000 before subsampling |
+| Category spread < 15% for seniority/revenue | Increase individual boost magnitudes for that feature |
+| Deterministic group detected | Check which feature/value, adjust boost or drop the feature |
+
+---
+
+## Implementation plan
+
+### Milestone structure
+
+v4 work is split into **two implementation milestones** plus the planning PR:
+
+```
+v4-M0 (this PR — planning + spike)
+  └── v4-M1: engine + build pipeline (single PR)
+        └── v4-M2: dataset generation + release docs
+```
+
+v4-M1 merges the engine changes and build pipeline into one milestone because:
+- The engine change (population-level correlations) cannot be validated without the build script
+- The build script depends on the snapshot_day parameter from the engine change
+- A single PR containing both, validated end-to-end, is easier to review than two PRs that cannot be validated independently
+
+### v4-M1: Engine + build pipeline
+
+**Deliverables:**
+1. `difficulty_profiles.yaml` — `category_latent_correlations` for intro profile
+2. `simulation/population.py` — apply correlations during population generation
+3. `render/snapshots.py` — `snapshot_day` parameter, windowed aggregation, new features
+4. `schema/features.py` — new `FeatureSpec` entries
+5. `scripts/build_v4_snapshot.py` — day-21 snapshot + missingness + leakage trap + subsampling
+6. `scripts/validate_v4_dataset.py` — validation per `validation_spec.md`
+7. Tests for all changes
+
+**Acceptance criteria:**
+- [ ] No correlation (`category_latent_correlations` absent or empty) → identical output to current engine
+- [ ] Scale 1.8 correlations → seniority and revenue_band spread ≥15%
+- [ ] `snapshot_day=21` correctly filters events
+- [ ] `touches_week_1` counts only days 0–7
+- [ ] `expected_acv` uses ACV derivation table
+- [ ] Build script produces 1000 rows × 18 columns at 30% conversion
+- [ ] Validation script passes all mandatory checks
+- [ ] LR AUC in [0.65, 0.90] without trap; ≥0.03 boost with trap
+- [ ] All existing tests pass
+- [ ] Reproducible with seed 42
+
+### v4-M2: Documentation + release
+
+**Deliverables (in leadforge-datasets-private):**
+1. `lead_scoring_intro/lead_scoring_intro_v4.csv`
+2. `lead_scoring_intro/RELEASE_v4.md`
+3. Updated README
+
+**Deliverables (in leadforge):**
+1. Updated `.agent-plan.md`
+
+**Acceptance criteria:**
+- [ ] CSV passes all validation checks
+- [ ] RELEASE_v4.md documents snapshot day, target definition, changes from v3, leakage trap
+- [ ] Previous versions marked as superseded
+
+### Relationship to existing roadmap
+
+| Existing milestone | v4 interaction |
+|---|---|
+| M0–M11 | Complete, no changes |
+| M12 (CLI polish) | **Deferred** — low priority vs v4 |
+| M14 (Sample datasets + notebooks) | **Absorbed** — v4 dataset IS the sample |
+| M15 (Docs polish + v1.0 RC) | **Deferred** — do after v4 |
+
+Discarded: M14 notebooks 3–4 (no current audience).
+
+---
+
+## Non-goals
+
+- v4 does NOT modify the simulation loop (`engine.py` daily step logic).
+- v4 does NOT change the relational bundle format or task splits.
+- v4 does NOT add new recipes.
+- v4 does NOT change exposure modes.
+- v4 does NOT fix the `is_sql=False → never converts` invariant (deferred).
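Reviewer sketch for "Change 1" in `docs/v4/design.md` above. It assumes the per-value boosts are added to the latent trait and then clamped to [0, 1]; the design doc mentions the clamp, but the additive form and the helper name `apply_category_boost` are assumptions, not the planned `build_population()` code. The boost values are the scale-1.8 seniority numbers from the design table.

```python
import numpy as np
import pandas as pd

# Scale-1.8 seniority boosts for latent_contact_authority, copied from the design table.
SENIORITY_BOOSTS = {
    "individual_contributor": -0.27,
    "manager": -0.09,
    "director": 0.09,
    "vp": 0.22,
    "c_suite": 0.36,
}


def apply_category_boost(latent, categories, boosts):
    """Shift a latent trait by a per-category boost, then clamp to [0, 1]."""
    # Categories without a configured boost are left unchanged (shift of 0).
    shift = categories.map(boosts).astype(float).fillna(0.0).to_numpy()
    return np.clip(latent + shift, 0.0, 1.0)
```

An empty mapping leaves the latent values untouched, which is the backward-compatibility behavior the acceptance criteria ask for. A latent value of 0.9 with the `c_suite` boost clamps to 1.0, which illustrates the saturation effect noted for larger scales.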
diff --git a/docs/v4/engine_changes_spec.md b/docs/v4/engine_changes_spec.md
deleted file mode 100644
index d3712ca..0000000
--- a/docs/v4/engine_changes_spec.md
+++ /dev/null
@@ -1,170 +0,0 @@
-# v4 Engine Changes Specification
-
-## Overview
-
-v4 requires **two categories** of changes to the leadforge codebase:
-1. **Mechanism / difficulty tuning** — make intro difficulty produce stronger category-level signal.
-2. **Snapshot builder enhancements** — compute windowed aggregates, ACV derivation, structured missingness, and the leakage trap feature.
-
-Neither category requires changes to the simulation loop itself (`engine.py`'s daily step logic). The simulation produces the same event stream; we change how features are derived from it.
-
----
-
-## Change 1: Stronger category signal at intro difficulty
-
-### Problem
-
-The current mechanism policy (`mechanisms/policies.py`) produces conversion rates that are nearly uniform across categories at intro difficulty. For example, `contact_role` spreads only 11% (25.6%–36.7% after subsampling). This yields LR AUC ~0.62, which is too low for a useful teaching dataset.
-
-### Root cause
-
-The `assign_mechanisms()` function builds a `LatentScore` with weights that are quite flat across categories. The intro difficulty profile specifies `signal_strength: 0.90` but this controls noise scale, not the magnitude of category effects.
-
-### Solution
-
-Add **category effect multipliers** to the difficulty profile YAML:
-
-```yaml
-intro:
-  # ... existing fields ...
-  category_effect_scale: 1.8  # amplify category → latent score effects
-```
-
-In `mechanisms/policies.py`, scale the `CategoricalInfluence` weights by `category_effect_scale` when building the `LatentScore`. This widens the gap between, say, `vp_finance` and `it_director` conversion rates without changing the overall noise structure.
-
-### Target outcome
-
-After this change + subsampling to 30%, category spreads should be:
-- `contact_role`: ≥15% spread
-- `company_revenue`: ≥12% spread
-- `seniority`: ≥10% spread
-- Baseline LR AUC: 0.70–0.85
-
-### Files affected
-
-- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_effect_scale`
-- `leadforge/mechanisms/policies.py` — use `category_effect_scale` when building categorical influences
-- `tests/mechanisms/test_policies.py` — test that different scales produce different spread
-
-### Risk
-
-Low. The change is additive (new config field with a default of 1.0 for backward compatibility). Existing tests continue to pass at `category_effect_scale=1.0`.
-
----
-
-## Change 2: Snapshot builder — windowed aggregates and new features
-
-### Problem
-
-The current `render/snapshots.py` computes all aggregates over the full simulation horizon. v4 needs aggregates gated by a configurable snapshot day, plus new derived features.
-
-### Solution
-
-Add a new function or extend `build_snapshot()` to accept a `snapshot_day` parameter:
-
-```python
-def build_snapshot(
-    result: SimulationResult,
-    population: PopulationResult,
-    horizon_days: int = 90,
-    snapshot_day: int | None = None,  # NEW — default None means use horizon_days
-) -> pd.DataFrame:
-```
-
-When `snapshot_day` is set, all event aggregations filter to events within `[lead_created_at, lead_created_at + snapshot_day]`.
-
-### New features to compute
-
-| Feature | Computation | Notes |
-|---|---|---|
-| `touches_week_1` | Count touches where `days_after_creation ≤ 7` | Momentum signal |
-| `days_since_first_touch` | `snapshot_day - first_touch_day` (NaN if no touches) | Lead age signal |
-| `expected_acv` | Opportunity ACV if opp created by snapshot; else employee_band midpoint | Value feature |
-| `total_touches_all` | Count of ALL touches over full horizon (ignoring snapshot gate) | Leakage trap |
-
-### Files affected
-
-- `leadforge/render/snapshots.py` — add `snapshot_day` parameter, windowed filtering, new feature computations
-- `leadforge/schema/features.py` — add new `FeatureSpec` entries for the new columns
-- `tests/render/test_snapshots.py` — test windowed aggregation correctness
-
-### Risk
-
-Medium. The snapshot builder is well-tested but core to correctness. The `snapshot_day` parameter should be additive (default `None` preserves existing behavior). New features are computed alongside existing ones.
-
----
-
-## Change 3: Structured missingness injection
-
-### Problem
-
-Current missingness is MCAR (random injection). v4 needs conditional missingness.
-
-### Solution
-
-Add a missingness injection step to the v4 build script (NOT to the engine's `build_snapshot`). This keeps the engine's output clean and makes missingness a dataset-packaging concern.
-
-The build script (`scripts/build_v4_snapshot.py`) applies missingness after snapshot construction:
-
-```python
-def inject_missingness(df: pd.DataFrame, rng: np.random.RandomState) -> pd.DataFrame:
-    # 1. web_sessions: 15% missing for sdr_outbound, 2% inbound, 5% partner
-    for source, rate in [("sdr_outbound", 0.15), ("inbound_marketing", 0.02), ("partner_referral", 0.05)]:
-        mask = (df["lead_source"] == source) & (rng.random(len(df)) < rate)
-        df.loc[mask, "web_sessions"] = np.nan
-
-    # 2. seniority: 8% missing for partner_referral, 1% for others
-    partner_mask = (df["lead_source"] == "partner_referral") & (rng.random(len(df)) < 0.08)
-    other_mask = (df["lead_source"] != "partner_referral") & (rng.random(len(df)) < 0.01)
-    df.loc[partner_mask | other_mask, "seniority"] = np.nan
-
-    # 3. days_since_last_touch: additional 3% MCAR on top of structural NaN
-    dslt_mask = rng.random(len(df)) < 0.03
-    df.loc[dslt_mask, "days_since_last_touch"] = np.nan
-
-    return df
-```
-
-### Files affected
-
-- `scripts/build_v4_snapshot.py` (new) — missingness injection
-- No changes to `leadforge/` core modules for missingness
-
-### Risk
-
-Low. Missingness is applied post-generation, outside the engine.
-
----
-
-## Change 4: Leakage trap feature
-
-### Problem
-
-Students need a feature that looks valid but violates temporal boundaries.
-
-### Solution
-
-The v4 build script computes `total_touches_all` by counting ALL touches in the full 90-day window (not gated by snapshot day). This is computed alongside the snapshot but uses different temporal filtering.
-
-### Files affected
-
-- `scripts/build_v4_snapshot.py` — compute `total_touches_all` from full event stream
-- Feature dictionary and release notes — mark as leakage trap
-
-### Risk
-
-None to the engine. The trap is a build-script concern.
-
----
-
-## Summary of engine vs. script changes
-
-| Change | Where | Risk |
-|---|---|---|
-| Category effect scaling | `leadforge/` core (mechanisms, difficulty profiles) | Low |
-| Snapshot `snapshot_day` parameter | `leadforge/` core (render/snapshots) | Medium |
-| New features (ACV, momentum, first_touch) | `leadforge/` core (render/snapshots, schema/features) | Medium |
-| Structured missingness | `scripts/` (build script only) | Low |
-| Leakage trap | `scripts/` (build script only) | None |
-
-Total engine-side changes: ~200–400 lines across 4–5 files. Build script: ~250 lines new.
diff --git a/docs/v4/implementation_plan.md b/docs/v4/implementation_plan.md
deleted file mode 100644
index 89b8238..0000000
--- a/docs/v4/implementation_plan.md
+++ /dev/null
@@ -1,172 +0,0 @@
-# v4 Implementation Plan
-
-## Overview
-
-This plan implements the v4 lead scoring dataset in 4 milestones across 4–6 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria.
-
-## Relationship to existing roadmap
-
-v4 work slots into the existing leadforge roadmap as follows:
-
-| Existing milestone | Status | v4 interaction |
-|---|---|---|
-| M0–M11 | ✅ Complete | No changes needed |
-| M12 (CLI polish) | ⬜ Planned | **Deferred** — low priority vs v4 dataset needs. Integrate after v4. |
-| M13 (Validation harness) | ✅ Implemented as M11 | v4 extends with dataset-level validation |
-| M14 (Sample datasets + notebooks) | ⬜ Planned | **Absorbed into v4-M3** — v4 dataset IS the sample dataset |
-| M15 (Docs polish + v1.0 RC) | ⬜ Planned | **Deferred** — do after v4 ships |
-
-### Explicitly discarded items
-
-| Item | Rationale |
-|---|---|
-| M12 `--json` flag for inspect/validate | Nice-to-have; no dataset consumer needs it yet. Can add later. |
-| M12 `--strict` flag for validate | Validation strictness is better controlled per-check, not globally. |
-| M14 Notebook 3 (public vs instructor comparison) | No current audience for this; instructor mode is not used in the course. |
-| M14 Notebook 4 (recipe customization walkthrough) | Premature — recipe system is stable but not user-facing yet. |
-
-### Explicitly kept / integrated items
-
-| Item | How it maps to v4 |
-|---|---|
-| M14 Sample bundle generation | v4-M2 generates the source bundle |
-| M14 Lead-scoring baseline notebook | v4-M3 includes a validation notebook or script |
-| M15 Docs audit | v4-M0 updates CLAUDE.md and AGENTS.md; v4-M3 produces RELEASE_v4.md |
-
----
-
-## v4 Milestones
-
-### v4-M0: Requirements, contract, and agent instructions
-
-**Goal:** Establish the v4 dataset contract and update repo documentation so implementation can begin immediately.
-
-**Deliverables:**
-- `docs/v4/lead_scoring_v4_requirements.md` — full requirements
-- `docs/v4/dataset_contract.md` — schema contract, temporal gates, missingness
-- `docs/v4/validation_spec.md` — automated check specifications
-- `docs/v4/engine_changes_spec.md` — what changes where and why
-- `docs/v4/implementation_plan.md` — this file
-- Updated `CLAUDE.md` — repository map, generation/validation commands
-- Updated `AGENTS.md` — implementation conventions for v4 work
-- Updated `.agent-plan.md` — reflects v4 as next work
-
-**Acceptance criteria:**
-- [ ] All docs are internally consistent
-- [ ] CLAUDE.md contains repo map and commands
-- [ ] .agent-plan.md points to v4 milestones
-- [ ] No contradictions with existing architecture docs
-
-**PR:** This PR (the planning PR).
-
----
-
-### v4-M1: Engine — category signal tuning + snapshot enhancements
-
-**Goal:** Make the engine produce datasets with stronger category signal and support windowed snapshot computation.
-
-**Deliverables:**
-1. `difficulty_profiles.yaml` — add `category_effect_scale: 1.8` to intro profile
-2. `mechanisms/policies.py` — apply `category_effect_scale` to categorical influence weights
-3. `render/snapshots.py` — add optional `snapshot_day` parameter for windowed aggregation
-4. `schema/features.py` — add `FeatureSpec` entries for new columns (`touches_week_1`, `days_since_first_touch`, `expected_acv`)
-5. Tests for all changes
-
-**Acceptance criteria:**
-- [ ] `category_effect_scale=1.0` produces identical output to current engine (backward compat)
-- [ ] `category_effect_scale=1.8` produces category spreads ≥15% for `contact_role`
-- [ ] `snapshot_day=21` correctly filters events to first 21 days
-- [ ] `touches_week_1` counts only days 0–7 touches
-- [ ] `expected_acv` uses opportunity ACV when available, else band midpoint
-- [ ] All existing tests pass
-- [ ] New tests cover the new parameters
-
-**Estimated size:** ~400 lines diff across 5 files + tests.
-
-**PR:** Single PR: `feat: v4 engine — category signal tuning + windowed snapshots`
-
----
-
-### v4-M2: Build pipeline — v4 snapshot builder + structured missingness
-
-**Goal:** Create the v4 build script that transforms a generated bundle into the final CSV.
-
-**Deliverables:**
-1. `scripts/build_v4_snapshot.py` — snapshot builder with:
-   - Day-21 windowed features
-   - Leakage trap feature (`total_touches_all`)
-   - Structured missingness injection
-   - Stratified subsampling to 1,000 rows / 30% conversion
-   - Column selection and renaming
-2. `scripts/validate_v4_dataset.py` — validation script per validation spec
-3. Generated `lead_scoring_intro_v4.csv` (in datasets repo, not leadforge)
-
-**Acceptance criteria:**
-- [ ] Build script produces 1,000 rows × 18 columns
-- [ ] Conversion rate is 30% (±1%)
-- [ ] `total_touches_all` uses full 90-day data (leakage trap)
-- [ ] `web_sessions` missing rate for outbound > 3× inbound rate
-- [ ] `seniority` missing rate for partner_referral > 3× others
-- [ ] `days_since_last_touch` has structural + injected NaNs
-- [ ] Validation script passes all mandatory checks
-- [ ] Baseline LR AUC (without trap) in [0.65, 0.90]
-- [ ] LR AUC boost with trap ≥ 0.03
-- [ ] No deterministic groups (n≥50 at 0% or 100%)
-- [ ] Reproducible with seed 42
-
-**Estimated size:** ~350 lines (build script) + ~200 lines (validator).
-
-**PR:** Single PR: `feat: v4 build pipeline + validation`
-
----
-
-### v4-M3: Documentation + release
-
-**Goal:** Produce the final dataset files and release documentation.
-
-**Deliverables (in leadforge-datasets-private repo):**
-1. `lead_scoring_intro/lead_scoring_intro_v4.csv`
-2. `lead_scoring_intro/RELEASE_v4.md`
-3. Updated `lead_scoring_intro/BACKGROUND.md` (if needed for v4 framing)
-4. Updated `README.md` (dataset index)
-
-**Deliverables (in leadforge repo):**
-1. Updated `.agent-plan.md` reflecting completion
-
-**Acceptance criteria:**
-- [ ] CSV passes all validation checks
-- [ ] RELEASE_v4.md documents snapshot day, target definition, changes from v3, leakage trap
-- [ ] README in datasets repo marks v4 as recommended
-- [ ] Previous versions marked as superseded
-
-**PR:** Two PRs (one per repo).
-
----
-
-## Dependency graph
-
-```
-v4-M0 (this PR)
-  └── v4-M1 (engine changes)
-        └── v4-M2 (build pipeline + validation)
-              └── v4-M3 (docs + release)
-```
-
-Strictly sequential — each milestone depends on the previous.
-
----
-
-## Timeline estimate
-
-Not providing time estimates per project convention. The work is 4 PRs of moderate size (~300–500 lines each).
-
----
-
-## What this plan does NOT do
-
-- Does not change the simulation loop (`engine.py` daily step logic)
-- Does not change the relational bundle format
-- Does not change exposure modes
-- Does not add new recipes
-- Does not implement M12 (CLI polish) — deferred
-- Does not implement the engine fix for `is_sql=False → never converts` (deferred to a separate issue; v4 avoids `is_sql` entirely)
diff --git a/docs/v4/lead_scoring_v4_requirements.md b/docs/v4/lead_scoring_v4_requirements.md
deleted file mode 100644
index ff2e146..0000000
--- a/docs/v4/lead_scoring_v4_requirements.md
+++ /dev/null
@@ -1,131 +0,0 @@
-# Lead Scoring Dataset v4 — Requirements
-
-## Purpose
-
-This document defines the requirements for the **v4 lead scoring intro dataset**, the primary pedagogical output of leadforge for a BA-level intro ML course. It is informed by three prior dataset iterations (v1–v3) and the lessons learned from each.
-
-## Prior version history and lessons
-
-| Version | Key issue | What we learned |
-|---|---|---|
-| v1 | `funnel_stage` contained `closed_won`/`closed_lost` — perfect leakage | Must validate that no single feature determines the target |
-| v2 | Snapshot at day 90 with 90-day target — post-mortem, not prediction | Snapshot must be strictly earlier than outcome horizon |
-| v2 | `reached_sql=0` → 0% conversion (n=127); `has_opportunity=1` → 0% (n=235) | Binary proxies from engine invariants create deterministic groups |
-| v3 | Day-21 snapshot + non-deterministic proxies — clean but AUC only 0.62 | Engine's intro difficulty produces flat category effects; early features lack signal |
-
-## v4 requirements
-
-### R1 — Operational decision framing (capacity + value)
-
-**Problem:** v1–v3 frame lead scoring as pure classification. Real lead scoring is a **decision tool** — ranking leads by expected value, not just probability.
-
-**Requirement:**
-- Include an `expected_acv` numeric feature (estimated annual contract value) available at snapshot time.
-- The feature must be derived from the opportunity table (for leads with an opportunity by snapshot) or from account-level heuristics (employee band → ACV range midpoint) for leads without one.
-- This enables students to compute `expected_value = P(conversion) × expected_acv` and practice ranking/top-K selection.
-
-**Engine change needed:** The snapshot builder must join opportunity ACV data gated by snapshot day, with a fallback to account-band heuristic ACV.
-
-### R2 — Safe temporal / momentum features
-
-**Problem:** v1–v3 engagement features are cumulative counts with no temporal shape. Real lead scoring uses recency and momentum signals.
-
-**Requirement:**
-- Include exactly one momentum feature: `touches_week_1` (touches in days 0–7 after lead creation).
-- This is strictly pre-snapshot (snapshot is at day 21+) and gives students a "first-week intensity" signal to compare against total touches.
-- Additionally, `days_since_first_touch` (snapshot_day minus day of first touch) provides a lead-age signal.
-
-**Engine change needed:** The snapshot builder must compute windowed aggregates from event timestamps.
-
-### R3 — Structured missingness (not only MCAR)
-
-**Problem:** v1–v3 inject missingness randomly (MCAR). Real CRM data has structured gaps.
-
-**Requirement:** Implement three missingness patterns:
-1. **Natural (structural):** `days_since_last_touch` is NaN when `total_touches == 0` (no touches recorded). Already exists but must be preserved.
-2. **Conditional on source:** `web_sessions` is missing for ~15% of `sdr_outbound` leads (CRM tracking often not set up for outbound-sourced leads) but only ~2% of `inbound_marketing` leads.
-3. **Role data gap:** `seniority` is missing for ~8% of `partner_referral` leads (referral partners don't always provide full contact details).
-
-**Engine change needed:** Missingness injection in the snapshot builder, conditioned on feature values.
-
-### R4 — Deliberate leakage trap
-
-**Problem:** Students need to practice identifying leakage, but v1–v3 either have accidental leakage (bad) or none at all (missed teaching opportunity).
-
-**Requirement:**
-- Include one feature `total_touches_all` that counts **all** touches over the full 90-day window, not just up to snapshot.
-- This feature is strongly predictive (uses future data) but not perfectly deterministic (it correlates with but doesn't fully determine conversion).
-- The feature MUST be clearly labeled as "intentionally invalid — included for leakage discussion" in `RELEASE_v4.md` and the feature dictionary.
-- The validation script must flag it, but the v4 build script intentionally includes it.
-- The `BACKGROUND.md` / student instructions must NOT reveal the trap — students should discover it through EDA.
-
-**Engine change needed:** The snapshot builder computes a second touch count using the full horizon.
-
-### R5 — Reduce redundancy
-
-**Problem:** `total_touches = inbound_touches + outbound_touches` is a perfect linear dependency. Students may be confused by it, or models waste a degree of freedom.
-
-**Requirement:**
-- Drop `total_touches` from v4. Keep `inbound_touches` and `outbound_touches` as the touch breakdown.
-- Note: `total_touches_all` (the leakage trap from R4) is a different feature and is kept.
-- Document this as a teaching point: "you can derive total from inbound + outbound."
-
-### R6 — Stronger category signal
-
-**Problem:** At intro difficulty, category conversion rates span only 2–11%. This makes the dataset nearly impossible to model well (AUC ~0.62).
-
-**Requirement:**
-- The engine must produce category-level conversion rate spreads of at least 15–25% for key features (`contact_role`, `company_revenue`, `seniority`).
-- Target baseline LR AUC: **0.70–0.85** (after snapshot + subsampling).
-- This requires engine changes to the difficulty profile or mechanism weights, not just post-hoc manipulation.
-
-**Engine change needed:** Adjust intro difficulty profile or mechanism policy to produce wider category effects.
-
-### R7 — Robust automated validation
-
-**Requirement:** The v4 dataset must pass all of the following automated checks:
-
-| Check | Criterion |
-|---|---|
-| No banned columns | No `current_stage`, `funnel_stage`, `conversion_timestamp`, `is_sql` |
-| No deterministic groups | For every feature value with n≥50: conversion rate in [2%, 98%] |
-| Conversion rate | In [15%, 40%] |
-| Baseline LR AUC | In [0.65, 0.90] (all features except leakage trap) |
-| Leakage trap AUC boost | AUC with trap > AUC without trap by ≥0.03 |
-| Missingness per column | Each column with nulls: 1–15% missing |
-| Missingness structure | `web_sessions` missing rate for `sdr_outbound` > 3× rate for `inbound_marketing` |
-| Row count | Exactly 1,000 |
-| Column count | 16–18 (features + target) |
-| Reproducibility | Same seed → identical output |
-
-## v4 target column set
-
-| # | Column | Type | Source | Notes |
-|---|---|---|---|---|
-| 1 | `industry` | categorical | account | 4 values |
-| 2 | `region` | categorical | account | US, UK |
-| 3 | `company_size` | categorical | account | 4 bands |
-| 4 | `company_revenue` | categorical | account | 4 bands |
-| 5 | `contact_role` | categorical | contact | 4 roles |
-| 6 | `seniority` | categorical | contact | 5 levels (~8% missing for partner_referral) |
-| 7 | `lead_source` | categorical | lead | 3 channels |
-| 8 | `opportunity_created` | binary 0/1 | derived | Opp opened by snapshot day |
-| 9 | `demo_completed` | binary 0/1 | derived | Demo done by snapshot day |
-| 10 | `expected_acv` | numeric | derived | Opp ACV if available, else band midpoint (R1) |
-| 11 | `inbound_touches` | integer | events ≤ snapshot | Inbound touchpoints |
-| 12 | `outbound_touches` | integer | events ≤ snapshot | Outbound touchpoints |
-| 13 | `touches_week_1` | integer | events ≤ day 7 | First-week touch intensity (R2) |
-| 14 | `web_sessions` | integer | events ≤ snapshot | Sessions (~15% missing for outbound, ~2% inbound) |
-| 15 | `sales_activities` | integer | events ≤ snapshot | Sales activities count |
-| 16 | `days_since_last_touch` | float | events ≤ snapshot | Natural NaN when no touches |
-| 17 | `total_touches_all` | integer | **ALL events** | ⚠️ LEAKAGE TRAP — uses full 90-day window |
-| 18 | `converted` | binary 0/1 | target | Converted within 90 days |
-
-Total: 17 features + 1 target = 18 columns.
-
-## Non-goals for v4
-
-- v4 does NOT require engine changes to the simulation loop itself (stage transitions, churn, conversion hazard).
-- v4 does NOT change the relational bundle format or task splits.
-- v4 does NOT require a new recipe — it uses `b2b_saas_procurement_v1` with adjusted difficulty tuning.
-- v4 does NOT need to change the `student_public` / `research_instructor` exposure modes.
diff --git a/docs/v4/planning_pr_review.md b/docs/v4/planning_pr_review.md
new file mode 100644
index 0000000..3290d34
--- /dev/null
+++ b/docs/v4/planning_pr_review.md
@@ -0,0 +1,67 @@
+# PR #19 Self-Review — Critical Assessment
+
+## Point 1: Unvalidated numbers (`category_effect_scale`, AUC target)
+
+**Issue:** `category_effect_scale: 1.8` appears in three docs as a known-good value, but has zero empirical backing. Similarly, `snapshot_day=21` gave AUC 0.622 in v3 — below the 0.65 floor mandated here. The plan has a circular dependency: AUC target assumes the engine change works, engine change spec assumes AUC target is reachable.
+
+**Treatment:** Run a spike experiment before merging. Patch `category_effect_scale` into mechanism policy, generate a 5000-lead bundle at day-21 snapshot, measure category spread + LR AUC. Record results in engine_changes_spec. Adjust the number before it's enshrined across multiple documents.
+
+---
+
+## Point 2: Five overlapping spec documents
+
+**Issue:** Five docs (~500 lines) to describe ~550 lines of implementation. Significant content overlap — "18 columns" appears in requirements, contract, AND validation spec. Missingness rates appear in three places. Changing one number means updating 3 files.
+
+**Treatment:** Consolidate into two docs: one design doc (`docs/v4/design.md`) covering requirements, contract, and engine changes; one validation spec (`docs/v4/validation_spec.md`). Use a single source of truth for shared constants (column list, missingness rates, AUC bounds).
+
+---
+
+## Point 3: "No sim loop changes" masks a known bug
+
+**Issue:** Repeatedly emphasizing "no changes to engine.py" as a feature, when the actual problem — `is_sql=False → 0% conversion` creating deterministic groups — lives there. The plan designs around a bug it refuses to fix, but doesn't state that clearly.
+
+**Treatment:** Add an explicit "Known Limitations & Workarounds" section to the design doc. State plainly: v4's column set excludes `reached_sql` and `has_opportunity` because `is_sql=False → 0% conversion` is a simulation invariant we're choosing not to fix here. Link to a tracked issue for the future engine fix. Don't hide it in a deferred-items table.
+
+---
+
+## Point 4: Arbitrary missingness rates
+
+**Issue:** Missingness rates (15%/2%/5% for web_sessions, 8%/1% for seniority) are unjustified. The validation check ("outbound > 3× inbound") is a tautology given the hardcoded rates.
+
+**Treatment:** State the pedagogical rationale: rates must be detectable at n=1000 with a chi-squared test at p<0.01, but not so extreme students can't impute. Validate this claim in the spike experiment. Acknowledge these are tunable, not ground truth.
+
+---
+
+## Point 5: No failure mode handling
+
+**Issue:** The plan assumes every parameter works on the first try. No guidance for what to do when AUC is too low/high, leakage trap doesn't boost, or subsampling destroys signal. For a dataset that is on its fourth version because v1–v3 each had unforeseen problems, this is remarkably optimistic.
+
+**Treatment:** Add a "Tuning Protocol" decision tree:
+- AUC < 0.65 → increase `category_effect_scale` (try 2.0/2.5/3.0)
+- AUC > 0.90 → decrease scale or add noise
+- Leakage trap boost < 0.03 → widen the gap between snapshot day and the 90-day horizon (e.g., snapshot at day 14 instead of 21)
+- Subsampling destroys signal → increase n_leads from 5000 to 10000
+
+---
+
+## Point 6: AGENTS.md will rot
+
+**Issue:** v4-specific content (file-per-milestone lists, validation checklist, testing commands) hardcoded into a permanent repo doc. Becomes stale noise after v4 ships.
+
+**Treatment:** Move v4-specific content into `docs/v4/` only. AGENTS.md keeps durable conventions plus a single pointer: "For v4 implementation details, see `docs/v4/`." Delete v4 content from AGENTS.md after v4 ships — or don't put it there in the first place.
+
+---
+
+## Point 7: `expected_acv` underspecified
+
+**Issue:** "Opportunity ACV if opp created by snapshot; else band midpoint" — but what's the midpoint of "$100M+"? What if band is null? One table row in the longest spec doc.
+
+**Treatment:** Define band→midpoint mapping as an explicit lookup table in the design doc. Specify null-band behavior (population median or NaN).
+
+---
+
+## Point 8: M1/M2 coupling is too rigid
+
+**Issue:** M1 (engine knob) and M2 (build pipeline) are tightly coupled — M1 can't be validated without M2's build script. "Strictly sequential" milestones pretend otherwise. In practice, both will be developed with feedback loops.
+
+**Treatment:** Merge M1 and M2 into a single milestone with two deliverables. One PR with both the engine knob and the build script, validated end-to-end, is more honest and more reviewable.
diff --git a/docs/v4/validation_spec.md b/docs/v4/validation_spec.md
index 4c79c62..33158b0 100644
--- a/docs/v4/validation_spec.md
+++ b/docs/v4/validation_spec.md
@@ -1,5 +1,7 @@
 # v4 Validation Specification
 
+> Companion to `design.md`. Column set, missingness rates, and AUC targets are defined there — this doc specifies the automated checks.
+
 ## Overview
 
 v4 validation operates at two levels:
diff --git a/scripts/spike_category_signal.py b/scripts/spike_category_signal.py
new file mode 100644
index 0000000..27ba972
--- /dev/null
+++ b/scripts/spike_category_signal.py
@@ -0,0 +1,259 @@
+#!/usr/bin/env python3
+"""Spike experiment: measure category → conversion signal under different settings.
+
+Tests:
+1. Baseline (current engine): expect near-zero category signal
+2. Correlated observables (1.0x boost): seniority/revenue/source → latent traits
+3. Correlated observables (1.8x and 2.5x boost): stronger correlation
+
+Reports category spread (max - min conversion rate) per categorical feature
+and logistic regression AUC at day-21 snapshot.
+"""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import roc_auc_score
+from sklearn.preprocessing import LabelEncoder
+
+# Ensure the package is importable.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from leadforge.api.generator import Generator
+from leadforge.render.snapshots import build_snapshot
+from leadforge.simulation.engine import simulate_world
+from leadforge.simulation.population import PopulationResult, build_population
+from leadforge.structure.sampler import sample_hidden_graph
+
+SEED = 42
+N_LEADS = 5000
+SUBSAMPLE_N = 1000
+TARGET_RATE = 0.30
+CAT_FEATURES = [
+    "industry",
+    "region",
+    "estimated_revenue_band",
+    "role_function",
+    "seniority",
+    "lead_source",
+]
+
+# Base boosts (scale=1.0)
+SENIORITY_BOOST = {
+    "individual_contributor": -0.15,
+    "manager": -0.05,
+    "director": 0.05,
+    "vp": 0.12,
+    "c_suite": 0.20,
+}
+REVENUE_BOOST = {
+    "$1M-$10M": -0.10,
+    "$10M-$50M": 0.0,
+    "$50M-$200M": 0.10,
+    "$200M+": 0.18,
+}
+SOURCE_BOOST = {
+    "partner_referral": 0.12,
+    "inbound_marketing": 0.05,
+    "sdr_outbound": -0.08,
+}
+
+
+def subsample(df: pd.DataFrame, rng: np.random.RandomState) -> pd.DataFrame:
+    """Stratified subsample to SUBSAMPLE_N rows at TARGET_RATE conversion."""
+    positives = df[df["converted_within_90_days"].astype(bool)]
+    negatives = df[~df["converted_within_90_days"].astype(bool)]
+    n_pos = int(SUBSAMPLE_N * TARGET_RATE)
+    n_neg = SUBSAMPLE_N - n_pos
+
+    if len(positives) < n_pos:
+        print(f"  WARNING: only {len(positives)} positives, need {n_pos}")
+        n_pos = len(positives)
+    if len(negatives) < n_neg:
+        print(f"  WARNING: only {len(negatives)} negatives, need {n_neg}")
+        n_neg = len(negatives)
+
+    pos_sample = positives.sample(n=n_pos, random_state=rng)
+    neg_sample = negatives.sample(n=n_neg, random_state=rng)
+    return pd.concat([pos_sample, neg_sample]).sample(frac=1, random_state=rng)
+
+
+def measure_category_spread(df: pd.DataFrame) -> dict[str, dict]:
+    """Conversion rate spread for groups with n >= 50, plus per-value detail."""
+    results = {}
+    for col in CAT_FEATURES:
+        if col not in df.columns:
+            continue
+        stats = df.groupby(col)["converted_within_90_days"].agg(["mean", "count"])
+        large = stats[stats["count"] >= 50]
+        spread = float(large["mean"].max() - large["mean"].min()) if len(large) >= 2 else 0.0
+        # Show per-value rates for groups with n >= 30
+        detail = stats[stats["count"] >= 30].sort_values("mean", ascending=False)
+        results[col] = {
+            "spread": spread,
+            "detail": {str(v): (f"{r['mean']:.1%}", int(r["count"])) for v, r in detail.iterrows()},
+        }
+    return results
+
+
+def measure_auc(df: pd.DataFrame) -> float:
+    """In-sample (train) LR AUC on numeric and label-encoded snapshot features."""
+    feature_cols = [c for c in df.columns if c != "converted_within_90_days"]
+    x_df = df[feature_cols].copy()
+    y = df["converted_within_90_days"].astype(int)
+
+    for col in x_df.select_dtypes(include=["object", "category"]).columns:
+        le = LabelEncoder()
+        x_df[col] = le.fit_transform(x_df[col].astype(str))
+
+    x_df = x_df.select_dtypes(include=[np.number])
+    x_df = x_df.fillna(x_df.median())
+
+    lr = LogisticRegression(max_iter=1000, random_state=SEED)
+    lr.fit(x_df, y)
+    probs = lr.predict_proba(x_df)[:, 1]
+    return float(roc_auc_score(y, probs))
+
+
+def patch_population(pop: PopulationResult, scale: float = 1.0) -> None:
+    """Correlate observable categories with latent traits (in place, clamped to [0, 1])."""
+    # Seniority → latent_contact_authority
+    for contact in pop.contacts:
+        cid = contact.contact_id
+        if cid in pop.latent_state.contact_latents:
+            boost = SENIORITY_BOOST.get(contact.seniority, 0.0) * scale
+            traits = pop.latent_state.contact_latents[cid]
+            traits["latent_contact_authority"] = max(
+                0.0, min(1.0, traits["latent_contact_authority"] + boost)
+            )
+
+    # Revenue band → latent_account_fit
+    for account in pop.accounts:
+        aid = account.account_id
+        if aid in pop.latent_state.account_latents:
+            boost = REVENUE_BOOST.get(account.estimated_revenue_band, 0.0) * scale
+            traits = pop.latent_state.account_latents[aid]
+            traits["latent_account_fit"] = max(0.0, min(1.0, traits["latent_account_fit"] + boost))
+
+    # Lead source → latent_engagement_propensity
+    for lead in pop.leads:
+        cid = lead.contact_id
+        if cid in pop.latent_state.contact_latents:
+            boost = SOURCE_BOOST.get(lead.lead_source, 0.0) * scale
+            traits = pop.latent_state.contact_latents[cid]
+            traits["latent_engagement_propensity"] = max(
+                0.0, min(1.0, traits["latent_engagement_propensity"] + boost)
+            )
+
+
+def run_pipeline(label: str, gen: Generator, scale: float | None = None) -> float:
+    """Generate, optionally patch, simulate, snapshot, subsample, measure."""
+    print(f"\n{'=' * 60}")
+    print(f"  {label}")
+    print(f"{'=' * 60}")
+
+    config = gen._world_spec.config
+    narrative = gen._world_spec.narrative
+    if narrative is None:
+        raise RuntimeError("No narrative loaded")
+
+    world_graph = sample_hidden_graph(config.seed)
+    print(f"  Motif family: {world_graph.motif_family}")
+
+    pop = build_population(config, narrative, world_graph)
+
+    if scale is not None:
+        patch_population(pop, scale=scale)
+
+    sim = simulate_world(config, pop, world_graph)
+    snapshot = build_snapshot(sim, pop)
+
+    raw_rate = snapshot["converted_within_90_days"].mean()
+    print(f"  Raw conversion rate: {raw_rate:.1%} (n={len(snapshot)})")
+
+    rng = np.random.RandomState(SEED)
+    df = subsample(snapshot, rng)
+    actual_rate = df["converted_within_90_days"].mean()
+    print(f"  Subsampled: n={len(df)}, conversion={actual_rate:.1%}")
+
+    results = measure_category_spread(df)
+    print("\n  Category spreads (groups n>=50):")
+    for feat in CAT_FEATURES:
+        if feat not in results:
+            continue
+        info = results[feat]
+        print(f"\n    {feat}: spread={info['spread']:.1%}")
+        for val, (rate, n) in info["detail"].items():
+            marker = "*" if n >= 50 else " "
+            print(f"      {marker} {str(val):30s} rate={rate:.1%} n={n}")
+
+    auc = measure_auc(df)
+    print(f"\n  Logistic regression AUC (train): {auc:.3f}")
+    return auc
+
+
+def main() -> None:
+    results: dict[str, float] = {}
+
+    # One fresh Generator per experiment; scale=None runs the unpatched baseline.
+    experiments: list[tuple[str, str, float | None]] = [
+        ("baseline", "BASELINE (current engine)", None),
+        ("scale_1.0", "PATCHED scale=1.0", 1.0),
+        ("scale_1.8", "PATCHED scale=1.8", 1.8),
+        ("scale_2.5", "PATCHED scale=2.5", 2.5),
+    ]
+    for key, label, scale in experiments:
+        gen = Generator.from_recipe(
+            "b2b_saas_procurement_v1",
+            seed=SEED,
+            exposure_mode="research_instructor",
+            n_leads=N_LEADS,
+            difficulty="intro",
+        )
+        results[key] = run_pipeline(label, gen, scale=scale)
+
+    # Summary
+    print(f"\n{'=' * 60}")
+    print("  SUMMARY")
+    print(f"{'=' * 60}")
+    for label, auc in results.items():
+        print(f"  {label:<30s} AUC={auc:.3f}")
+
+    print()
+    print("  KEY FINDING: The spec's approach of scaling CategoricalInfluence")
+    print("  weights in LatentScore is incorrect — CategoricalInfluence is")
+    print("  not used in the conversion score. The correct approach is to")
+    print("  correlate observable categories with latent traits during")
+    print("  population generation.")
+
+
+if __name__ == "__main__":
+    main()