From 1f45df09976b1cf10e5cca6a0a366a3270ff4364 Mon Sep 17 00:00:00 2001
From: Shay Palachy <shaypal5@users.noreply.github.com>
Date: Wed, 29 Apr 2026 15:40:48 +0300
Subject: [PATCH] plan: v4 lead scoring dataset + leadforge engine roadmap
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add comprehensive v4 planning docs, updated agent instructions, and
revised project roadmap driven by dataset needs from v1–v3 iterations.

Docs added:
- docs/v4/lead_scoring_v4_requirements.md — 7 requirements (value features,
  temporal momentum, structured missingness, leakage trap, redundancy fix,
  stronger category signal, robust validation)
- docs/v4/dataset_contract.md — schema contract, temporal gates, missingness
  patterns, subsampling rules
- docs/v4/engine_changes_spec.md — category effect scaling, windowed snapshot
  builder, structured missingness, leakage trap feature
- docs/v4/validation_spec.md — 8 mandatory checks + 3 warning checks
- docs/v4/implementation_plan.md — 4 milestones (M0–M3) with acceptance
  criteria and explicit mapping from existing roadmap items

Updated:
- CLAUDE.md — added repo map, generation workflow, student_public invariants,
  feature addition guide, v4 plan pointers
- AGENTS.md — added v4 implementation guide, coding conventions, validation
  checklist, local testing commands
- .agent-plan.md — v4 milestones as next work; M12–M15 items explicitly
  triaged (deferred/absorbed/discarded)

No code changes. All 590 existing tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 .agent-plan.md                          |  95 ++++++++++---
 AGENTS.md                               |  75 +++++++++++
 CLAUDE.md                               | 146 ++++++++++++++++++++
 docs/v4/dataset_contract.md             |  72 ++++++++++
 docs/v4/engine_changes_spec.md          | 170 +++++++++++++++++++++++
 docs/v4/implementation_plan.md          | 172 ++++++++++++++++++++++++
 docs/v4/lead_scoring_v4_requirements.md | 131 ++++++++++++++++++
 docs/v4/validation_spec.md              | 108 +++++++++++++++
 8 files changed, 949 insertions(+), 20 deletions(-)
 create mode 100644 docs/v4/dataset_contract.md
 create mode 100644 docs/v4/engine_changes_spec.md
 create mode 100644 docs/v4/implementation_plan.md
 create mode 100644 docs/v4/lead_scoring_v4_requirements.md
 create mode 100644 docs/v4/validation_spec.md

diff --git a/.agent-plan.md b/.agent-plan.md
index 1139182..fd215c5 100644
--- a/.agent-plan.md
+++ b/.agent-plan.md
@@ -10,22 +10,90 @@
 
 ---
 
-## Next Up — Milestone 12: CLI polish + JSON output (v0.5.0)
+## Next Up — v4 Lead Scoring Dataset
 
-Goal: Polish CLI commands with JSON output mode, richer help text, and progress feedback.
+The primary focus is producing a v4 lead scoring dataset that fixes the issues found in v1–v3 datasets. This requires targeted engine changes followed by dataset build scripts.
 
-- [ ] Add `--json` flag to `inspect` and `validate` for machine-readable output
-- [ ] Add `--strict` flag to `validate` to control whether realism checks are errors vs warnings
-- [ ] Improve CLI help text and error messages
-- [ ] Tests for JSON output mode
+See `docs/v4/implementation_plan.md` for full details.
+
+### v4-M0: Requirements + planning ⬜ (this PR)
+
+- [x] `docs/v4/lead_scoring_v4_requirements.md`
+- [x] `docs/v4/dataset_contract.md`
+- [x] `docs/v4/validation_spec.md`
+- [x] `docs/v4/engine_changes_spec.md`
+- [x] `docs/v4/implementation_plan.md`
+- [x] Updated `CLAUDE.md` with repo map + generation commands
+- [x] Updated `AGENTS.md` with v4 implementation guide
+- [x] Updated `.agent-plan.md` (this file)
+
+### v4-M1: Engine — category signal + windowed snapshots ⬜
+
+- [ ] Add `category_effect_scale` to difficulty profiles
+- [ ] Apply scale in `mechanisms/policies.py`
+- [ ] Add `snapshot_day` parameter to `render/snapshots.py`
+- [ ] Add new features: `touches_week_1`, `days_since_first_touch`, `expected_acv`
+- [ ] Add new `FeatureSpec` entries to `schema/features.py`
+- [ ] Tests for all changes
+- [ ] Verify category spread at intro difficulty: ≥15% for `contact_role` (per-feature targets in `docs/v4/engine_changes_spec.md`)
+
+### v4-M2: Build pipeline + validation ⬜
+
+- [ ] `scripts/build_v4_snapshot.py` — day-21 snapshot + leakage trap + structured missingness
+- [ ] `scripts/validate_v4_dataset.py` — full validation per `docs/v4/validation_spec.md`
+- [ ] Generate test dataset and verify all checks pass
+- [ ] LR AUC 0.65–0.90 (without trap); ≥0.03 boost with trap
+
+### v4-M3: Documentation + release ⬜
+
+- [ ] Generate `lead_scoring_intro_v4.csv` (in datasets-private repo)
+- [ ] Write `RELEASE_v4.md`
+- [ ] Update dataset repo README
+- [ ] Update `.agent-plan.md` to reflect completion
+
+---
+
+## Deferred Items
+
+### From existing roadmap (M12–M15)
+
+| Item | Status | Rationale |
+|---|---|---|
+| M12: CLI `--json` flag | Deferred | No consumer needs it yet; add post-v4 |
+| M12: CLI `--strict` flag | Deferred | Per-check control is better than global flag |
+| M12: CLI help text polish | Deferred | Low priority vs dataset |
+| M14: Sample bundle commit | Absorbed into v4-M3 | v4 dataset IS the sample |
+| M14: Notebook 1 (inspecting world) | Deferred | Do after v4 ships |
+| M14: Notebook 2 (lead scoring baseline) | Deferred | v4 validation script covers this |
+| M14: Notebook 3 (public vs instructor) | Discarded | No current audience |
+| M14: Notebook 4 (recipe customization) | Discarded | Premature |
+| M15: Docs polish + v1.0 RC | Deferred | Do after v4 ships |
+
+### From post-v1 list
+
+- Second vertical
+- LTV labels as first-class task outputs
+- Continuous-time / richer event engine
+- Plugin architecture
+- External-API enrichment
+- Web UI or dashboard
+- Engine fix: `is_sql=False` → never converts (deterministic invariant)
 
 ---
 
 ## Context Pointers
 
-- Milestone 12 scope: `docs/leadforge_implementation_plan.md` §10 "Milestone 12"
+- v4 requirements: `docs/v4/lead_scoring_v4_requirements.md`
+- v4 dataset contract: `docs/v4/dataset_contract.md`
+- v4 engine changes: `docs/v4/engine_changes_spec.md`
+- v4 validation spec: `docs/v4/validation_spec.md`
+- v4 implementation plan: `docs/v4/implementation_plan.md`
+- Existing roadmap: `docs/leadforge_implementation_plan.md`
 - CLI commands: `leadforge/cli/commands/`
 - Validation modules: `leadforge/validation/`
+- Snapshot builder: `leadforge/render/snapshots.py`
+- Mechanism policy: `leadforge/mechanisms/policies.py`
+- Difficulty profiles: `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml`
 
 ---
 
@@ -156,16 +224,3 @@ Goal: Polish CLI commands with JSON output mode, richer help text, and progress
 - `leadforge/recipes/`: registry + `b2b_saas_procurement_v1/recipe.yaml`
 - `.github/workflows/ci.yml`: lint, typecheck, test matrix (3.11 + 3.12) with coverage upload
 - 20 tests passing; ruff + mypy clean
-
----
-
-## Deferred (Post-v1)
-
-- Second vertical
-- LTV labels as first-class task outputs
-- Continuous-time / richer event engine
-- Plugin architecture
-- External-API enrichment
-- Web UI or dashboard
-
-See `docs/leadforge_implementation_plan.md` §10 for the full deferral list.
diff --git a/AGENTS.md b/AGENTS.md
index 0827d08..d67ced5 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -25,3 +25,78 @@ Do **not** leave threads unresolved after the commit is pushed.
 ## Branch & PR Conventions
 
 See CLAUDE.md for the full mandatory branch/PR workflow (branch → commit → update `.agent-plan.md` → open PR).
+
+---
+
+## v4 Implementation Guide
+
+### What is v4?
+
+A pedagogically improved lead scoring dataset (single CSV) for an intro ML course. The engine changes are small and targeted. See `docs/v4/` for full specs.
+
+### Implementation order
+
+```
+v4-M0 (planning PR — already done)
+  └── v4-M1: engine changes (category signal + windowed snapshots)
+        └── v4-M2: build pipeline + validation scripts
+              └── v4-M3: dataset generation + release docs
+```
+
+### Key files to modify per milestone
+
+**v4-M1 (engine):**
+- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_effect_scale`
+- `leadforge/mechanisms/policies.py` — apply scale to categorical influences
+- `leadforge/render/snapshots.py` — add `snapshot_day` param, windowed aggregation
+- `leadforge/schema/features.py` — add new FeatureSpec entries
+- Tests in `tests/mechanisms/` and `tests/render/`
+
+**v4-M2 (build pipeline):**
+- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
+- `scripts/validate_v4_dataset.py` (new) — dataset-level validation
+- These live in the leadforge repo (not datasets-private)
+
+**v4-M3 (release):**
+- Work in `leadforge-datasets-private` repo
+- `lead_scoring_intro/lead_scoring_intro_v4.csv`
+- `lead_scoring_intro/RELEASE_v4.md`
+
+### Coding conventions for v4
+
+1. **Backward compatibility:** All engine changes must default to current behavior. New parameters must have defaults that produce identical output when unset.
+2. **No simulation loop changes:** Do not modify the daily step logic in `engine.py`. v4 changes are in mechanism weights and snapshot rendering only.
+3. **Temporal correctness:** Every feature computation must be explicitly gated by snapshot day. Use `event_timestamp <= lead_created_at + snapshot_day` — never `<`.
+4. **Test coverage:** Every new parameter and feature must have unit tests. Test both `snapshot_day=None` (backward compat) and `snapshot_day=21` (v4 mode).
+5. **Determinism:** All new stochastic operations must use seeded RNG. Verify with a determinism test (same seed → identical output).
+
+### Validation checklist for v4 dataset
+
+Before declaring v4-M2 complete, the dataset must pass:
+
+- [ ] 1,000 rows, 18 columns
+- [ ] 30% conversion rate (±1%)
+- [ ] No deterministic groups (n≥50 at 0% or 100% conversion)
+- [ ] LR AUC 0.65–0.90 (without leakage trap)
+- [ ] LR AUC boost ≥0.03 when leakage trap included
+- [ ] `web_sessions` missingness: outbound rate > 3× inbound rate
+- [ ] `seniority` missingness: partner_referral rate > 3× others
+- [ ] Reproducible with seed 42
+- [ ] `total_touches_all` uses full 90-day data (confirmed by AUC boost)
+
+### How to test engine changes locally
+
+```bash
+# Quick smoke test: generate a small bundle and inspect
+leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 --difficulty intro --n-leads 1000 --out /tmp/test_bundle
+leadforge validate /tmp/test_bundle
+
+# Check category signal spread
+python -c "
+import pandas as pd
+df = pd.read_parquet('/tmp/test_bundle/tasks/converted_within_90_days/train.parquet')
+for col in ['role_function', 'seniority', 'estimated_revenue_band']:
+    rates = df.groupby(col)['converted_within_90_days'].mean()
+    print(f'{col}: spread={rates.max()-rates.min():.1%}')
+"
+```
diff --git a/CLAUDE.md b/CLAUDE.md
index 0a9a5dc..1e3ae7e 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -212,7 +212,153 @@ Key abstractions: `Recipe`, `GenerationConfig`, `WorldSpec`, `WorldBundle`, `Exp
 
 ---
 
+## Repository Map
+
+```
+leadforge/                    # Python package root
+├── api/                      # Public API: Generator, Recipe, Bundle
+│   ├── generator.py          # Generator.from_recipe() → .generate() → WorldBundle
+│   ├── recipes.py            # Recipe loading, config resolution
+│   └── bundle.py             # write_bundle() orchestrator
+├── cli/                      # Click CLI
+│   ├── main.py               # CLI entry point
+│   └── commands/             # generate, inspect, validate, list_recipes
+├── core/                     # Foundational utilities
+│   ├── rng.py                # RNGRoot with named substreams
+│   ├── ids.py                # Deterministic ID generation (acct_000001, etc.)
+│   ├── models.py             # GenerationConfig, WorldSpec, WorldBundle
+│   ├── enums.py              # ExposureMode, DifficultyProfile
+│   └── exceptions.py         # Custom exception hierarchy
+├── narrative/                # Vertical narrative (company, market, personas)
+│   ├── spec.py               # NarrativeSpec and sub-spec dataclasses
+│   └── dataset_card.py       # Markdown dataset card renderer
+├── schema/                   # Relational data model
+│   ├── entities.py           # 9 entity row dataclasses (AccountRow, LeadRow, etc.)
+│   ├── features.py           # LEAD_SNAPSHOT_FEATURES — canonical feature spec
+│   ├── relationships.py      # FK constraints (ALL_CONSTRAINTS)
+│   ├── tasks.py              # SplitSpec, TaskManifest, CONVERTED_WITHIN_90_DAYS
+│   └── dictionaries.py       # Feature dictionary CSV writer
+├── structure/                # Hidden world graph
+│   ├── graph.py              # WorldGraph (DAG wrapper)
+│   ├── motifs.py             # 5 motif families
+│   ├── rewiring.py           # Stochastic graph perturbation
+│   └── sampler.py            # sample_hidden_graph()
+├── mechanisms/               # Node/edge behavior
+│   ├── policies.py           # assign_mechanisms() — motif → MechanismAssignment
+│   ├── hazards.py            # ConversionHazard
+│   ├── transitions.py        # StageSequence, HazardTransition
+│   ├── counts.py             # PoissonIntensity, RecencyDecayIntensity
+│   ├── categorical.py        # CategoricalInfluence, CHANNEL_QUALITY_SCORES
+│   └── scores.py             # LatentScore
+├── simulation/               # World evolution
+│   ├── engine.py             # simulate_world() — 90-day daily loop
+│   ├── state.py              # LeadSimState (per-lead mutable state)
+│   └── population.py         # build_population() — accounts, contacts, leads
+├── render/                   # Bundle output
+│   ├── snapshots.py          # build_snapshot() — ML-ready lead table
+│   ├── relational.py         # to_dataframes() — 9-table dict
+│   ├── tasks.py              # write_task_splits() — train/valid/test Parquet
+│   └── manifests.py          # build_manifest(), write_manifest()
+├── exposure/                 # Truth filtering
+│   ├── modes.py              # apply_exposure() dispatch
+│   ├── metadata.py           # write_metadata_dir() for instructor mode
+│   └── filters.py            # BundleFilter, FILTERS dict
+├── validation/               # Bundle quality checks
+│   ├── bundle_checks.py      # validate_bundle() orchestrator
+│   ├── invariants.py         # Determinism + exposure monotonicity
+│   ├── realism.py            # Conversion rates, feature ranges, stage diversity
+│   ├── difficulty.py         # Known difficulty profile validation
+│   └── drift.py              # Cross-seed stability
+└── recipes/                  # Recipe definitions
+    └── b2b_saas_procurement_v1/
+        ├── recipe.yaml       # Recipe metadata + defaults
+        ├── narrative.yaml    # Company, product, market, personas, funnel
+        └── difficulty_profiles.yaml  # intro/intermediate/advanced
+```
+
+### Related repos
+
+- **leadforge-datasets-private** — generated dataset archive
+  - `b2b_saas_procurement_v1__intro__seed42/` — full relational bundle
+  - `lead_scoring_intro/` — simplified single-CSV versions (v1–v4)
+  - `scripts/` — build and validation scripts for simplified CSVs
+
+---
+
+## Generation Workflow
+
+### Generate a full bundle
+
+```bash
+leadforge generate \
+  --recipe b2b_saas_procurement_v1 \
+  --seed 42 \
+  --mode student_public \
+  --difficulty intro \
+  --n-leads 5000 \
+  --out ./out/bundle
+```
+
+### Build a simplified CSV (v4 example)
+
+```bash
+# In leadforge-datasets-private repo:
+python scripts/build_v4_snapshot.py /path/to/bundle lead_scoring_intro/lead_scoring_intro_v4.csv
+```
+
+### Validate a simplified CSV
+
+```bash
+python scripts/validate_v4_dataset.py lead_scoring_intro/lead_scoring_intro_v4.csv
+```
+
+### Validate a full bundle
+
+```bash
+leadforge validate ./out/bundle
+```
+
+---
+
+## student_public Mode Invariants
+
+These are non-negotiable for any dataset published in `student_public` mode:
+
+1. **No post-snapshot features** — all features computed from events ≤ snapshot day only.
+2. **No outcome-stage columns** — `current_stage`, `funnel_stage` with `closed_won`/`closed_lost` are banned.
+3. **No deterministic single-feature mapping** — for any feature value with n≥50, conversion rate must be in [2%, 98%].
+4. **No hidden truth** — latent scores, mechanism parameters, world graph not included.
+5. **No direct outcome columns** — `conversion_timestamp`, `close_outcome` are banned.
+6. **No zero-variance features** — every included feature must have ≥2 distinct values.
+
+Exception: deliberately included leakage traps (e.g., `total_touches_all` in v4) must be clearly documented in release notes and feature dictionary.
+
+---
+
+## How to Add New Features to the Snapshot
+
+1. Add a `FeatureSpec` entry to `LEAD_SNAPSHOT_FEATURES` in `leadforge/schema/features.py`.
+2. Compute the feature value in `build_snapshot()` in `leadforge/render/snapshots.py`.
+3. If the feature needs new event data, add it to the simulation loop in `leadforge/simulation/engine.py`.
+4. Update `leadforge/schema/dictionaries.py` if the feature dictionary format changes.
+5. Run `pytest` and `leadforge validate` on a generated bundle.
+6. Update the feature dictionary CSV description.
+
+---
+
+## v4 Dataset Plan
+
+The current focus is producing a v4 lead scoring intro dataset. See `docs/v4/` for:
+- `lead_scoring_v4_requirements.md` — what v4 must achieve
+- `dataset_contract.md` — schema contract and temporal gates
+- `engine_changes_spec.md` — what changes in the engine
+- `validation_spec.md` — automated validation checks
+- `implementation_plan.md` — milestone breakdown
+
+---
+
 ## Reference Docs
 - Design decisions: `docs/leadforge_design_doc.md`
 - Architecture/spec: `docs/leadforge_architecture_spec.md`
 - Implementation roadmap: `docs/leadforge_implementation_plan.md`
+- v4 dataset plan: `docs/v4/implementation_plan.md`
diff --git a/docs/v4/dataset_contract.md b/docs/v4/dataset_contract.md
new file mode 100644
index 0000000..4b7cb0a
--- /dev/null
+++ b/docs/v4/dataset_contract.md
@@ -0,0 +1,72 @@
+# v4 Dataset Contract
+
+## Snapshot definition
+
+- **Snapshot day:** Day 21 after `lead_created_at` (configurable, default 21).
+- **Observation window:** Days 0–21 inclusive. All features computed from events in this window only.
+- **Prediction horizon:** Days 22–90. The target `converted` reflects whether `closed_won` occurs in the full 90-day window.
+- **Temporal guarantee:** No feature (except the explicitly marked leakage trap) uses information from after the snapshot day.
+
+## What is pre-snapshot (valid for features)
+
+| Data source | Temporal gate |
+|---|---|
+| Account attributes | Static — always valid |
+| Contact attributes | Static — always valid |
+| Lead metadata (source, etc.) | Lead creation — always valid |
+| Touch events | `touch_timestamp ≤ lead_created_at + snapshot_day` |
+| Session events | `session_timestamp ≤ lead_created_at + snapshot_day` |
+| Sales activity events | `activity_timestamp ≤ lead_created_at + snapshot_day` |
+| Opportunity records | `opportunity.created_at ≤ lead_created_at + snapshot_day` |
+| ACV estimates | From opportunity if available by snapshot; else account heuristic |
+
+## What is post-snapshot (invalid for features)
+
+| Data | Why invalid |
+|---|---|
+| `current_stage` at day 90 | Contains `closed_won` / `closed_lost` — outcome data |
+| `is_sql` (final state flag) | Engine invariant: `is_sql=False` → never converts. Deterministic. |
+| `conversion_timestamp` | Direct outcome information |
+| Touch/session/activity events after snapshot day | Future data |
+| Opportunity close outcome | Post-outcome |
+| `total_touches_all` | ⚠️ Intentional leakage trap — counts full 90-day touches |
+
+## Leakage trap contract
+
+The feature `total_touches_all` deliberately violates the snapshot boundary:
+- It counts touches over the **full 90-day simulation**, not just up to snapshot.
+- It is included to teach students about temporal leakage detection.
+- It must be clearly marked in the feature dictionary and release notes.
+- The validation script must detect it and flag it (but not fail the build).
+- Removing this feature should drop AUC by ≥0.03.
+
+## Missingness contract
+
+| Column | Pattern | Rate | Condition / rationale |
+|---|---|---|---|
+| `days_since_last_touch` | Structural | Determined by the data (not injected) | NaN when the lead has zero touches by snapshot day |
+| `web_sessions` | Source-conditional | ~15% for `sdr_outbound`, ~2% for `inbound_marketing`, ~5% for `partner_referral` | CRM tracking gaps |
+| `seniority` | Source-conditional | ~8% for `partner_referral`, ~1% for others | Referral partners omit contact details |
+| `days_since_last_touch` | Additional MCAR | ~3% | Random CRM logging gaps (on top of structural) |
+
+## Target definition
+
+```
+converted = 1  if lead reached closed_won within 90 days of lead_created_at
+converted = 0  otherwise (including closed_lost, still in funnel, churned)
+```
+
+The target is derived from simulated events, never directly sampled.
+
+## Subsampling contract
+
+- Source bundle: 5,000 leads generated with `b2b_saas_procurement_v1`, seed 42, difficulty intro.
+- Stratified subsampling to 1,000 rows at ~30% conversion rate.
+- Class quotas: about 700 negatives and 300 positives; any class that exceeds its quota is downsampled.
+- Subsampling preserves within-class feature distributions.
+
+## Reproducibility
+
+- Seed: 42 (or documented if changed).
+- All stochastic operations use `np.random.RandomState(seed)` or derived substreams.
+- Same (seed, recipe, leadforge version) → byte-identical CSV output.
diff --git a/docs/v4/engine_changes_spec.md b/docs/v4/engine_changes_spec.md
new file mode 100644
index 0000000..d3712ca
--- /dev/null
+++ b/docs/v4/engine_changes_spec.md
@@ -0,0 +1,170 @@
+# v4 Engine Changes Specification
+
+## Overview
+
+v4 requires **two categories** of changes to the leadforge codebase:
+1. **Mechanism / difficulty tuning** — make intro difficulty produce stronger category-level signal.
+2. **Snapshot builder enhancements** — compute windowed aggregates, ACV derivation, structured missingness, and the leakage trap feature.
+
+Neither category requires changes to the simulation loop itself (`engine.py`'s daily step logic). The simulation produces the same event stream; we change how features are derived from it.
+
+---
+
+## Change 1: Stronger category signal at intro difficulty
+
+### Problem
+
+The current mechanism policy (`mechanisms/policies.py`) produces conversion rates that are nearly uniform across categories at intro difficulty. For example, the spread for `contact_role` is only about 11 percentage points (25.6%–36.7% after subsampling). This yields LR AUC ~0.62, which is too low for a useful teaching dataset.
+
+### Root cause
+
+The `assign_mechanisms()` function builds a `LatentScore` whose weights are nearly flat across categories. The intro difficulty profile specifies `signal_strength: 0.90`, but this controls noise scale, not the magnitude of category effects.
+
+### Solution
+
+Add **category effect multipliers** to the difficulty profile YAML:
+
+```yaml
+intro:
+  # ... existing fields ...
+  category_effect_scale: 1.8  # amplify category → latent score effects
+```
+
+In `mechanisms/policies.py`, scale the `CategoricalInfluence` weights by `category_effect_scale` when building the `LatentScore`. This widens the gap between, say, `vp_finance` and `it_director` conversion rates without changing the overall noise structure.
+
+### Target outcome
+
+After this change and subsampling to a 30% conversion rate, the targets are:
+- `contact_role`: ≥15% spread
+- `company_revenue`: ≥12% spread
+- `seniority`: ≥10% spread
+- Baseline LR AUC: 0.70–0.85
+
+### Files affected
+
+- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_effect_scale`
+- `leadforge/mechanisms/policies.py` — use `category_effect_scale` when building categorical influences
+- `tests/mechanisms/test_policies.py` — test that different scales produce different spread
+
+### Risk
+
+Low. The change is additive (new config field with a default of 1.0 for backward compatibility). Existing tests continue to pass at `category_effect_scale=1.0`.
+
+---
+
+## Change 2: Snapshot builder — windowed aggregates and new features
+
+### Problem
+
+The current `render/snapshots.py` computes all aggregates over the full simulation horizon. v4 needs aggregates gated by a configurable snapshot day, plus new derived features.
+
+### Solution
+
+Add a new function or extend `build_snapshot()` to accept a `snapshot_day` parameter:
+
+```python
+def build_snapshot(
+    result: SimulationResult,
+    population: PopulationResult,
+    horizon_days: int = 90,
+    snapshot_day: int | None = None,  # NEW — default None means use horizon_days
+) -> pd.DataFrame: ...
+```
+
+When `snapshot_day` is set, all event aggregations filter to events within `[lead_created_at, lead_created_at + snapshot_day]`.
+
+### New features to compute
+
+| Feature | Computation | Notes |
+|---|---|---|
+| `touches_week_1` | Count touches where `days_after_creation ≤ 7` | Momentum signal |
+| `days_since_first_touch` | `snapshot_day - first_touch_day` (NaN if no touches) | Lead age signal |
+| `expected_acv` | Opportunity ACV if opp created by snapshot; else employee_band midpoint | Value feature |
+| `total_touches_all` | Count of ALL touches over full horizon (ignoring snapshot gate) | Leakage trap; computed by the build script (see Change 4) |
+
+### Files affected
+
+- `leadforge/render/snapshots.py` — add `snapshot_day` parameter, windowed filtering, new feature computations
+- `leadforge/schema/features.py` — add new `FeatureSpec` entries for the new columns
+- `tests/render/test_snapshots.py` — test windowed aggregation correctness
+
+### Risk
+
+Medium. The snapshot builder is well-tested but core to correctness. The `snapshot_day` parameter should be additive (default `None` preserves existing behavior). New features are computed alongside existing ones.
+
+---
+
+## Change 3: Structured missingness injection
+
+### Problem
+
+Current missingness is MCAR (random injection). v4 needs conditional missingness.
+
+### Solution
+
+Add a missingness injection step to the v4 build script (NOT to the engine's `build_snapshot`). This keeps the engine's output clean and makes missingness a dataset-packaging concern.
+
+The build script (`scripts/build_v4_snapshot.py`) applies missingness after snapshot construction:
+
+```python
+def inject_missingness(df: pd.DataFrame, rng: np.random.RandomState) -> pd.DataFrame:
+    # 1. web_sessions: 15% missing for sdr_outbound, 2% inbound, 5% partner
+    for source, rate in [("sdr_outbound", 0.15), ("inbound_marketing", 0.02), ("partner_referral", 0.05)]:
+        mask = (df["lead_source"] == source) & (rng.random(len(df)) < rate)
+        df.loc[mask, "web_sessions"] = np.nan
+
+    # 2. seniority: 8% missing for partner_referral, 1% for others
+    partner_mask = (df["lead_source"] == "partner_referral") & (rng.random(len(df)) < 0.08)
+    other_mask = (df["lead_source"] != "partner_referral") & (rng.random(len(df)) < 0.01)
+    df.loc[partner_mask | other_mask, "seniority"] = np.nan
+
+    # 3. days_since_last_touch: additional 3% MCAR on top of structural NaN
+    dslt_mask = rng.random(len(df)) < 0.03
+    df.loc[dslt_mask, "days_since_last_touch"] = np.nan
+
+    return df
+```
+
+### Files affected
+
+- `scripts/build_v4_snapshot.py` (new) — missingness injection
+- No changes to `leadforge/` core modules for missingness
+
+### Risk
+
+Low. Missingness is applied post-generation, outside the engine.
+
+---
+
+## Change 4: Leakage trap feature
+
+### Problem
+
+Students need a feature that looks valid but violates temporal boundaries.
+
+### Solution
+
+The v4 build script computes `total_touches_all` by counting ALL touches in the full 90-day window (not gated by snapshot day). This is computed alongside the snapshot but uses different temporal filtering.
+
+### Files affected
+
+- `scripts/build_v4_snapshot.py` — compute `total_touches_all` from full event stream
+- Feature dictionary and release notes — mark as leakage trap
+
+### Risk
+
+None to the engine. The trap is a build-script concern.
+
+---
+
+## Summary of engine vs. script changes
+
+| Change | Where | Risk |
+|---|---|---|
+| Category effect scaling | `leadforge/` core (mechanisms, difficulty profiles) | Low |
+| Snapshot `snapshot_day` parameter | `leadforge/` core (render/snapshots) | Medium |
+| New features (ACV, momentum, first_touch) | `leadforge/` core (render/snapshots, schema/features) | Medium |
+| Structured missingness | `scripts/` (build script only) | Low |
+| Leakage trap | `scripts/` (build script only) | None |
+
+Total engine-side changes: ~200–400 lines across 4–5 files. Build script: ~250 lines new.
diff --git a/docs/v4/implementation_plan.md b/docs/v4/implementation_plan.md
new file mode 100644
index 0000000..89b8238
--- /dev/null
+++ b/docs/v4/implementation_plan.md
@@ -0,0 +1,172 @@
+# v4 Implementation Plan
+
+## Overview
+
+This plan implements the v4 lead scoring dataset in 4 milestones across 4–6 PRs. Each milestone produces testable artifacts and has explicit acceptance criteria.
+
+## Relationship to existing roadmap
+
+v4 work slots into the existing leadforge roadmap as follows:
+
+| Existing milestone | Status | v4 interaction |
+|---|---|---|
+| M0–M11 | ✅ Complete | No changes needed |
+| M12 (CLI polish) | ⬜ Planned | **Deferred** — low priority vs v4 dataset needs. Integrate after v4. |
+| M13 (Validation harness) | ✅ Implemented as M11 | v4 extends with dataset-level validation |
+| M14 (Sample datasets + notebooks) | ⬜ Planned | **Absorbed into v4-M3** — v4 dataset IS the sample dataset |
+| M15 (Docs polish + v1.0 RC) | ⬜ Planned | **Deferred** — do after v4 ships |
+
+### Explicitly deferred or discarded items
+
+| Item | Status | Rationale |
+|---|---|---|
+| M12 `--json` flag for inspect/validate | Deferred | Nice-to-have; no dataset consumer needs it yet. Can add after v4. |
+| M12 `--strict` flag for validate | Deferred | Validation strictness is better controlled per-check, not globally. |
+| M14 Notebook 3 (public vs instructor comparison) | Discarded | No current audience for this; instructor mode is not used in the course. |
+| M14 Notebook 4 (recipe customization walkthrough) | Discarded | Premature: the recipe system is stable but not user-facing yet. |
+
+### Explicitly kept / integrated items
+
+| Item | How it maps to v4 |
+|---|---|
+| M14 Sample bundle generation | v4-M2 generates the source bundle |
+| M14 Lead-scoring baseline notebook | Covered by the v4-M2 validation script (`scripts/validate_v4_dataset.py`); the notebook itself is deferred |
+| M15 Docs audit | v4-M0 updates CLAUDE.md and AGENTS.md; v4-M3 produces RELEASE_v4.md |
+
+---
+
+## v4 Milestones
+
+### v4-M0: Requirements, contract, and agent instructions
+
+**Goal:** Establish the v4 dataset contract and update repo documentation so implementation can begin immediately.
+
+**Deliverables:**
+- `docs/v4/lead_scoring_v4_requirements.md` — full requirements
+- `docs/v4/dataset_contract.md` — schema contract, temporal gates, missingness
+- `docs/v4/validation_spec.md` — automated check specifications
+- `docs/v4/engine_changes_spec.md` — what changes where and why
+- `docs/v4/implementation_plan.md` — this file
+- Updated `CLAUDE.md` — repository map, generation/validation commands
+- Updated `AGENTS.md` — implementation conventions for v4 work
+- Updated `.agent-plan.md` — reflects v4 as next work
+
+**Acceptance criteria:**
+- [ ] All docs are internally consistent
+- [ ] CLAUDE.md contains repo map and commands
+- [ ] .agent-plan.md points to v4 milestones
+- [ ] No contradictions with existing architecture docs
+
+**PR:** This PR (the planning PR).
+
+---
+
+### v4-M1: Engine — category signal tuning + snapshot enhancements
+
+**Goal:** Make the engine produce datasets with stronger category signal and support windowed snapshot computation.
+
+**Deliverables:**
+1. `difficulty_profiles.yaml` — add `category_effect_scale: 1.8` to intro profile
+2. `mechanisms/policies.py` — apply `category_effect_scale` to categorical influence weights
+3. `render/snapshots.py` — add optional `snapshot_day` parameter for windowed aggregation
+4. `schema/features.py` — add `FeatureSpec` entries for new columns (`touches_week_1`, `days_since_first_touch`, `expected_acv`)
+5. Tests for all changes
+
+**Acceptance criteria:**
+- [ ] `category_effect_scale=1.0` produces identical output to current engine (backward compat)
+- [ ] `category_effect_scale=1.8` produces category spreads ≥15% for `contact_role`
+- [ ] `snapshot_day=21` correctly filters events to first 21 days
+- [ ] `touches_week_1` counts only days 0–7 touches
+- [ ] `expected_acv` uses opportunity ACV when available, else band midpoint
+- [ ] All existing tests pass
+- [ ] New tests cover the new parameters
+
+**Estimated size:** ~400 lines diff across the 4 files listed above + tests.
+
+**PR:** Single PR: `feat: v4 engine — category signal tuning + windowed snapshots`
+
+---
+
+### v4-M2: Build pipeline — v4 snapshot builder + structured missingness
+
+**Goal:** Create the v4 build script that transforms a generated bundle into the final CSV.
+
+**Deliverables:**
+1. `scripts/build_v4_snapshot.py` — snapshot builder with:
+   - Day-21 windowed features
+   - Leakage trap feature (`total_touches_all`)
+   - Structured missingness injection
+   - Stratified subsampling to 1,000 rows / 30% conversion
+   - Column selection and renaming
+2. `scripts/validate_v4_dataset.py` — validation script per validation spec
+3. Generated `lead_scoring_intro_v4.csv` (in datasets repo, not leadforge)
+
+**Acceptance criteria:**
+- [ ] Build script produces 1,000 rows × 18 columns
+- [ ] Conversion rate is 30% (±1%)
+- [ ] `total_touches_all` uses full 90-day data (leakage trap)
+- [ ] `web_sessions` missing rate for outbound > 3× inbound rate
+- [ ] `seniority` missing rate for partner_referral > 3× others
+- [ ] `days_since_last_touch` has structural + injected NaNs
+- [ ] Validation script passes all mandatory checks
+- [ ] Baseline LR AUC (without trap) in [0.65, 0.90]
+- [ ] LR AUC boost with trap ≥ 0.03
+- [ ] No deterministic groups (every feature value with n≥50 converts at a rate within [2%, 98%])
+- [ ] Reproducible with seed 42
+
+**Estimated size:** ~350 lines (build script) + ~200 lines (validator).
+
+**PR:** Single PR: `feat: v4 build pipeline + validation`
+
+---
+
+### v4-M3: Documentation + release
+
+**Goal:** Produce the final dataset files and release documentation.
+
+**Deliverables (in leadforge-datasets-private repo):**
+1. `lead_scoring_intro/lead_scoring_intro_v4.csv`
+2. `lead_scoring_intro/RELEASE_v4.md`
+3. Updated `lead_scoring_intro/BACKGROUND.md` (if needed for v4 framing)
+4. Updated `README.md` (dataset index)
+
+**Deliverables (in leadforge repo):**
+1. Updated `.agent-plan.md` reflecting completion
+
+**Acceptance criteria:**
+- [ ] CSV passes all validation checks
+- [ ] RELEASE_v4.md documents snapshot day, target definition, changes from v3, leakage trap
+- [ ] README in datasets repo marks v4 as recommended
+- [ ] Previous versions marked as superseded
+
+**PR:** Two PRs (one per repo).
+
+---
+
+## Dependency graph
+
+```
+v4-M0 (this PR)
+  └── v4-M1 (engine changes)
+        └── v4-M2 (build pipeline + validation)
+              └── v4-M3 (docs + release)
+```
+
+Strictly sequential — each milestone depends on the previous.
+
+---
+
+## Timeline estimate
+
+No time estimates are given, per project convention. The work spans 4 milestones and 5 PRs (v4-M3 needs one PR per repo); the code-bearing PRs are estimated at ~400–550 lines each.
+
+---
+
+## What this plan does NOT do
+
+- Does not change the simulation loop (`engine.py` daily step logic)
+- Does not change the relational bundle format
+- Does not change exposure modes
+- Does not add new recipes
+- Does not implement M12 (CLI polish) — deferred
+- Does not implement the engine fix for `is_sql=False → never converts` (deferred to a separate issue; v4 avoids `is_sql` entirely)
diff --git a/docs/v4/lead_scoring_v4_requirements.md b/docs/v4/lead_scoring_v4_requirements.md
new file mode 100644
index 0000000..ff2e146
--- /dev/null
+++ b/docs/v4/lead_scoring_v4_requirements.md
@@ -0,0 +1,131 @@
+# Lead Scoring Dataset v4 — Requirements
+
+## Purpose
+
+This document defines the requirements for the **v4 lead scoring intro dataset**, the primary pedagogical output of leadforge for a BA-level intro ML course. It is informed by three prior dataset iterations (v1–v3) and the lessons learned from each.
+
+## Prior version history and lessons
+
+| Version | Key issue | What we learned |
+|---|---|---|
+| v1 | `funnel_stage` contained `closed_won`/`closed_lost` — perfect leakage | Must validate that no single feature determines the target |
+| v2 | Snapshot at day 90 with 90-day target — post-mortem, not prediction | Snapshot must be strictly earlier than outcome horizon |
+| v2 | `reached_sql=0` → 0% conversion (n=127); `has_opportunity=1` → 0% (n=235) | Binary proxies from engine invariants create deterministic groups |
+| v3 | Day-21 snapshot + non-deterministic proxies — clean but AUC only 0.62 | Engine's intro difficulty produces flat category effects; early features lack signal |
+
+## v4 requirements
+
+### R1 — Operational decision framing (capacity + value)
+
+**Problem:** v1–v3 frame lead scoring as pure classification. Real lead scoring is a **decision tool** — ranking leads by expected value, not just probability.
+
+**Requirement:**
+- Include an `expected_acv` numeric feature (estimated annual contract value) available at snapshot time.
+- The feature must be derived from the opportunity table (for leads with an opportunity by snapshot) or from account-level heuristics (employee band → ACV range midpoint) for leads without one.
+- This enables students to compute `expected_value = P(conversion) × expected_acv` and practice ranking/top-K selection.
+
+**Engine change needed:** The snapshot builder must join opportunity ACV data gated by snapshot day, with a fallback to account-band heuristic ACV.
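+
+A minimal sketch of the R1 fallback rule, assuming the opportunity's creation day is known relative to lead creation. The band labels, the ACV ranges in `ACV_BAND_RANGES`, and all function and parameter names are hypothetical placeholders, not values taken from the recipe.
+
+```python
+from typing import Optional
+
+# Hypothetical employee-band -> (low, high) ACV ranges, for illustration only.
+ACV_BAND_RANGES = {
+    "1-50": (5_000, 15_000),
+    "51-200": (15_000, 45_000),
+    "201-1000": (45_000, 120_000),
+    "1000+": (120_000, 300_000),
+}
+
+
+def expected_acv(
+    employee_band: str,
+    opp_acv: Optional[float],
+    opp_created_day: Optional[int],
+    snapshot_day: int = 21,
+) -> float:
+    """Return opportunity ACV if the opportunity existed by snapshot day,
+    otherwise the midpoint of the account's ACV band."""
+    opp_visible = (
+        opp_acv is not None
+        and opp_created_day is not None
+        and opp_created_day <= snapshot_day
+    )
+    if opp_visible:
+        return float(opp_acv)
+    low, high = ACV_BAND_RANGES[employee_band]
+    return (low + high) / 2
+```
+
+The snapshot-day gate matters here: an opportunity opened after the snapshot must fall back to the band midpoint, or `expected_acv` itself would leak future information.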
+
+### R2 — Safe temporal / momentum features
+
+**Problem:** v1–v3 engagement features are cumulative counts with no temporal shape. Real lead scoring uses recency and momentum signals.
+
+**Requirement:**
+- Include exactly one momentum feature: `touches_week_1` (touches in days 0–7 after lead creation).
+- This is strictly pre-snapshot (snapshot is at day 21+) and gives students a "first-week intensity" signal to compare against total touches.
+- Additionally, `days_since_first_touch` (snapshot_day minus day of first touch) provides a lead-age signal.
+
+**Engine change needed:** The snapshot builder must compute windowed aggregates from event timestamps.
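+
+The windowed aggregation could look roughly like the sketch below. It assumes each touch is represented by its integer day offset from lead creation; the function name and return shape are illustrative, not the engine's actual API.
+
+```python
+from typing import Optional, Sequence
+
+
+def windowed_touch_features(touch_days: Sequence[int], snapshot_day: int = 21) -> dict:
+    """Compute R2 features from touch day offsets (days since lead creation)."""
+    # Gate on the snapshot first so no post-snapshot event can leak in.
+    visible = [d for d in touch_days if 0 <= d <= snapshot_day]
+    # First-week intensity: touches on days 0-7 inclusive.
+    touches_week_1 = sum(1 for d in visible if d <= 7)
+    first: Optional[int] = min(visible) if visible else None
+    return {
+        "touches_week_1": touches_week_1,
+        # None when the lead has no touches by snapshot day.
+        "days_since_first_touch": None if first is None else snapshot_day - first,
+    }
+```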
+
+### R3 — Structured missingness (not only MCAR)
+
+**Problem:** v1–v3 inject missingness randomly (MCAR). Real CRM data has structured gaps.
+
+**Requirement:** Implement three missingness patterns:
+1. **Natural (structural):** `days_since_last_touch` is NaN when the lead has no touches recorded by snapshot day (`inbound_touches + outbound_touches == 0`). This pattern already exists and must be preserved.
+2. **Conditional on source:** `web_sessions` is missing for ~15% of `sdr_outbound` leads (CRM tracking often not set up for outbound-sourced leads) but only ~2% of `inbound_marketing` leads.
+3. **Role data gap:** `seniority` is missing for ~8% of `partner_referral` leads (referral partners don't always provide full contact details).
+
+**Engine change needed:** Missingness injection in the snapshot builder, conditioned on feature values.
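+
+A sketch of source-conditioned injection for patterns 2 and 3, using plain dict rows and a seeded RNG. The rates come from the requirement above; treating any unlisted `lead_source` as a 0% rate is an assumption of this sketch, as are the names `MISSING_RATES` and `inject_missingness`.
+
+```python
+import random
+
+# column -> {lead_source: probability of blanking the value}
+MISSING_RATES = {
+    "web_sessions": {"sdr_outbound": 0.15, "inbound_marketing": 0.02},
+    "seniority": {"partner_referral": 0.08},
+}
+
+
+def inject_missingness(rows, seed=42):
+    """Return copies of `rows` with values blanked according to MISSING_RATES."""
+    rng = random.Random(seed)  # local RNG keeps the result reproducible
+    out = []
+    for row in rows:
+        row = dict(row)  # copy so the caller's data is not mutated
+        for column, rates in MISSING_RATES.items():
+            # Assumption: sources not listed for a column are never blanked.
+            rate = rates.get(row["lead_source"], 0.0)
+            # Draw once per (row, column) so the random stream does not
+            # depend on which rows happen to be eligible.
+            if rng.random() < rate:
+                row[column] = None
+        out.append(row)
+    return out
+```
+
+Drawing a random number for every row and column, even when the rate is zero, keeps the output stable if the rate table is later extended.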
+
+### R4 — Deliberate leakage trap
+
+**Problem:** Students need to practice identifying leakage, but v1–v3 either have accidental leakage (bad) or none at all (missed teaching opportunity).
+
+**Requirement:**
+- Include one feature `total_touches_all` that counts **all** touches over the full 90-day window, not just up to snapshot.
+- This feature is strongly predictive (uses future data) but not perfectly deterministic (it correlates with but doesn't fully determine conversion).
+- The feature MUST be clearly labeled as "intentionally invalid — included for leakage discussion" in `RELEASE_v4.md` and the feature dictionary.
+- The validation script must flag it, but the v4 build script intentionally includes it.
+- The `BACKGROUND.md` / student instructions must NOT reveal the trap — students should discover it through EDA.
+
+**Engine change needed:** The snapshot builder computes a second touch count using the full horizon.
+
+### R5 — Reduce redundancy
+
+**Problem:** `total_touches = inbound_touches + outbound_touches` is a perfect linear dependency. It can confuse students, and it costs linear models a redundant degree of freedom.
+
+**Requirement:**
+- Drop `total_touches` from v4. Keep `inbound_touches` and `outbound_touches` as the touch breakdown.
+- Note: `total_touches_all` (the leakage trap from R4) is a different feature and is kept.
+- Document this as a teaching point: "you can derive total from inbound + outbound."
+
+### R6 — Stronger category signal
+
+**Problem:** At intro difficulty, category conversion rates span only 2–11%. This makes the dataset nearly impossible to model well (AUC ~0.62).
+
+**Requirement:**
+- The engine must produce category-level conversion rate spreads in the 15–25% range (15% is the minimum) for key features (`contact_role`, `company_revenue`, `seniority`).
+- Target baseline LR AUC: **0.70–0.85** (after snapshot + subsampling).
+- This requires engine changes to the difficulty profile or mechanism weights, not just post-hoc manipulation.
+
+**Engine change needed:** Adjust intro difficulty profile or mechanism policy to produce wider category effects.
+
+### R7 — Robust automated validation
+
+**Requirement:** The v4 dataset must pass all of the following automated checks:
+
+| Check | Criterion |
+|---|---|
+| No banned columns | No `current_stage`, `funnel_stage`, `conversion_timestamp`, `is_sql` |
+| No deterministic groups | For every feature value with n≥50: conversion rate in [2%, 98%] |
+| Conversion rate | In [15%, 40%] |
+| Baseline LR AUC | In [0.65, 0.90] (all features except leakage trap) |
+| Leakage trap AUC boost | AUC with trap > AUC without trap by ≥0.03 |
+| Missingness per column | Each column with nulls: 1–15% missing |
+| Missingness structure | `web_sessions` missing rate for `sdr_outbound` > 3× rate for `inbound_marketing` |
+| Row count | Exactly 1,000 |
+| Column count | Exactly 18 (17 features + 1 target) |
+| Reproducibility | Same seed → identical output |
+
+## v4 target column set
+
+| # | Column | Type | Source | Notes |
+|---|---|---|---|---|
+| 1 | `industry` | categorical | account | 4 values |
+| 2 | `region` | categorical | account | US, UK |
+| 3 | `company_size` | categorical | account | 4 bands |
+| 4 | `company_revenue` | categorical | account | 4 bands |
+| 5 | `contact_role` | categorical | contact | 4 roles |
+| 6 | `seniority` | categorical | contact | 5 levels (~8% missing for partner_referral) |
+| 7 | `lead_source` | categorical | lead | 3 channels |
+| 8 | `opportunity_created` | binary 0/1 | derived | Opp opened by snapshot day |
+| 9 | `demo_completed` | binary 0/1 | derived | Demo done by snapshot day |
+| 10 | `expected_acv` | numeric | derived | Opp ACV if available, else band midpoint (R1) |
+| 11 | `inbound_touches` | integer | events ≤ snapshot | Inbound touchpoints |
+| 12 | `outbound_touches` | integer | events ≤ snapshot | Outbound touchpoints |
+| 13 | `touches_week_1` | integer | events ≤ day 7 | First-week touch intensity (R2) |
+| 14 | `web_sessions` | integer | events ≤ snapshot | Sessions (~15% missing for outbound, ~2% inbound) |
+| 15 | `sales_activities` | integer | events ≤ snapshot | Sales activities count |
+| 16 | `days_since_last_touch` | float | events ≤ snapshot | Natural NaN when no touches |
+| 17 | `total_touches_all` | integer | **ALL events** | ⚠️ LEAKAGE TRAP — uses full 90-day window |
+| 18 | `converted` | binary 0/1 | target | Converted within 90 days |
+
+Total: 17 features + 1 target = 18 columns.
+
+## Non-goals for v4
+
+- v4 does NOT require engine changes to the simulation loop itself (stage transitions, churn, conversion hazard).
+- v4 does NOT change the relational bundle format or task splits.
+- v4 does NOT require a new recipe — it uses `b2b_saas_procurement_v1` with adjusted difficulty tuning.
+- v4 does NOT need to change the `student_public` / `research_instructor` exposure modes.
diff --git a/docs/v4/validation_spec.md b/docs/v4/validation_spec.md
new file mode 100644
index 0000000..4c79c62
--- /dev/null
+++ b/docs/v4/validation_spec.md
@@ -0,0 +1,108 @@
+# v4 Validation Specification
+
+## Overview
+
+v4 validation operates at two levels:
+1. **Engine-level validation** (existing `leadforge validate` harness) — structural checks on bundles.
+2. **Dataset-level validation** (new `scripts/validate_v4_dataset.py`) — checks specific to the simplified CSV output.
+
+This document specifies the dataset-level validation for v4.
+
+---
+
+## Mandatory checks
+
+### Check 1: No banned columns
+
+The CSV must NOT contain any of:
+- `current_stage`, `funnel_stage` — outcome-stage leakage
+- `conversion_timestamp` — direct outcome
+- `is_sql` — engine invariant creates deterministic groups
+- `is_mql` — zero variance
+- `lead_created_at` — timestamp that could be used to reverse-engineer temporal info
+- Any column whose name ends in `_id` (opaque identifiers, not features)
+
+**Implementation:** Set intersection check on column names.
+
+### Check 2: No deterministic feature groups
+
+For every feature (categorical AND binary), for every value with n ≥ 50:
+- Conversion rate must be in [0.02, 0.98].
+
+This catches:
+- `reached_sql=0` → 0% (observed in v2)
+- `has_opportunity=1` → 0% (observed in v2)
+- Any future deterministic pattern
+
+**Implementation:** `groupby(feature)[target].agg(['mean', 'count'])`, filter to count ≥ 50, check bounds.
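+
+For illustration, a dependency-free equivalent of that groupby check over a list of dict rows. The function name, the tuple shape of the result, and the choice to skip `None` values (mirroring how pandas `groupby` drops NaN keys by default) are assumptions of this sketch.
+
+```python
+from collections import defaultdict
+
+
+def deterministic_groups(rows, features, target="converted", min_n=50, lo=0.02, hi=0.98):
+    """Return (feature, value, n, rate) for each large group outside [lo, hi]."""
+    violations = []
+    for feature in features:
+        stats = defaultdict(lambda: [0, 0])  # value -> [count, conversions]
+        for row in rows:
+            value = row[feature]
+            if value is None:  # missing values do not form a group
+                continue
+            stats[value][0] += 1
+            stats[value][1] += row[target]
+        for value, (n, conversions) in stats.items():
+            rate = conversions / n
+            if n >= min_n and not (lo <= rate <= hi):
+                violations.append((feature, value, n, rate))
+    return violations
+```
+
+An empty result means the check passes; each returned tuple names the offending feature value so the report can point at it directly.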
+
+### Check 3: Conversion rate realism
+
+- Overall conversion rate must be in [0.15, 0.40].
+
+### Check 4: Baseline model AUC (without leakage trap)
+
+- Train a logistic regression on all features EXCEPT `total_touches_all`.
+- AUC must be in [0.65, 0.90].
+- If AUC < 0.65: features lack signal (category effects too flat).
+- If AUC > 0.90: likely residual leakage.
+
+### Check 5: Leakage trap effectiveness
+
+- Train a logistic regression with ALL features including `total_touches_all`.
+- AUC must be at least 0.03 higher than the clean model from Check 4.
+- If the trap doesn't boost AUC, it's not an effective teaching tool.
+
+### Check 6: Missingness structure
+
+- `web_sessions` must have nulls.
+- Missing rate for `web_sessions` among `sdr_outbound` leads must be > 3× the rate among `inbound_marketing` leads.
+- `seniority` must have nulls.
+- Missing rate for `seniority` among `partner_referral` leads must be > 3× the rate among non-`partner_referral` leads.
+- `days_since_last_touch` must have nulls.
+- No column should have > 20% missing.
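+
+A sketch of how Check 6 might be evaluated on dict rows where a missing value is `None`. The helper names and the returned dict of named booleans are hypothetical; the thresholds are the ones listed above.
+
+```python
+def missing_rate(rows, column, predicate=lambda r: True):
+    """Share of rows matching `predicate` whose `column` is None."""
+    group = [r for r in rows if predicate(r)]
+    if not group:
+        return 0.0
+    return sum(r[column] is None for r in group) / len(group)
+
+
+def check_missingness_structure(rows, ratio=3.0, max_missing=0.20):
+    """Evaluate each Check 6 condition; all values must be True to pass."""
+    outbound = missing_rate(rows, "web_sessions", lambda r: r["lead_source"] == "sdr_outbound")
+    inbound = missing_rate(rows, "web_sessions", lambda r: r["lead_source"] == "inbound_marketing")
+    partner = missing_rate(rows, "seniority", lambda r: r["lead_source"] == "partner_referral")
+    others = missing_rate(rows, "seniority", lambda r: r["lead_source"] != "partner_referral")
+    columns = rows[0].keys() if rows else []
+    return {
+        "web_sessions_has_nulls": missing_rate(rows, "web_sessions") > 0,
+        "web_sessions_ratio": outbound > ratio * inbound,
+        "seniority_has_nulls": missing_rate(rows, "seniority") > 0,
+        "seniority_ratio": partner > ratio * others,
+        "days_since_last_touch_has_nulls": missing_rate(rows, "days_since_last_touch") > 0,
+        "max_missing_ok": all(missing_rate(rows, c) <= max_missing for c in columns),
+    }
+```
+
+Returning one boolean per condition, rather than a single pass/fail, lets the validator report exactly which part of the missingness structure broke.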
+
+### Check 7: Shape constraints
+
+- Exactly 1,000 rows.
+- 18 columns (17 features + 1 target).
+
+### Check 8: Reproducibility
+
+- Running the build script twice with the same seed produces identical output (byte-level CSV comparison).
+
+---
+
+## Warning checks (non-fatal)
+
+### Warning 1: Leakage trap is labeled
+
+- Check that the feature dictionary (if present) marks `total_touches_all` with `leakage_risk: True`.
+
+### Warning 2: Column redundancy
+
+- Warn if `inbound_touches + outbound_touches` correlates > 0.99 with any other column.
+
+### Warning 3: Low-variance features
+
+- Warn if any feature has < 3 unique values (excluding binary features).
+
+---
+
+## Integration with existing validation
+
+The engine-level `leadforge validate` harness (`validation/bundle_checks.py`) continues to validate the full Parquet bundle. The v4 dataset validator is a separate script for the simplified CSV output.
+
+If engine changes add new features to `LEAD_SNAPSHOT_FEATURES`, the existing `validation/realism.py` checks (non-negative counts, valid booleans, stage diversity) automatically cover them.
+
+---
+
+## Validator script interface
+
+```bash
+python scripts/validate_v4_dataset.py lead_scoring_intro/lead_scoring_intro_v4.csv
+```
+
+Exit code 0 = all mandatory checks pass (warning checks do not affect the exit code). Exit code 1 = at least one mandatory check failed.
+
+Output format: structured report showing each check name, status (PASS/FAIL/WARN), and details.