leadforge-dev · shaypal5 · Apr 29, 2026 · Apr 29, 2026 · Copilot · Apr 29, 2026
diff --git a/.agent-plan.md b/.agent-plan.md
@@ -10,22 +10,90 @@
 
 ---
 
-## Next Up — Milestone 12: CLI polish + JSON output (v0.5.0)
+## Next Up — v4 Lead Scoring Dataset
 
-Goal: Polish CLI commands with JSON output mode, richer help text, and progress feedback.
+The primary focus is producing a v4 lead scoring dataset that fixes the issues found in v1–v3 datasets. This requires targeted engine changes followed by dataset build scripts.
 
-- [ ] Add `--json` flag to `inspect` and `validate` for machine-readable output
-- [ ] Add `--strict` flag to `validate` to control whether realism checks are errors vs warnings
-- [ ] Improve CLI help text and error messages
-- [ ] Tests for JSON output mode
+See `docs/v4/implementation_plan.md` for full details.
+
+### v4-M0: Requirements + planning ⬜ (this PR)
+
+- [x] `docs/v4/lead_scoring_v4_requirements.md`
+- [x] `docs/v4/dataset_contract.md`
+- [x] `docs/v4/validation_spec.md`
+- [x] `docs/v4/engine_changes_spec.md`
+- [x] `docs/v4/implementation_plan.md`
+- [x] Updated `CLAUDE.md` with repo map + generation commands
+- [x] Updated `AGENTS.md` with v4 implementation guide
+- [x] Updated `.agent-plan.md` (this file)
+
+### v4-M1: Engine — category signal + windowed snapshots ⬜
+
+- [ ] Add `category_effect_scale` to difficulty profiles
+- [ ] Apply scale in `mechanisms/policies.py`
+- [ ] Add `snapshot_day` parameter to `render/snapshots.py`
+- [ ] Add new features: `touches_week_1`, `days_since_first_touch`, `expected_acv`
+- [ ] Add new `FeatureSpec` entries to `schema/features.py`
+- [ ] Tests for all changes
+- [ ] Verify category spread ≥15 percentage points (max minus min conversion rate across categories) for key features at intro difficulty
+
+### v4-M2: Build pipeline + validation ⬜
+
+- [ ] `scripts/build_v4_snapshot.py` — day-21 snapshot + leakage trap + structured missingness
+- [ ] `scripts/validate_v4_dataset.py` — full validation per `docs/v4/validation_spec.md`
+- [ ] Generate test dataset and verify all checks pass
+- [ ] LR AUC 0.65–0.90 (without trap); ≥0.03 boost with trap
+
+### v4-M3: Documentation + release ⬜
+
+- [ ] Generate `lead_scoring_intro_v4.csv` (in datasets-private repo)
+- [ ] Write `RELEASE_v4.md`
+- [ ] Update dataset repo README
+- [ ] Update `.agent-plan.md` to reflect completion
+
+---
+
+## Deferred Items
+
+### From existing roadmap (M12–M15)
+
+| Item | Status | Rationale |
+|---|---|---|
+| M12: CLI `--json` flag | Deferred | No consumer needs it yet; add post-v4 |
+| M12: CLI `--strict` flag | Deferred | Per-check control is better than global flag |
+| M12: CLI help text polish | Deferred | Low priority vs dataset |
+| M14: Sample bundle commit | Absorbed into v4-M3 | v4 dataset IS the sample |
+| M14: Notebook 1 (inspecting world) | Deferred | Do after v4 ships |
+| M14: Notebook 2 (lead scoring baseline) | Deferred | v4 validation script covers this |
+| M14: Notebook 3 (public vs instructor) | Discarded | No current audience |
+| M14: Notebook 4 (recipe customization) | Discarded | Premature |
+| M15: Docs polish + v1.0 RC | Deferred | Do after v4 ships |
+
+### From post-v1 list
+
+- Second vertical
+- LTV labels as first-class task outputs
+- Continuous-time / richer event engine
+- Plugin architecture
+- External-API enrichment
+- Web UI or dashboard
+- Engine fix: `is_sql=False` → never converts (deterministic invariant)
 
 ---
 
 ## Context Pointers
 
-- Milestone 12 scope: `docs/leadforge_implementation_plan.md` §10 "Milestone 12"
+- v4 requirements: `docs/v4/lead_scoring_v4_requirements.md`
+- v4 dataset contract: `docs/v4/dataset_contract.md`
+- v4 engine changes: `docs/v4/engine_changes_spec.md`
+- v4 validation spec: `docs/v4/validation_spec.md`
+- v4 implementation plan: `docs/v4/implementation_plan.md`
+- Existing roadmap: `docs/leadforge_implementation_plan.md`
 - CLI commands: `leadforge/cli/commands/`
 - Validation modules: `leadforge/validation/`
+- Snapshot builder: `leadforge/render/snapshots.py`
+- Mechanism policy: `leadforge/mechanisms/policies.py`
+- Difficulty profiles: `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml`
 
 ---
 
@@ -156,16 +224,3 @@ Goal: Polish CLI commands with JSON output mode, richer help text, and progress
 - `leadforge/recipes/`: registry + `b2b_saas_procurement_v1/recipe.yaml`
 - `.github/workflows/ci.yml`: lint, typecheck, test matrix (3.11 + 3.12) with coverage upload
 - 20 tests passing; ruff + mypy clean
-
----
-
-## Deferred (Post-v1)
-
-- Second vertical
-- LTV labels as first-class task outputs
-- Continuous-time / richer event engine
-- Plugin architecture
-- External-API enrichment
-- Web UI or dashboard
-
-See `docs/leadforge_implementation_plan.md` §10 for the full deferral list.
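
The v4-M2 acceptance criterion in the `.agent-plan.md` hunk above (LR AUC between 0.65 and 0.90 without the trap, at least 0.03 higher with it) can be checked mechanically once both models have been scored. The following is a minimal sketch, assuming held-out labels and scores are available as plain Python lists. `roc_auc` and `passes_auc_gate` are hypothetical helper names, not part of leadforge, and the rank-based AUC stands in for whatever metric implementation the validation script ends up using.

```python
def roc_auc(labels, scores):
    """Rank-based ROC AUC (Mann-Whitney U), using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def passes_auc_gate(auc_clean, auc_with_trap, low=0.65, high=0.90, min_boost=0.03):
    """True if the clean AUC is in band and the trap adds at least min_boost."""
    in_band = low <= auc_clean <= high
    boosted = (auc_with_trap - auc_clean) >= min_boost
    return in_band and boosted
```

Keeping the band and the boost as separate named conditions makes it easy to report which half of the criterion failed.
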
diff --git a/AGENTS.md b/AGENTS.md
@@ -25,3 +25,78 @@ Do **not** leave threads unresolved after the commit is pushed.
 ## Branch & PR Conventions
 
 See CLAUDE.md for the full mandatory branch/PR workflow (branch → commit → update `.agent-plan.md` → open PR).
+
+---
+
+## v4 Implementation Guide
+
+### What is v4?
+
+A pedagogically improved lead scoring dataset (single CSV) for an intro ML course. The engine changes are small and targeted. See `docs/v4/` for full specs.
+
+### Implementation order
+
+```
+v4-M0: planning (this PR)
+  └── v4-M1: engine changes (category signal + windowed snapshots)
+        └── v4-M2: build pipeline + validation scripts
+              └── v4-M3: dataset generation + release docs
+```
+
+### Key files to modify per milestone
+
+**v4-M1 (engine):**
+- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_effect_scale`
+- `leadforge/mechanisms/policies.py` — apply scale to categorical influences
+- `leadforge/render/snapshots.py` — add `snapshot_day` param, windowed aggregation
+- `leadforge/schema/features.py` — add new FeatureSpec entries
+- Tests in `tests/mechanisms/` and `tests/render/`
+
+**v4-M2 (build pipeline):**
+- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
+- `scripts/validate_v4_dataset.py` (new) — dataset-level validation
+- These live in the leadforge repo (not datasets-private)
+
+**v4-M3 (release):**
+- Work in `leadforge-datasets-private` repo
+- `lead_scoring_intro/lead_scoring_intro_v4.csv`
+- `lead_scoring_intro/RELEASE_v4.md`
+
+### Coding conventions for v4
+
+1. **Backward compatibility:** All engine changes must default to current behavior. New parameters must have defaults that produce identical output when unset.
+2. **No simulation loop changes:** Do not modify the daily step logic in `engine.py`. v4 changes are in mechanism weights and snapshot rendering only.
+3. **Temporal correctness:** Every feature computation must be explicitly gated by snapshot day. Use `event_timestamp <= lead_created_at + snapshot_day` — never `<`.
+4. **Test coverage:** Every new parameter and feature must have unit tests. Test both `snapshot_day=None` (backward compat) and `snapshot_day=21` (v4 mode).
+5. **Determinism:** All new stochastic operations must use seeded RNG. Verify with a determinism test (same seed → identical output).
+
+### Validation checklist for v4 dataset
+
+Before declaring v4-M2 complete, the dataset must pass:
+
+- [ ] 1,000 rows, 18 columns
+- [ ] 30% conversion rate (±1 percentage point, i.e. 29–31%)
+- [ ] No deterministic groups (n≥50 at 0% or 100% conversion)
+- [ ] LR AUC 0.65–0.90 (without leakage trap)
+- [ ] LR AUC boost ≥0.03 when leakage trap included
+- [ ] `web_sessions` missingness: outbound rate > 3× inbound rate
+- [ ] `seniority` missingness: partner_referral rate > 3× others
+- [ ] Reproducible with seed 42
+- [ ] `total_touches_all` uses full 90-day data (confirmed by AUC boost)
+
+### How to test engine changes locally
+
+```bash
+# Quick smoke test: generate a small bundle and inspect
+leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 --difficulty intro --n-leads 1000 --out /tmp/test_bundle
+leadforge validate /tmp/test_bundle
+
+# Check category signal spread
+python -c "
+import pandas as pd
+df = pd.read_parquet('/tmp/test_bundle/tasks/converted_within_90_days/train.parquet')
+for col in ['role_function', 'seniority', 'estimated_revenue_band']:
+    rates = df.groupby(col)['converted_within_90_days'].mean()
+    print(f'{col}: spread={rates.max()-rates.min():.1%}')
+"
+```
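
Conventions 1 and 3 in the AGENTS.md guide above (new parameters default to current behavior; features are gated by `event_timestamp <= lead_created_at + snapshot_day`) could look like the sketch below. It rests on assumptions: `count_touches` is a hypothetical helper, not the actual `build_snapshot()` code; `snapshot_day` is taken to be an integer number of days, converted to a `timedelta` before the comparison; and `snapshot_day=None` is assumed to mean "no windowing".

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def count_touches(
    event_timestamps: Iterable[datetime],
    lead_created_at: datetime,
    snapshot_day: Optional[int] = None,
) -> int:
    """Count touch events visible at the snapshot.

    snapshot_day=None counts every event (assumed backward-compatible mode);
    an integer keeps only events at or before lead_created_at + snapshot_day days.
    """
    if snapshot_day is None:
        return sum(1 for _ in event_timestamps)
    cutoff = lead_created_at + timedelta(days=snapshot_day)
    # Inclusive bound (<=), matching the stated convention.
    return sum(1 for ts in event_timestamps if ts <= cutoff)


created = datetime(2026, 1, 1)
events = [
    created + timedelta(days=3),
    created + timedelta(days=21),  # exactly on the cutoff: included
    created + timedelta(days=22),  # after the cutoff: excluded
]
full_count = count_touches(events, created)
windowed_count = count_touches(events, created, snapshot_day=21)
```

An event landing exactly on the cutoff is kept because the bound is inclusive, which is the practical difference between `<=` and `<` in rule 3.
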
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -212,7 +212,153 @@ Key abstractions: `Recipe`, `GenerationConfig`, `WorldSpec`, `WorldBundle`, `Exp
 
 ---
 
+## Repository Map
+
+```
+leadforge/                    # Python package root
+├── api/                      # Public API: Generator, Recipe, Bundle
+│   ├── generator.py          # Generator.from_recipe() → .generate() → WorldBundle
+│   ├── recipes.py            # Recipe loading, config resolution
+│   └── bundle.py             # write_bundle() orchestrator
+├── cli/                      # Click CLI
+│   ├── main.py               # CLI entry point
+│   └── commands/             # generate, inspect, validate, list_recipes
+├── core/                     # Foundational utilities
+│   ├── rng.py                # RNGRoot with named substreams
+│   ├── ids.py                # Deterministic ID generation (acct_000001, etc.)
+│   ├── models.py             # GenerationConfig, WorldSpec, WorldBundle
+│   ├── enums.py              # ExposureMode, DifficultyProfile
+│   └── exceptions.py         # Custom exception hierarchy
+├── narrative/                # Vertical narrative (company, market, personas)
+│   ├── spec.py               # NarrativeSpec and sub-spec dataclasses
+│   └── dataset_card.py       # Markdown dataset card renderer
+├── schema/                   # Relational data model
+│   ├── entities.py           # 9 entity row dataclasses (AccountRow, LeadRow, etc.)
+│   ├── features.py           # LEAD_SNAPSHOT_FEATURES — canonical feature spec
+│   ├── relationships.py      # FK constraints (ALL_CONSTRAINTS)
+│   ├── tasks.py              # SplitSpec, TaskManifest, CONVERTED_WITHIN_90_DAYS
+│   └── dictionaries.py       # Feature dictionary CSV writer
+├── structure/                # Hidden world graph
+│   ├── graph.py              # WorldGraph (DAG wrapper)
+│   ├── motifs.py             # 5 motif families
+│   ├── rewiring.py           # Stochastic graph perturbation
+│   └── sampler.py            # sample_hidden_graph()
+├── mechanisms/               # Node/edge behavior
+│   ├── policies.py           # assign_mechanisms() — motif → MechanismAssignment
+│   ├── hazards.py            # ConversionHazard
+│   ├── transitions.py        # StageSequence, HazardTransition
+│   ├── counts.py             # PoissonIntensity, RecencyDecayIntensity
+│   ├── categorical.py        # CategoricalInfluence, CHANNEL_QUALITY_SCORES
+│   └── scores.py             # LatentScore
+├── simulation/               # World evolution
+│   ├── engine.py             # simulate_world() — 90-day daily loop
+│   ├── state.py              # LeadSimState (per-lead mutable state)
+│   └── population.py         # build_population() — accounts, contacts, leads
+├── render/                   # Bundle output
+│   ├── snapshots.py          # build_snapshot() — ML-ready lead table
+│   ├── relational.py         # to_dataframes() — 9-table dict
+│   ├── tasks.py              # write_task_splits() — train/valid/test Parquet
+│   └── manifests.py          # build_manifest(), write_manifest()
+├── exposure/                 # Truth filtering
+│   ├── modes.py              # apply_exposure() dispatch
+│   ├── metadata.py           # write_metadata_dir() for instructor mode
+│   └── filters.py            # BundleFilter, FILTERS dict
+├── validation/               # Bundle quality checks
+│   ├── bundle_checks.py      # validate_bundle() orchestrator
+│   ├── invariants.py         # Determinism + exposure monotonicity
+│   ├── realism.py            # Conversion rates, feature ranges, stage diversity
+│   ├── difficulty.py         # Known difficulty profile validation
+│   └── drift.py              # Cross-seed stability
+└── recipes/                  # Recipe definitions
+    └── b2b_saas_procurement_v1/
+        ├── recipe.yaml       # Recipe metadata + defaults
+        ├── narrative.yaml    # Company, product, market, personas, funnel
+        └── difficulty_profiles.yaml  # intro/intermediate/advanced
+```
+
+### Related repos
+
+- **leadforge-datasets-private** — generated dataset archive
+  - `b2b_saas_procurement_v1__intro__seed42/` — full relational bundle
+  - `lead_scoring_intro/` — simplified single-CSV versions (v1–v4)
+  - `scripts/` — build and validation scripts for simplified CSVs
+  - Does **not** serve as the source of truth for simplified-CSV build/validation scripts; those live in the **leadforge** repo.
+
+---
+
+## Generation Workflow
+
+### Generate a full bundle
+
+```bash
+leadforge generate \
+  --recipe b2b_saas_procurement_v1 \
+  --seed 42 \
+  --mode student_public \
+  --difficulty intro \
+  --n-leads 5000 \
+  --out ./out/bundle
+```
+
+### Build a simplified CSV (v4 example)
+
+```bash
+# In leadforge-datasets-private repo:
+python scripts/build_v4_snapshot.py /path/to/bundle lead_scoring_intro/lead_scoring_intro_v4.csv
+```
+
+### Validate a simplified CSV
+
+```bash
+python scripts/validate_v4_dataset.py lead_scoring_intro/lead_scoring_intro_v4.csv
+```
+
+### Validate a full bundle
+
+```bash
+leadforge validate ./out/bundle
+```
+
+---
+
+## student_public Mode Invariants
+
+These are non-negotiable for any dataset published in `student_public` mode:
+
+1. **No post-snapshot features** — all features computed from events ≤ snapshot day only.
+2. **No outcome-stage columns** — `current_stage`, `funnel_stage` with `closed_won`/`closed_lost` are banned.
+3. **No deterministic single-feature mapping** — for any feature value with n≥50, conversion rate must be in [2%, 98%].
+4. **No hidden truth** — latent scores, mechanism parameters, world graph not included.
+5. **No direct outcome columns** — `conversion_timestamp`, `close_outcome` are banned.
+6. **No zero-variance features** — every included feature must have ≥2 distinct values.
+
+Exception: deliberately included leakage traps (e.g., `total_touches_all` in v4) must be clearly documented in release notes and feature dictionary.
+
+---
+
+## How to Add New Features to the Snapshot
+
+1. Add a `FeatureSpec` entry to `LEAD_SNAPSHOT_FEATURES` in `leadforge/schema/features.py`.
+2. Compute the feature value in `build_snapshot()` in `leadforge/render/snapshots.py`.
+3. If the feature needs new event data, add it to the simulation loop in `leadforge/simulation/engine.py`.
+4. Update `leadforge/schema/dictionaries.py` if the feature dictionary format changes.
+5. Run `pytest` and `leadforge validate` on a generated bundle.
+6. Update the feature dictionary CSV description.
+
+---
+
+## v4 Dataset Plan
+
+The current focus is producing a v4 lead scoring intro dataset. See `docs/v4/` for:
+- `lead_scoring_v4_requirements.md` — what v4 must achieve
+- `dataset_contract.md` — schema contract and temporal gates
+- `engine_changes_spec.md` — what changes in the engine
+- `validation_spec.md` — automated validation checks
+- `implementation_plan.md` — milestone breakdown
+
+---
+
 ## Reference Docs
 - Design decisions: `docs/leadforge_design_doc.md`
 - Architecture/spec: `docs/leadforge_architecture_spec.md`
 - Implementation roadmap: `docs/leadforge_implementation_plan.md`
+- v4 dataset plan: `docs/v4/implementation_plan.md`
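
Invariants 3 and 6 in the `student_public` list in the CLAUDE.md hunk above lend themselves to an automated check. The following is a minimal sketch, assuming the dataset is loaded as a list of dicts with a 0/1 target column and that each feature is categorical or already binned. `find_invariant_violations` is a hypothetical helper, not part of `leadforge/validation/`.

```python
from collections import defaultdict


def find_invariant_violations(rows, target, min_n=50, low=0.02, high=0.98):
    """Return (feature, value, reason) tuples for invariants 3 and 6."""
    violations = []
    features = [name for name in rows[0] if name != target]
    for feature in features:
        counts = defaultdict(lambda: [0, 0])  # value -> [n, conversions]
        for row in rows:
            bucket = counts[row[feature]]
            bucket[0] += 1
            bucket[1] += row[target]
        # Invariant 6: every feature needs at least two distinct values.
        if len(counts) < 2:
            violations.append((feature, None, "zero variance"))
        # Invariant 3: groups with n >= min_n must not pin the outcome.
        for value, (n, converted) in counts.items():
            rate = converted / n
            if n >= min_n and not (low <= rate <= high):
                violations.append((feature, value, f"rate {rate:.1%} at n={n}"))
    return violations
```

A documented leakage trap such as `total_touches_all` would need to be excluded from `rows` (or from the reported violations) before treating the result as a failure, since the exception above permits it.
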