95 changes: 75 additions & 20 deletions .agent-plan.md
@@ -10,22 +10,90 @@

---

## Next Up — Milestone 12: CLI polish + JSON output (v0.5.0)
## Next Up — v4 Lead Scoring Dataset

Goal: Polish CLI commands with JSON output mode, richer help text, and progress feedback.
The primary focus is producing a v4 lead scoring dataset that fixes the issues found in v1–v3 datasets. This requires targeted engine changes followed by dataset build scripts.

- [ ] Add `--json` flag to `inspect` and `validate` for machine-readable output
- [ ] Add `--strict` flag to `validate` to control whether realism checks are errors vs warnings
- [ ] Improve CLI help text and error messages
- [ ] Tests for JSON output mode
See `docs/v4/implementation_plan.md` for full details.

### v4-M0: Requirements + planning ⬜ (this PR)

- [x] `docs/v4/lead_scoring_v4_requirements.md`
- [x] `docs/v4/dataset_contract.md`
- [x] `docs/v4/validation_spec.md`
- [x] `docs/v4/engine_changes_spec.md`
- [x] `docs/v4/implementation_plan.md`
- [x] Updated `CLAUDE.md` with repo map + generation commands
- [x] Updated `AGENTS.md` with v4 implementation guide
- [x] Updated `.agent-plan.md` (this file)

Comment on lines +19 to +28

Copilot AI Apr 29, 2026

v4-M0 is labeled as ⬜, but every deliverable underneath is checked off. Consider marking v4-M0 as complete (✅/done) once this PR lands, or changing the wording to indicate it’s “in review” to keep status signaling consistent.

Copilot uses AI. Check for mistakes.

### v4-M1: Engine — category signal + windowed snapshots ⬜

- [ ] Add `category_effect_scale` to difficulty profiles
- [ ] Apply scale in `mechanisms/policies.py`
- [ ] Add `snapshot_day` parameter to `render/snapshots.py`
- [ ] Add new features: `touches_week_1`, `days_since_first_touch`, `expected_acv`
- [ ] Add new `FeatureSpec` entries to `schema/features.py`
- [ ] Tests for all changes
- [ ] Verify category spread ≥15% for key features at intro difficulty

### v4-M2: Build pipeline + validation ⬜

- [ ] `scripts/build_v4_snapshot.py` — day-21 snapshot + leakage trap + structured missingness
- [ ] `scripts/validate_v4_dataset.py` — full validation per `docs/v4/validation_spec.md`
- [ ] Generate test dataset and verify all checks pass
- [ ] LR AUC 0.65–0.90 (without trap); ≥0.03 boost with trap

### v4-M3: Documentation + release ⬜

- [ ] Generate `lead_scoring_intro_v4.csv` (in datasets-private repo)
- [ ] Write `RELEASE_v4.md`
- [ ] Update dataset repo README
- [ ] Update `.agent-plan.md` to reflect completion

---

## Deferred Items

### From existing roadmap (M12–M15)

| Item | Status | Rationale |
|---|---|---|
| M12: CLI `--json` flag | Deferred | No consumer needs it yet; add post-v4 |
| M12: CLI `--strict` flag | Deferred | Per-check control is better than global flag |
| M12: CLI help text polish | Deferred | Low priority vs dataset |
| M14: Sample bundle commit | Absorbed into v4-M3 | v4 dataset IS the sample |
| M14: Notebook 1 (inspecting world) | Deferred | Do after v4 ships |
| M14: Notebook 2 (lead scoring baseline) | Deferred | v4 validation script covers this |
| M14: Notebook 3 (public vs instructor) | Discarded | No current audience |
| M14: Notebook 4 (recipe customization) | Discarded | Premature |
| M15: Docs polish + v1.0 RC | Deferred | Do after v4 ships |

### From post-v1 list

- Second vertical
- LTV labels as first-class task outputs
- Continuous-time / richer event engine
- Plugin architecture
- External-API enrichment
- Web UI or dashboard
- Engine fix: `is_sql=False` → never converts (deterministic invariant)

---

## Context Pointers

- Milestone 12 scope: `docs/leadforge_implementation_plan.md` §10 "Milestone 12"
- v4 requirements: `docs/v4/lead_scoring_v4_requirements.md`
- v4 dataset contract: `docs/v4/dataset_contract.md`
- v4 engine changes: `docs/v4/engine_changes_spec.md`
- v4 validation spec: `docs/v4/validation_spec.md`
- v4 implementation plan: `docs/v4/implementation_plan.md`
- Existing roadmap: `docs/leadforge_implementation_plan.md`
- CLI commands: `leadforge/cli/commands/`
- Validation modules: `leadforge/validation/`
- Snapshot builder: `leadforge/render/snapshots.py`
- Mechanism policy: `leadforge/mechanisms/policies.py`
- Difficulty profiles: `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml`

---

@@ -156,16 +224,3 @@ Goal: Polish CLI commands with JSON output mode, richer help text, and progress
- `leadforge/recipes/`: registry + `b2b_saas_procurement_v1/recipe.yaml`
- `.github/workflows/ci.yml`: lint, typecheck, test matrix (3.11 + 3.12) with coverage upload
- 20 tests passing; ruff + mypy clean

---

## Deferred (Post-v1)

- Second vertical
- LTV labels as first-class task outputs
- Continuous-time / richer event engine
- Plugin architecture
- External-API enrichment
- Web UI or dashboard

See `docs/leadforge_implementation_plan.md` §10 for the full deferral list.
75 changes: 75 additions & 0 deletions AGENTS.md
@@ -25,3 +25,78 @@ Do **not** leave threads unresolved after the commit is pushed.
## Branch & PR Conventions

See CLAUDE.md for the full mandatory branch/PR workflow (branch → commit → update `.agent-plan.md` → open PR).

---

## v4 Implementation Guide

### What is v4?

A pedagogically improved lead scoring dataset (single CSV) for an intro ML course. The engine changes are small and targeted. See `docs/v4/` for full specs.

### Implementation order

```
v4-M0 (planning PR — already done)
  └── v4-M1: engine changes (category signal + windowed snapshots)
        └── v4-M2: build pipeline + validation scripts
              └── v4-M3: dataset generation + release docs
```

Copilot AI Apr 29, 2026

The implementation order diagram says “v4-M0 (planning PR — already done)”, but in this PR v4-M0 is the work being proposed/merged. Consider rephrasing to something time-invariant (e.g., “v4-M0: planning (this PR)”) to avoid the docs disagreeing with `.agent-plan.md`.

Suggested change: replace `v4-M0 (planning PR — already done)` with `v4-M0: planning (this PR)`.

### Key files to modify per milestone

**v4-M1 (engine):**
- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_effect_scale`
- `leadforge/mechanisms/policies.py` — apply scale to categorical influences
- `leadforge/render/snapshots.py` — add `snapshot_day` param, windowed aggregation
- `leadforge/schema/features.py` — add new FeatureSpec entries
- Tests in `tests/mechanisms/` and `tests/render/`

**v4-M2 (build pipeline):**
- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
- `scripts/validate_v4_dataset.py` (new) — dataset-level validation
- These live in the leadforge repo (not datasets-private)

**v4-M3 (release):**
- Work in `leadforge-datasets-private` repo
- `lead_scoring_intro/lead_scoring_intro_v4.csv`
- `lead_scoring_intro/RELEASE_v4.md`

Comment on lines +56 to +61

Copilot AI Apr 29, 2026

This section states the v4 build/validation scripts live in the leadforge repo (not datasets-private), but CLAUDE.md’s “Related repos” + workflow example place these scripts in leadforge-datasets-private. Please align the docs on a single canonical home (or document how scripts are shared) to avoid contributors building in the wrong repo.

Suggested change, original lines:

- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
- `scripts/validate_v4_dataset.py` (new) — dataset-level validation
- These live in the leadforge repo (not datasets-private)

**v4-M3 (release):**
- Work in `leadforge-datasets-private` repo

Suggested change, proposed lines:

- Work in `leadforge-datasets-private` repo
- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
- `scripts/validate_v4_dataset.py` (new) — dataset-level validation

**v4-M3 (release):**
- Continue in `leadforge-datasets-private` repo for dataset generation and release artifacts

### Coding conventions for v4

1. **Backward compatibility:** All engine changes must default to current behavior. New parameters must have defaults that produce identical output when unset.
2. **No simulation loop changes:** Do not modify the daily step logic in `engine.py`. v4 changes are in mechanism weights and snapshot rendering only.
3. **Temporal correctness:** Every feature computation must be explicitly gated by snapshot day. Use `event_timestamp <= lead_created_at + snapshot_day` — never `<`.
4. **Test coverage:** Every new parameter and feature must have unit tests. Test both `snapshot_day=None` (backward compat) and `snapshot_day=21` (v4 mode).
5. **Determinism:** All new stochastic operations must use seeded RNG. Verify with a determinism test (same seed → identical output).
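
The temporal-correctness and backward-compatibility conventions above can be illustrated with a small sketch. It is not the code in `render/snapshots.py`: the helper name `events_in_window` is hypothetical, and it assumes `snapshot_day` is a whole number of days added to `lead_created_at` as a `timedelta`.

```python
from datetime import datetime, timedelta
from typing import Optional


def events_in_window(event_timestamps, lead_created_at, snapshot_day: Optional[int] = None):
    """Return the events visible at the snapshot (hypothetical helper).

    snapshot_day=None keeps every event, which mirrors the
    backward-compatible default; an integer keeps events with
    timestamp <= lead_created_at + snapshot_day days.
    """
    if snapshot_day is None:
        return list(event_timestamps)
    cutoff = lead_created_at + timedelta(days=snapshot_day)
    # Inclusive comparison, per the convention above: use <=, never <.
    return [ts for ts in event_timestamps if ts <= cutoff]


created = datetime(2026, 1, 1)
touches = [
    created + timedelta(days=3),
    created + timedelta(days=21),  # exactly on the boundary: kept
    created + timedelta(days=40),  # after the snapshot: excluded
]

visible = events_in_window(touches, created, snapshot_day=21)
all_events = events_in_window(touches, created)  # snapshot_day=None
print(len(visible), len(all_events))
```

A unit test pair along these lines (one call with `snapshot_day=None`, one with `snapshot_day=21`) is one way to cover both modes named in convention 4.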

### Validation checklist for v4 dataset

Before declaring v4-M2 complete, the dataset must pass:

- [ ] 1,000 rows, 18 columns
- [ ] 30% conversion rate (±1%)
- [ ] No deterministic groups (n≥50 at 0% or 100% conversion)
- [ ] LR AUC 0.65–0.90 (without leakage trap)
- [ ] LR AUC boost ≥0.03 when leakage trap included
- [ ] `web_sessions` missingness: outbound rate > 3× inbound rate
- [ ] `seniority` missingness: partner_referral rate > 3× others
- [ ] Reproducible with seed 42
- [ ] `total_touches_all` uses full 90-day data (confirmed by AUC boost)
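
Two of the checklist items can be sketched in plain Python as a rough illustration. This is not `scripts/validate_v4_dataset.py`; the column names `lead_source` and `converted` and both helper functions are assumptions made for the example, and missing values are represented as `None`.

```python
def missing_rate(rows, column, group_col, group_value):
    """Share of rows in one group where `column` is missing (None)."""
    group = [r for r in rows if r[group_col] == group_value]
    if not group:
        return 0.0
    return sum(r[column] is None for r in group) / len(group)


def conversion_rate_ok(rows, label, target=0.30, tol=0.01):
    """True when the overall conversion rate is within target +/- tol."""
    rate = sum(bool(r[label]) for r in rows) / len(rows)
    return abs(rate - target) <= tol


# Toy data: 10 outbound leads (4 missing web_sessions) and
# 10 inbound leads (1 missing); 6 of the 20 leads convert.
rows = []
for i in range(10):
    rows.append({"lead_source": "outbound",
                 "web_sessions": None if i < 4 else 5,
                 "converted": i < 3})
for i in range(10):
    rows.append({"lead_source": "inbound",
                 "web_sessions": None if i < 1 else 7,
                 "converted": i < 3})

outbound = missing_rate(rows, "web_sessions", "lead_source", "outbound")
inbound = missing_rate(rows, "web_sessions", "lead_source", "inbound")
missingness_ok = outbound > 3 * inbound  # structured missingness check
rate_ok = conversion_rate_ok(rows, "converted")
print(missingness_ok, rate_ok)
```

The same `missing_rate` helper could be reused for the `seniority` check by comparing the `partner_referral` group against the remaining rows.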

### How to test engine changes locally

```bash
# Quick smoke test: generate a small bundle and inspect
leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 --difficulty intro --n-leads 1000 --out /tmp/test_bundle
leadforge validate /tmp/test_bundle

# Check category signal spread
python -c "
import pandas as pd
df = pd.read_parquet('/tmp/test_bundle/tasks/converted_within_90_days/train.parquet')
for col in ['role_function', 'seniority', 'estimated_revenue_band']:
    rates = df.groupby(col)['converted_within_90_days'].mean()
    print(f'{col}: spread={rates.max()-rates.min():.1%}')
"
```
146 changes: 146 additions & 0 deletions CLAUDE.md
@@ -212,7 +212,153 @@ Key abstractions: `Recipe`, `GenerationConfig`, `WorldSpec`, `WorldBundle`, `Exp

---

## Repository Map

```
leadforge/ # Python package root
├── api/ # Public API: Generator, Recipe, Bundle
│ ├── generator.py # Generator.from_recipe() → .generate() → WorldBundle
│ ├── recipes.py # Recipe loading, config resolution
│ └── bundle.py # write_bundle() orchestrator
├── cli/ # Click CLI
│ ├── main.py # CLI entry point
│ └── commands/ # generate, inspect, validate, list_recipes
├── core/ # Foundational utilities
│ ├── rng.py # RNGRoot with named substreams
│ ├── ids.py # Deterministic ID generation (acct_000001, etc.)
│ ├── models.py # GenerationConfig, WorldSpec, WorldBundle
│ ├── enums.py # ExposureMode, DifficultyProfile
│ └── exceptions.py # Custom exception hierarchy
├── narrative/ # Vertical narrative (company, market, personas)
│ ├── spec.py # NarrativeSpec and sub-spec dataclasses
│ └── dataset_card.py # Markdown dataset card renderer
├── schema/ # Relational data model
│ ├── entities.py # 9 entity row dataclasses (AccountRow, LeadRow, etc.)
│ ├── features.py # LEAD_SNAPSHOT_FEATURES — canonical feature spec
│ ├── relationships.py # FK constraints (ALL_CONSTRAINTS)
│ ├── tasks.py # SplitSpec, TaskManifest, CONVERTED_WITHIN_90_DAYS
│ └── dictionaries.py # Feature dictionary CSV writer
├── structure/ # Hidden world graph
│ ├── graph.py # WorldGraph (DAG wrapper)
│ ├── motifs.py # 5 motif families
│ ├── rewiring.py # Stochastic graph perturbation
│ └── sampler.py # sample_hidden_graph()
├── mechanisms/ # Node/edge behavior
│ ├── policies.py # assign_mechanisms() — motif → MechanismAssignment
│ ├── hazards.py # ConversionHazard
│ ├── transitions.py # StageSequence, HazardTransition
│ ├── counts.py # PoissonIntensity, RecencyDecayIntensity
│ ├── categorical.py # CategoricalInfluence, CHANNEL_QUALITY_SCORES
│ └── scores.py # LatentScore
├── simulation/ # World evolution
│ ├── engine.py # simulate_world() — 90-day daily loop
│ ├── state.py # LeadSimState (per-lead mutable state)
│ └── population.py # build_population() — accounts, contacts, leads
├── render/ # Bundle output
│ ├── snapshots.py # build_snapshot() — ML-ready lead table
│ ├── relational.py # to_dataframes() — 9-table dict
│ ├── tasks.py # write_task_splits() — train/valid/test Parquet
│ └── manifests.py # build_manifest(), write_manifest()
├── exposure/ # Truth filtering
│ ├── modes.py # apply_exposure() dispatch
│ ├── metadata.py # write_metadata_dir() for instructor mode
│ └── filters.py # BundleFilter, FILTERS dict
├── validation/ # Bundle quality checks
│ ├── bundle_checks.py # validate_bundle() orchestrator
│ ├── invariants.py # Determinism + exposure monotonicity
│ ├── realism.py # Conversion rates, feature ranges, stage diversity
│ ├── difficulty.py # Known difficulty profile validation
│ └── drift.py # Cross-seed stability
└── recipes/ # Recipe definitions
  └── b2b_saas_procurement_v1/
    ├── recipe.yaml # Recipe metadata + defaults
    ├── narrative.yaml # Company, product, market, personas, funnel
    └── difficulty_profiles.yaml # intro/intermediate/advanced
```

### Related repos

- **leadforge-datasets-private** — generated dataset archive
- `b2b_saas_procurement_v1__intro__seed42/` — full relational bundle
- `lead_scoring_intro/` — simplified single-CSV versions (v1–v4)
- `scripts/` — build and validation scripts for simplified CSVs

Copilot AI Apr 29, 2026

This section says the simplified-CSV build/validation scripts live under the leadforge-datasets-private repo (`scripts/` there), but AGENTS.md indicates these scripts should live in the leadforge repo. Please align the docs on the canonical location (or document a clear “source-of-truth” + copy/vendoring workflow).

Suggested change, original line:

- `scripts/` — build and validation scripts for simplified CSVs

Suggested change, proposed line:

- Does **not** serve as the source of truth for simplified-CSV build/validation scripts; those live in the **leadforge** repo.

---

## Generation Workflow

### Generate a full bundle

```bash
leadforge generate \
--recipe b2b_saas_procurement_v1 \
--seed 42 \
--mode student_public \
--difficulty intro \
--n-leads 5000 \
--out ./out/bundle
```

### Build a simplified CSV (v4 example)

```bash
# In leadforge-datasets-private repo:
python scripts/build_v4_snapshot.py /path/to/bundle lead_scoring_intro/lead_scoring_intro_v4.csv
```

Comment on lines +304 to +307

Copilot AI Apr 29, 2026

This workflow example runs `python scripts/build_v4_snapshot.py ...` “in leadforge-datasets-private”, which conflicts with AGENTS.md guidance that these scripts live in the leadforge repo. Please reconcile where contributors should implement/run these scripts to prevent drift between repos.

### Validate a simplified CSV

```bash
python scripts/validate_v4_dataset.py lead_scoring_intro/lead_scoring_intro_v4.csv
```

### Validate a full bundle

```bash
leadforge validate ./out/bundle
```

---

## student_public Mode Invariants

These are non-negotiable for any dataset published in `student_public` mode:

1. **No post-snapshot features** — all features computed from events ≤ snapshot day only.
2. **No outcome-stage columns** — `current_stage`, `funnel_stage` with `closed_won`/`closed_lost` are banned.
3. **No deterministic single-feature mapping** — for any feature value with n≥50, conversion rate must be in [2%, 98%].
4. **No hidden truth** — latent scores, mechanism parameters, world graph not included.
5. **No direct outcome columns** — `conversion_timestamp`, `close_outcome` are banned.
6. **No zero-variance features** — every included feature must have ≥2 distinct values.

Exception: deliberately included leakage traps (e.g., `total_touches_all` in v4) must be clearly documented in release notes and feature dictionary.
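
As a rough illustration of invariants 3 and 6, the sketch below flags feature values that map almost deterministically to the outcome and features with a single distinct value. It is not the code in `leadforge/validation/`; the function names and the toy columns (`plan`, `converted`) are assumptions, and the thresholds simply restate the numbers above.

```python
from collections import defaultdict


def deterministic_groups(rows, feature, label, min_n=50, low=0.02, high=0.98):
    """Return feature values with n >= min_n whose conversion rate
    falls outside [low, high] (invariant 3)."""
    counts = defaultdict(lambda: [0, 0])  # value -> [n, conversions]
    for r in rows:
        c = counts[r[feature]]
        c[0] += 1
        c[1] += int(r[label])
    return sorted(
        v for v, (n, k) in counts.items()
        if n >= min_n and not (low <= k / n <= high)
    )


def zero_variance(rows, feature):
    """True when the feature has fewer than 2 distinct values (invariant 6)."""
    return len({r[feature] for r in rows}) < 2


# Toy data: an outcome-stage column leaks the label for one large group.
rows = (
    [{"current_stage": "closed_won", "plan": "basic", "converted": True}] * 60
    + [{"current_stage": "open", "plan": "basic", "converted": i < 18} for i in range(60)]
    + [{"current_stage": "rare", "plan": "basic", "converted": True}] * 10
)

flagged = deterministic_groups(rows, "current_stage", "converted")
constant = zero_variance(rows, "plan")
print(flagged, constant)
```

The `rare` group is not flagged because it has fewer than 50 rows, which matches the n≥50 condition in invariant 3. A deliberately documented leakage trap would need to be excluded from such a check by name.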

---

## How to Add New Features to the Snapshot

1. Add a `FeatureSpec` entry to `LEAD_SNAPSHOT_FEATURES` in `leadforge/schema/features.py`.
2. Compute the feature value in `build_snapshot()` in `leadforge/render/snapshots.py`.
3. If the feature needs new event data, add it to the simulation loop in `leadforge/simulation/engine.py`.
4. Update `leadforge/schema/dictionaries.py` if the feature dictionary format changes.
5. Run `pytest` and `leadforge validate` on a generated bundle.
6. Update the feature dictionary CSV description.
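
Steps 1 and 2 above might look roughly like the following sketch. The real `FeatureSpec` fields and the internals of `build_snapshot()` are not shown in this document, so `FeatureSpecSketch`, `SNAPSHOT_FEATURES_SKETCH`, and the definition of `days_since_first_touch` used here are hypothetical stand-ins, not the project's API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FeatureSpecSketch:
    """Hypothetical stand-in; the real FeatureSpec fields may differ."""
    name: str
    dtype: str
    description: str


# Step 1: register the feature (stand-in for LEAD_SNAPSHOT_FEATURES).
SNAPSHOT_FEATURES_SKETCH = [
    FeatureSpecSketch(
        name="days_since_first_touch",
        dtype="float",
        description="Days from the first touch to the snapshot; missing if no touches.",
    ),
]


# Step 2: compute the value (stand-in for logic inside build_snapshot()).
def days_since_first_touch(touch_timestamps, snapshot_at) -> Optional[float]:
    # Only touches at or before the snapshot are visible.
    visible = [ts for ts in touch_timestamps if ts <= snapshot_at]
    if not visible:
        return None
    return (snapshot_at - min(visible)).total_seconds() / 86400.0


snapshot_at = datetime(2026, 1, 22)
touches = [datetime(2026, 1, 5), datetime(2026, 1, 10), datetime(2026, 2, 1)]

spec = SNAPSHOT_FEATURES_SKETCH[0]
row = {spec.name: days_since_first_touch(touches, snapshot_at)}
print(row)
```

Keeping the spec entry and the computation keyed by the same `name` is one way to keep the feature dictionary and the snapshot columns from drifting apart.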

---

## v4 Dataset Plan

The current focus is producing a v4 lead scoring intro dataset. See `docs/v4/` for:
- `lead_scoring_v4_requirements.md` — what v4 must achieve
- `dataset_contract.md` — schema contract and temporal gates
- `engine_changes_spec.md` — what changes in the engine
- `validation_spec.md` — automated validation checks
- `implementation_plan.md` — milestone breakdown

---

## Reference Docs
- Design decisions: `docs/leadforge_design_doc.md`
- Architecture/spec: `docs/leadforge_architecture_spec.md`
- Implementation roadmap: `docs/leadforge_implementation_plan.md`
- v4 dataset plan: `docs/v4/implementation_plan.md`