Merged
53 changes: 24 additions & 29 deletions .agent-plan.md
@@ -12,39 +12,35 @@

## Next Up — v4 Lead Scoring Dataset

The primary focus is producing a v4 lead scoring dataset that fixes the issues found in v1–v3 datasets. This requires targeted engine changes followed by dataset build scripts.
The primary focus is producing a v4 lead scoring dataset that fixes the issues found in v1–v3 datasets. This requires targeted engine changes + a build pipeline, followed by dataset release.

See `docs/v4/implementation_plan.md` for full details.
See `docs/v4/design.md` for full details.

### v4-M0: Requirements + planning ⬜ (this PR)
### v4-M0: Planning + spike ⬜ (this PR)

- [x] `docs/v4/lead_scoring_v4_requirements.md`
- [x] `docs/v4/dataset_contract.md`
- [x] `docs/v4/validation_spec.md`
- [x] `docs/v4/engine_changes_spec.md`
- [x] `docs/v4/implementation_plan.md`
- [x] Updated `CLAUDE.md` with repo map + generation commands
- [x] Updated `AGENTS.md` with v4 implementation guide
- [x] Updated `.agent-plan.md` (this file)
- [x] `docs/v4/design.md` — consolidated requirements, contract, engine changes, implementation plan
- [x] `docs/v4/validation_spec.md` — automated validation checks
- [x] `docs/v4/planning_pr_review.md` — self-review and treatment plan
- [x] `scripts/spike_category_signal.py` — spike experiment validating category signal approach
- [x] Updated `CLAUDE.md`, `AGENTS.md`, `.agent-plan.md`

### v4-M1: Engine — category signal + windowed snapshots
### v4-M1: Engine + build pipeline

- [ ] Add `category_effect_scale` to difficulty profiles
- [ ] Apply scale in `mechanisms/policies.py`
- [ ] Add `snapshot_day` parameter to `render/snapshots.py`
- [ ] Add new features: `touches_week_1`, `days_since_first_touch`, `expected_acv`
- [ ] Add new `FeatureSpec` entries to `schema/features.py`
- [ ] Tests for all changes
- [ ] Verify category spread ≥15% for key features at intro difficulty
Engine changes:
- [ ] Add `category_latent_correlations` to `difficulty_profiles.yaml` (intro profile)
- [ ] Apply correlations in `simulation/population.py` after initial latent sampling
- [ ] Add `snapshot_day` parameter to `render/snapshots.py` with windowed aggregation
- [ ] Add new features: `touches_week_1`, `days_since_first_touch`, `expected_acv`, `total_touches_all`
- [ ] Add `FeatureSpec` entries to `schema/features.py`
- [ ] Tests for all engine changes (backward compat + v4 mode)

### v4-M2: Build pipeline + validation ⬜

- [ ] `scripts/build_v4_snapshot.py` — day-21 snapshot + leakage trap + structured missingness
Build pipeline:
- [ ] `scripts/build_v4_snapshot.py` — day-21 snapshot + leakage trap + structured missingness + subsampling
- [ ] `scripts/validate_v4_dataset.py` — full validation per `docs/v4/validation_spec.md`
- [ ] Generate test dataset and verify all checks pass
- [ ] End-to-end: generate bundle → build CSV → validate → all checks pass
- [ ] LR AUC 0.65–0.90 (without trap); ≥0.03 boost with trap
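
The two AUC gates in the item above can be sketched with the standard library alone. This is an illustrative sketch, not the contents of `scripts/validate_v4_dataset.py`: the function names `roc_auc` and `auc_gates` are hypothetical, and the scores would in practice come from a logistic regression fitted with and without the leakage trap column.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_gates(auc_without_trap, auc_with_trap, low=0.65, high=0.90, min_boost=0.03):
    """Evaluate the two plan gates: baseline AUC in range, and trap boost large enough."""
    return {
        "auc_in_range": low <= auc_without_trap <= high,
        "trap_boost": auc_with_trap - auc_without_trap >= min_boost,
    }


# Toy data (made up for illustration): the trap scores separate the classes perfectly.
labels = [1, 1, 1, 0, 0, 0]
base_scores = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1]
trap_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

result = auc_gates(roc_auc(labels, base_scores), roc_auc(labels, trap_scores))
print(result)
```

The pairwise comparison is quadratic in the number of rows, which is acceptable for a dataset of about 1,000 rows; a rank-based implementation would be preferable for larger data.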

### v4-M3: Documentation + release ⬜
### v4-M2: Documentation + release ⬜

- [ ] Generate `lead_scoring_intro_v4.csv` (in datasets-private repo)
- [ ] Write `RELEASE_v4.md`
@@ -62,7 +58,7 @@ See `docs/v4/implementation_plan.md` for full details.
| M12: CLI `--json` flag | Deferred | No consumer needs it yet; add post-v4 |
| M12: CLI `--strict` flag | Deferred | Per-check control is better than global flag |
| M12: CLI help text polish | Deferred | Low priority vs dataset |
| M14: Sample bundle commit | Absorbed into v4-M3 | v4 dataset IS the sample |
| M14: Sample bundle commit | Absorbed into v4-M2 | v4 dataset IS the sample |
| M14: Notebook 1 (inspecting world) | Deferred | Do after v4 ships |
| M14: Notebook 2 (lead scoring baseline) | Deferred | v4 validation script covers this |
| M14: Notebook 3 (public vs instructor) | Discarded | No current audience |
@@ -83,11 +79,10 @@ See `docs/v4/implementation_plan.md` for full details.

## Context Pointers

- v4 requirements: `docs/v4/lead_scoring_v4_requirements.md`
- v4 dataset contract: `docs/v4/dataset_contract.md`
- v4 engine changes: `docs/v4/engine_changes_spec.md`
- v4 design (requirements, contract, engine changes, plan): `docs/v4/design.md`
- v4 validation spec: `docs/v4/validation_spec.md`
- v4 implementation plan: `docs/v4/implementation_plan.md`
- v4 self-review: `docs/v4/planning_pr_review.md`
- Spike experiment: `scripts/spike_category_signal.py`
- Existing roadmap: `docs/leadforge_implementation_plan.md`
- CLI commands: `leadforge/cli/commands/`
- Validation modules: `leadforge/validation/`
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -1,6 +1,6 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.4.5
rev: v0.11.13
hooks:
- id: ruff
args: [--fix]
73 changes: 2 additions & 71 deletions AGENTS.md
@@ -28,75 +28,6 @@ See CLAUDE.md for the full mandatory branch/PR workflow (branch → commit → u

---

## v4 Implementation Guide
## v4 Implementation

### What is v4?

A pedagogically improved lead scoring dataset (single CSV) for an intro ML course. The engine changes are small and targeted. See `docs/v4/` for full specs.

### Implementation order

```
v4-M0 (planning PR — already done)
  └── v4-M1: engine changes (category signal + windowed snapshots)
        └── v4-M2: build pipeline + validation scripts
              └── v4-M3: dataset generation + release docs
```

### Key files to modify per milestone

**v4-M1 (engine):**
- `leadforge/recipes/b2b_saas_procurement_v1/difficulty_profiles.yaml` — add `category_effect_scale`
- `leadforge/mechanisms/policies.py` — apply scale to categorical influences
- `leadforge/render/snapshots.py` — add `snapshot_day` param, windowed aggregation
- `leadforge/schema/features.py` — add new FeatureSpec entries
- Tests in `tests/mechanisms/` and `tests/render/`

**v4-M2 (build pipeline):**
- `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
- `scripts/validate_v4_dataset.py` (new) — dataset-level validation
- These live in the leadforge repo (not datasets-private)

**v4-M3 (release):**
- Work in `leadforge-datasets-private` repo
- `lead_scoring_intro/lead_scoring_intro_v4.csv`
- `lead_scoring_intro/RELEASE_v4.md`

### Coding conventions for v4

1. **Backward compatibility:** All engine changes must default to current behavior. New parameters must have defaults that produce identical output when unset.
2. **No simulation loop changes:** Do not modify the daily step logic in `engine.py`. v4 changes are in mechanism weights and snapshot rendering only.
3. **Temporal correctness:** Every feature computation must be explicitly gated by snapshot day. Use `event_timestamp <= lead_created_at + snapshot_day` — never `<`.
4. **Test coverage:** Every new parameter and feature must have unit tests. Test both `snapshot_day=None` (backward compat) and `snapshot_day=21` (v4 mode).
5. **Determinism:** All new stochastic operations must use seeded RNG. Verify with a determinism test (same seed → identical output).
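
Conventions 1, 3, and 4 can be illustrated with a minimal sketch. It assumes `snapshot_day` is measured in days from `lead_created_at`; the helper name `events_in_window` is hypothetical and is not the signature used in `render/snapshots.py`.

```python
from datetime import datetime, timedelta


def events_in_window(event_timestamps, lead_created_at, snapshot_day=None):
    """Return the event timestamps visible at the snapshot (hypothetical helper)."""
    if snapshot_day is None:
        # Backward-compatible default: no windowing, every event is kept.
        return list(event_timestamps)
    # Assumption: snapshot_day counts whole days after lead creation.
    cutoff = lead_created_at + timedelta(days=snapshot_day)
    # Inclusive comparison (<=), so an event exactly at the cutoff is kept.
    return [ts for ts in event_timestamps if ts <= cutoff]


created = datetime(2024, 1, 1)
events = [created + timedelta(days=d) for d in (0, 7, 21, 22, 60)]

v4_mode = events_in_window(events, created, snapshot_day=21)
legacy_mode = events_in_window(events, created)
print(len(v4_mode), len(legacy_mode))  # 3 5
```

The event on day 21 survives the gate because the comparison is inclusive, while the events on days 22 and 60 are excluded. Calling the helper without `snapshot_day` reproduces the unwindowed behavior, which is the pair of modes convention 4 asks the tests to cover.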

### Validation checklist for v4 dataset

Before declaring v4-M2 complete, the dataset must pass:

- [ ] 1,000 rows, 18 columns
- [ ] 30% conversion rate (±1%)
- [ ] No deterministic groups (n≥50 at 0% or 100% conversion)
- [ ] LR AUC 0.65–0.90 (without leakage trap)
- [ ] LR AUC boost ≥0.03 when leakage trap included
- [ ] `web_sessions` missingness: outbound rate > 3× inbound rate
- [ ] `seniority` missingness: partner_referral rate > 3× others
- [ ] Reproducible with seed 42
- [ ] `total_touches_all` uses full 90-day data (confirmed by AUC boost)
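
Two of the checks above, deterministic groups and structured missingness, can be sketched in plain Python. This is a hedged illustration rather than the validation script itself: the helper names are hypothetical, the grouping column `lead_source` is an assumed name (the checklist names the `outbound` and `inbound` values but not their column), and the rows are synthetic.

```python
from collections import defaultdict


def deterministic_groups(rows, column, target, min_n=50):
    """Return category values with at least min_n rows and 0% or 100% conversion."""
    counts = defaultdict(lambda: [0, 0])  # value -> [row count, conversions]
    for row in rows:
        stats = counts[row[column]]
        stats[0] += 1
        stats[1] += int(row[target])
    return sorted(v for v, (n, k) in counts.items() if n >= min_n and k in (0, n))


def missing_rate(rows, column, where):
    """Share of rows matching `where` whose `column` value is None."""
    subset = [r for r in rows if where(r)]
    if not subset:
        return 0.0
    return sum(r[column] is None for r in subset) / len(subset)


# Synthetic rows: outbound leads never convert and often lack web_sessions.
rows = []
for i in range(60):
    rows.append({"lead_source": "outbound",
                 "web_sessions": None if i % 2 == 0 else 3,
                 "converted_within_90_days": 0})
for i in range(60):
    rows.append({"lead_source": "inbound",
                 "web_sessions": None if i % 10 == 0 else 5,
                 "converted_within_90_days": int(i % 3 == 0)})

flagged = deterministic_groups(rows, "lead_source", "converted_within_90_days")
outbound = missing_rate(rows, "web_sessions", lambda r: r["lead_source"] == "outbound")
inbound = missing_rate(rows, "web_sessions", lambda r: r["lead_source"] == "inbound")
print(flagged, outbound > 3 * inbound)
```

In this toy data the outbound group is flagged as deterministic (60 rows, no conversions), so a real dataset with that pattern would fail the check even though its missingness ratio passes.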

### How to test engine changes locally

```bash
# Quick smoke test: generate a small bundle and inspect
leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 --difficulty intro --n-leads 1000 --out /tmp/test_bundle
leadforge validate /tmp/test_bundle

# Check category signal spread
python -c "
import pandas as pd
df = pd.read_parquet('/tmp/test_bundle/tasks/converted_within_90_days/train.parquet')
for col in ['role_function', 'seniority', 'estimated_revenue_band']:
    rates = df.groupby(col)['converted_within_90_days'].mean()
    print(f'{col}: spread={rates.max()-rates.min():.1%}')
"
```
For v4 dataset design, engine changes, validation spec, and implementation plan, see `docs/v4/design.md` and `docs/v4/validation_spec.md`.
8 changes: 3 additions & 5 deletions CLAUDE.md
@@ -349,16 +349,14 @@ Exception: deliberately included leakage traps (e.g., `total_touches_all` in v4)
## v4 Dataset Plan

The current focus is producing a v4 lead scoring intro dataset. See `docs/v4/` for:
- `lead_scoring_v4_requirements.md` — what v4 must achieve
- `dataset_contract.md` — schema contract and temporal gates
- `engine_changes_spec.md` — what changes in the engine
- `design.md` — requirements, contract, engine changes, implementation plan (single source of truth)
- `validation_spec.md` — automated validation checks
- `implementation_plan.md` — milestone breakdown
- `planning_pr_review.md` — self-review of the planning PR and treatment plan

---

## Reference Docs
- Design decisions: `docs/leadforge_design_doc.md`
- Architecture/spec: `docs/leadforge_architecture_spec.md`
- Implementation roadmap: `docs/leadforge_implementation_plan.md`
- v4 dataset plan: `docs/v4/implementation_plan.md`
- v4 dataset plan: `docs/v4/design.md`
72 changes: 0 additions & 72 deletions docs/v4/dataset_contract.md

This file was deleted.
