Plan v4 lead scoring dataset + leadforge engine roadmap #19
@@ -25,3 +25,78 @@ Do **not** leave threads unresolved after the commit is pushed.
## Branch & PR Conventions

See CLAUDE.md for the full mandatory branch/PR workflow (branch → commit → update `.agent-plan.md` → open PR).

---

## v4 Implementation Guide

### What is v4?

A pedagogically improved lead scoring dataset (single CSV) for an intro ML course. The engine changes are small and targeted. See `docs/v4/` for full specs.

### Implementation order

```
v4-M0 (planning PR — already done)
```

Suggested change:
- v4-M0 (planning PR — already done)
+ v4-M0: planning (this PR)

Copilot AI, Apr 29, 2026
This section states the v4 build/validation scripts live in the leadforge repo (not datasets-private), but CLAUDE.md’s “Related repos” + workflow example place these scripts in leadforge-datasets-private. Please align the docs on a single canonical home (or document how scripts are shared) to avoid contributors building in the wrong repo.
Suggested change:
- - `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
- - `scripts/validate_v4_dataset.py` (new) — dataset-level validation
- - These live in the leadforge repo (not datasets-private)
- **v4-M3 (release):**
- - Work in `leadforge-datasets-private` repo
+ - Work in `leadforge-datasets-private` repo
+ - `scripts/build_v4_snapshot.py` (new) — snapshot builder with missingness + leakage trap
+ - `scripts/validate_v4_dataset.py` (new) — dataset-level validation
+ **v4-M3 (release):**
+ - Continue in `leadforge-datasets-private` repo for dataset generation and release artifacts
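
The `build_v4_snapshot.py` script itself does not appear in this diff. As a rough illustration of what "missingness + leakage trap" could mean for a single-CSV snapshot, here is a minimal sketch. The function name, the column names, and the choice to blank values completely at random are all assumptions made for illustration, not the project's implementation.

```python
import random


def add_missingness_and_trap(rows, feature, label, rate, seed=0):
    """Return copies of `rows` with `feature` randomly blanked and a leaky column added.

    All names here are hypothetical; this is not the leadforge implementation.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's rows are left unchanged
        # Missingness: blank the feature with probability `rate`,
        # independently of every other value (missing completely at random).
        if rng.random() < rate:
            new_row[feature] = None
        # Leakage trap: a column derived directly from the label, which a
        # model could not know at prediction time.
        new_row["post_outcome_flag"] = int(bool(row[label]))
        out.append(new_row)
    return out


rows = [{"visits": 3, "converted": 1}, {"visits": 7, "converted": 0}]
noisy = add_missingness_and_trap(rows, "visits", "converted", rate=0.5, seed=42)
```

A validation script could then check that the trap column is flagged in the feature dictionary, so that students who drop it see a realistic score and students who keep it see a suspiciously perfect one.
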

@@ -212,7 +212,153 @@ Key abstractions: `Recipe`, `GenerationConfig`, `WorldSpec`, `WorldBundle`, `Exp

---

## Repository Map

```
leadforge/  # Python package root
├── api/  # Public API: Generator, Recipe, Bundle
│   ├── generator.py  # Generator.from_recipe() → .generate() → WorldBundle
│   ├── recipes.py  # Recipe loading, config resolution
│   └── bundle.py  # write_bundle() orchestrator
├── cli/  # Click CLI
│   ├── main.py  # CLI entry point
│   └── commands/  # generate, inspect, validate, list_recipes
├── core/  # Foundational utilities
│   ├── rng.py  # RNGRoot with named substreams
│   ├── ids.py  # Deterministic ID generation (acct_000001, etc.)
│   ├── models.py  # GenerationConfig, WorldSpec, WorldBundle
│   ├── enums.py  # ExposureMode, DifficultyProfile
│   └── exceptions.py  # Custom exception hierarchy
├── narrative/  # Vertical narrative (company, market, personas)
│   ├── spec.py  # NarrativeSpec and sub-spec dataclasses
│   └── dataset_card.py  # Markdown dataset card renderer
├── schema/  # Relational data model
│   ├── entities.py  # 9 entity row dataclasses (AccountRow, LeadRow, etc.)
│   ├── features.py  # LEAD_SNAPSHOT_FEATURES — canonical feature spec
│   ├── relationships.py  # FK constraints (ALL_CONSTRAINTS)
│   ├── tasks.py  # SplitSpec, TaskManifest, CONVERTED_WITHIN_90_DAYS
│   └── dictionaries.py  # Feature dictionary CSV writer
├── structure/  # Hidden world graph
│   ├── graph.py  # WorldGraph (DAG wrapper)
│   ├── motifs.py  # 5 motif families
│   ├── rewiring.py  # Stochastic graph perturbation
│   └── sampler.py  # sample_hidden_graph()
├── mechanisms/  # Node/edge behavior
│   ├── policies.py  # assign_mechanisms() — motif → MechanismAssignment
│   ├── hazards.py  # ConversionHazard
│   ├── transitions.py  # StageSequence, HazardTransition
│   ├── counts.py  # PoissonIntensity, RecencyDecayIntensity
│   ├── categorical.py  # CategoricalInfluence, CHANNEL_QUALITY_SCORES
│   └── scores.py  # LatentScore
├── simulation/  # World evolution
│   ├── engine.py  # simulate_world() — 90-day daily loop
│   ├── state.py  # LeadSimState (per-lead mutable state)
│   └── population.py  # build_population() — accounts, contacts, leads
├── render/  # Bundle output
│   ├── snapshots.py  # build_snapshot() — ML-ready lead table
│   ├── relational.py  # to_dataframes() — 9-table dict
│   ├── tasks.py  # write_task_splits() — train/valid/test Parquet
│   └── manifests.py  # build_manifest(), write_manifest()
├── exposure/  # Truth filtering
│   ├── modes.py  # apply_exposure() dispatch
│   ├── metadata.py  # write_metadata_dir() for instructor mode
│   └── filters.py  # BundleFilter, FILTERS dict
├── validation/  # Bundle quality checks
│   ├── bundle_checks.py  # validate_bundle() orchestrator
│   ├── invariants.py  # Determinism + exposure monotonicity
│   ├── realism.py  # Conversion rates, feature ranges, stage diversity
│   ├── difficulty.py  # Known difficulty profile validation
│   └── drift.py  # Cross-seed stability
└── recipes/  # Recipe definitions
    └── b2b_saas_procurement_v1/
        ├── recipe.yaml  # Recipe metadata + defaults
        ├── narrative.yaml  # Company, product, market, personas, funnel
        └── difficulty_profiles.yaml  # intro/intermediate/advanced
```
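
The map names `core/rng.py` (named substreams) and `core/ids.py` (IDs such as `acct_000001`) without showing either module. The sketch below illustrates both ideas under stated assumptions: IDs are taken to be a prefix plus a zero-padded counter, and each substream is seeded from a stable hash of the root seed and the stream name. `make_id` and `NamedRng` are hypothetical names, not the leadforge API.

```python
import hashlib
import random


def make_id(prefix: str, index: int, width: int = 6) -> str:
    """Format a sequential, zero-padded ID such as acct_000001 (assumed scheme)."""
    if index < 1:
        raise ValueError("index must be >= 1")
    return f"{prefix}_{index:0{width}d}"


class NamedRng:
    """Hypothetical stand-in for a root RNG that hands out named substreams."""

    def __init__(self, seed: int) -> None:
        self.seed = seed

    def substream(self, name: str) -> random.Random:
        # Derive the child seed from (root seed, name) with SHA-256 so the
        # result is stable across runs and independent of call order.
        # Python's built-in hash() is salted per process, so it is avoided.
        digest = hashlib.sha256(f"{self.seed}:{name}".encode("utf-8")).digest()
        return random.Random(int.from_bytes(digest[:8], "big"))


root = NamedRng(42)
population_rng = root.substream("population")
simulation_rng = root.substream("simulation")
first_account = make_id("acct", 1)
```

Keying each stream by name means that adding a new consumer of randomness does not shift the draws seen by existing ones, which is one common way to keep generated bundles reproducible for a given seed.
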

### Related repos

- **leadforge-datasets-private** — generated dataset archive
  - `b2b_saas_procurement_v1__intro__seed42/` — full relational bundle
  - `lead_scoring_intro/` — simplified single-CSV versions (v1–v4)
  - `scripts/` — build and validation scripts for simplified CSVs

Suggested change:
- - `scripts/` — build and validation scripts for simplified CSVs
+ - Does **not** serve as the source of truth for simplified-CSV build/validation scripts; those live in the **leadforge** repo.

Copilot AI, Apr 29, 2026
This workflow example runs `python scripts/build_v4_snapshot.py ...` “in leadforge-datasets-private”, which conflicts with AGENTS.md guidance that these scripts live in the leadforge repo. Please reconcile where contributors should implement/run these scripts to prevent drift between repos.
v4-M0 is labeled as ⬜, but every deliverable underneath is checked off. Consider marking v4-M0 as complete (✅/done) once this PR lands, or changing the wording to indicate it’s “in review” to keep status signaling consistent.