CLAUDE.md — leadforge

Branch & PR Workflow (mandatory)

Never push directly to main. Every piece of work — feature, bugfix, doc update, plan update — follows this sequence:

git checkout main && git pull — ensure main is up to date.
git checkout -b <descriptive-branch-name> — branch from latest main.
Do the work; commit to the branch.
Update .agent-plan.md to reflect project state after the PR merges; commit that update to the same branch (same PR).
Open a PR against main on GitHub with a detailed description.
Apply the appropriate labels to the PR (create new ones if none fit — see label taxonomy below).
Assign the PR to the appropriate milestone (create a new one on GitHub if none fits).

Never use git push origin main, git push --force origin main, or any variant that targets main directly.

Team enforcement: The above is reinforced by GitHub branch protection on main. The local .git/hooks/pre-push hook installed in this repo is a personal convenience only — it is not versioned and will not be present for other contributors.

Label taxonomy

Type (one required): type: feature · type: bugfix · type: docs · type: test · type: refactor · type: ci · type: chore

Layer (one or more, when touching package code): layer: core · layer: narrative · layer: schema · layer: structure · layer: mechanisms · layer: simulation · layer: render · layer: exposure · layer: validation · layer: cli · layer: api · layer: recipes

Status (optional): status: in progress · status: needs review · status: blocked

Existing labels that predate this taxonomy: bug · documentation · enhancement · good first issue · help wanted · foundation — use when appropriate.

Milestone map

Milestone	Covers	Roadmap
v0.1.0 — Repo & CLI skeleton	M0	Foundation, CI, package scaffold
v0.2.0 — First end-to-end world	M1–M3	Config/recipe, narrative, schema
v0.3.0 — Motif variability + exposure modes	M4–M6	Structure, mechanisms, exposure
v0.4.0 — Polished relational output + task export	M7–M10	Simulation, observation, render, task
v0.5.0 — CLI-complete release candidate	M11–M13	CLI, validation harness
v1.0.0 — Polished OSS release	M14–M15	Sample data, notebooks, docs polish

If work spans multiple milestones, assign to the earliest one it unblocks.

Project Identity

Package / repo / CLI: leadforge
License: MIT
Purpose: opinionated Python framework + CLI for generating synthetic CRM/funnel datasets from simulated commercial worlds
v1 vertical: mid-market procurement / AP automation SaaS
Primary v1 task: converted_within_90_days

Tech Stack

Concern	Choice
Language	Python 3.11+
Linting / formatting	Ruff
Type checking	mypy or pyright
Testing	pytest
Pre-commit	pre-commit hooks
CI	GitHub Actions
Tabular data	pandas + pyarrow / Parquet
Graph internals	networkx.DiGraph
Config models	dataclasses or Pydantic
CLI	(choose at M0 — typer or click)
File format (tables)	Parquet (canonical); CSV optional later
File format (metadata)	JSON
File format (narrative)	Markdown
File format (graph)	GraphML + JSON

CLI Commands

leadforge list-recipes
leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 --mode student_public --difficulty intermediate --n-leads 5000 --out ./out/demo_bundle
leadforge inspect ./out/demo_bundle
leadforge validate ./out/demo_bundle

Dev Commands

pip install -e ".[dev]"          # editable install with dev deps
pytest                            # run all tests
ruff check .                      # lint
ruff format .                     # format
mypy leadforge/                   # type check
pre-commit run --all-files        # pre-commit suite

Architectural Invariants

Generation

All generation is deterministic given (recipe, config, seed, version).
All stochastic components derive from a single seeded RNG root; substreams must be derived deterministically.
External API calls are never required — always optional behind extras.

Data model

Internal world is relational-first. Flat ML exports are derived products.
Use typed dataclasses/models for all config, recipe, world-spec, manifest, and task-manifest objects. No ad hoc dicts at boundaries.
Entity IDs are stable, opaque strings (e.g., acct_000001, lead_000001), unique within namespace and deterministic per run.

Schema

Tables: accounts, contacts, leads, touches, sessions, sales_activities, opportunities, customers, subscriptions.
Primary task table rows = one lead snapshot, anchored at snapshot time.
No flat feature may use events occurring after the snapshot anchor (leakage rule, non-negotiable).

Hidden world

World structure varies via named motif/template families + stochastic rewiring — never a single fixed DGP or unconstrained random graph.
Required v1 motif families: fit-dominant, intent-dominant, sales-execution-sensitive, demo/trial-mediated, buying-committee-friction.
Graph must be a DAG (acyclic). Validate on construction.

Truth exposure

Filtering happens during rendering/publication, not during simulation.
student_public mode: excludes latent registry, full world spec, mechanism summary, rich hidden graph.
research_instructor mode: full truth — hidden graph, world spec, latent registry, mechanism summary, provenance.
ExposureMode enum is central, not ad hoc strings scattered through rendering code.

Output bundle

bundle_root/
  manifest.json          # required in all modes
  dataset_card.md        # required in all modes
  feature_dictionary.csv # required in all modes
  tables/                # relational Parquet tables
  tasks/converted_within_90_days/{train,valid,test}.parquet + task_manifest.json
  metadata/              # exposure-mode filtered

manifest.json must include: package version, recipe id, seed, generation timestamp, exposure mode, difficulty profile, table inventory with row counts, file hashes, bundle_schema_version.

LTV

Customer/subscription entities exist in v1 internals and may appear in relational outputs.
LTV labels are not first-class task outputs in v1.

Simulation

v1 uses hybrid discrete-time simulator (daily steps, 90-day horizon for primary task).
converted_within_90_days is event-derived (not a directly sampled Bernoulli).

Package Layout (canonical)

leadforge/
  api/            generator.py, recipes.py, bundle.py
  cli/            main.py, commands/{generate,list_recipes,inspect,validate}.py
  core/           rng.py, ids.py, time.py, enums.py, models.py, exceptions.py, ...
  narrative/      spec.py, company.py, product.py, personas.py, market.py, funnel.py, dataset_card.py
  schema/         entities.py (EntityRowProtocol, make_empty_dataframe, AccountRow — shared primitives),
                  features.py (FeatureSpec), relationships.py (FKConstraint, validate_fk),
                  tasks.py (SplitSpec, TaskManifest), dictionaries.py, tables.py
  schemes/        base.py (GenerationScheme protocol + SCHEME_REGISTRY);
                  lead_scoring/ — the lead-scoring scheme: __init__.py (build_world/
                  write_bundle) + simulation/, mechanisms/, structure/, render/
                  (moved in LTV-Pf.1/Pf.2);
                  lifecycle/ — the pLTV scheme (stub): entities.py, relationships.py
                  (scaffolded in LTV-Pg.1).  Lead-scoring schema specs migrate
                  under lead_scoring/ in LTV-Pg.2.  See docs/ltv/design.md §2.5.
  render/         relational_io.py (write_relational_tables — shared writer), manifests.py
                  # shared bundle-output envelope
  exposure/       modes.py, filters.py, redaction.py
  validation/     invariants.py, artifact_checks.py, realism.py, difficulty.py, drift.py
  recipes/        registry.py, b2b_saas_procurement_v1/{recipe,narrative,schema,motifs,difficulty_profiles}.yaml
  examples/       notebooks/, configs/
  sample_data/    public/, instructor/

Public API Contract (high-level)

from leadforge.api import Generator, list_recipes

gen = Generator.from_recipe("b2b_saas_procurement_v1", seed=42, exposure_mode="student_public")
bundle = gen.generate(n_accounts=1500, n_contacts=4200, n_leads=5000, difficulty="intermediate")
bundle.save("./out/procurement_world_001")

Key abstractions: Recipe, GenerationConfig, WorldSpec, WorldBundle, ExposureMode.

Config Precedence (highest → lowest)

Explicit function args / CLI flags
User override YAML/JSON file (--override)
Recipe defaults
Package defaults

Commit and PR Conventions

Small-to-medium PRs: ~300–900 lines of meaningful diff.
One logical capability per PR; tests included.
PR title describes capability, not file list.
Tests required for: config parsing, recipe loading, RNG determinism, graph validation, mechanism behavior, serialization, CLI arg parsing.
Property tests required for: graph acyclicity, FK integrity, deterministic output under same seed, exposure filtering monotonicity.

Hard Constraints — Do Not Violate

Never use a single fixed hidden world (DGP must vary by motif family + rewiring).
Never leak post-snapshot-anchor data into flat task features.
Never publish public relational tables that allow label reconstruction via joins. Public relational exports must be snapshot-safe: every *_timestamp column in event tables (touches.touch_timestamp, sessions.session_timestamp, sales_activities.activity_timestamp) must satisfy <= lead_created_at + snapshot_day; opportunities must be filtered by created_at <= lead_created_at + snapshot_day; no terminal-state fields (close_outcome, closed_at, converted_within_90_days, conversion_timestamp) in public leads/opportunities; no conversion-conditional entities (customers, subscriptions) in public bundles.
(lifecycle / b2b_saas_ltv_v1 scheme) The public relational export is snapshot-safe against the absolute observation_date cutoff: every timestamp column in the public event tables (subscription_events.event_timestamp, health_signals.period_start, invoices.invoice_date) must satisfy <= observation_date; the public subscriptions table drops all stateful/terminal columns (subscription_status, current_mrr, renewal_count, expansion_count, subscription_end_at, churn_at, churn_reason), keeping only the at-signing identity (subscription_id, customer_id, plan_name, subscription_start_at, contract_term_months); no pLTV target (ltv_revenue_*) or churn label appears in any public relational table. Each task split carries only its own target (no cross-target leakage); the mrr_change_full_period trap is deliberately retained in all modes. The early-pLTV (tenure-anchored) task family is omitted from student_public bundles — its forward window precedes observation_date, so its targets would be reconstructible by joining the public event tables; it ships in research_instructor only. The calendar-anchored family is published (its targets fall after observation_date).
Never require external APIs for core generation.
Never publish hidden truth in student_public mode.
Never derive converted_within_90_days as a directly sampled label; it must emerge from simulated events.
Never skip schema versioning in manifest.json.
Do not add LTV labels as first-class task outputs in v1.

Repository Map

leadforge/                    # Python package root
├── api/                      # Public API: Generator, Recipe, Bundle
│   ├── generator.py          # Generator.from_recipe() → .generate() → WorldBundle
│   ├── recipes.py            # Recipe loading, config resolution
│   └── bundle.py             # write_bundle() orchestrator
├── cli/                      # Click CLI
│   ├── main.py               # CLI entry point
│   └── commands/             # generate, inspect, validate, list_recipes
├── core/                     # Foundational utilities
│   ├── rng.py                # RNGRoot with named substreams
│   ├── ids.py                # Deterministic ID generation (acct_000001, etc.)
│   ├── models.py             # GenerationConfig, WorldSpec, WorldBundle
│   ├── enums.py              # ExposureMode, DifficultyProfile
│   └── exceptions.py         # Custom exception hierarchy
├── narrative/                # Vertical narrative (company, market, personas)
│   ├── spec.py               # NarrativeSpec and sub-spec dataclasses
│   └── dataset_card.py       # Markdown dataset card renderer
├── schema/                   # Relational data model
│   ├── entities.py           # 9 entity row dataclasses (AccountRow, LeadRow, etc.)
│   ├── features.py           # LEAD_SNAPSHOT_FEATURES — canonical feature spec
│   ├── relationships.py      # FK constraints (ALL_CONSTRAINTS)
│   ├── tasks.py              # SplitSpec, TaskManifest, CONVERTED_WITHIN_90_DAYS
│   └── dictionaries.py       # Feature dictionary CSV writer
├── schemes/                  # Generation schemes (peer pipelines) + registry
│   ├── base.py               # GenerationScheme protocol + SCHEME_REGISTRY
│   ├── lead_scoring/         # The lead-scoring scheme (LeadScoringScheme)
│   │   ├── __init__.py       # build_world() + write_bundle()
│   │   ├── structure/        # Hidden world graph (WorldGraph, motifs, sampler)
│   │   ├── mechanisms/       # Node/edge behavior (policies, hazards, scores, …)
│   │   ├── simulation/       # World evolution (engine, population, state)
│   │   └── render/           # Lead-scoring render: snapshots, relational
│   │                         #   (to_dataframes), relational_snapshot_safe, tasks
│   └── lifecycle/            # The pLTV scheme (LifecycleScheme — stub until M3–M6)
│       ├── __init__.py       # registers the stub scheme
│       ├── entities.py       # lifecycle rows + LIFECYCLE_ROW_TYPES
│       └── relationships.py  # LIFECYCLE_CONSTRAINTS
│   # NOTE (LTV-M2 reorg in progress): lead-scoring schema specs split in LTV-Pg.2.
│   # See docs/ltv/design.md §2.5 for the target layout.
├── render/                   # Shared bundle-output envelope
│   ├── relational_io.py      # write_relational_tables() — shared table writer
│   └── manifests.py          # build_manifest(), write_manifest()
├── exposure/                 # Truth filtering
│   ├── modes.py              # apply_exposure() dispatch
│   ├── metadata.py           # write_metadata_dir() for instructor mode
│   └── filters.py            # BundleFilter, FILTERS dict
├── validation/               # Bundle quality checks
│   ├── bundle_checks.py      # validate_bundle() orchestrator
│   ├── invariants.py         # Determinism + exposure monotonicity
│   ├── realism.py            # Conversion rates, feature ranges, stage diversity
│   ├── difficulty.py         # Known difficulty profile validation
│   └── drift.py              # Cross-seed stability
└── recipes/                  # Recipe definitions
    └── b2b_saas_procurement_v1/
        ├── recipe.yaml       # Recipe metadata + defaults
        ├── narrative.yaml    # Company, product, market, personas, funnel
        └── difficulty_profiles.yaml  # intro/intermediate/advanced

Related repos

leadforge-datasets-private — generated dataset archive
- b2b_saas_procurement_v1__intro__seed42/ — full relational bundle
- lead_scoring_intro/ — simplified single-CSV versions (v1–v4)
- scripts/ — build and validation scripts for simplified CSVs

Generation Workflow

Generate a full bundle

leadforge generate \
  --recipe b2b_saas_procurement_v1 \
  --seed 42 \
  --mode student_public \
  --difficulty intro \
  --n-leads 5000 \
  --out ./out/bundle

Build a simplified CSV (v4 example)

# In leadforge-datasets-private repo:
python scripts/build_v4_snapshot.py /path/to/bundle lead_scoring_intro/lead_scoring_intro_v4.csv

Validate a simplified CSV

python scripts/validate_v4_dataset.py lead_scoring_intro/lead_scoring_intro_v4.csv

Validate a full bundle

leadforge validate ./out/bundle

student_public Mode Invariants

These are non-negotiable for any dataset published in student_public mode:

No post-snapshot features — all features computed from events ≤ snapshot day only.
No outcome-stage columns — current_stage, funnel_stage with closed_won/closed_lost are banned.
No deterministic single-feature mapping — for any feature value with n≥50, conversion rate must be in [2%, 98%].
No hidden truth — latent scores, mechanism parameters, world graph not included.
No direct outcome columns — conversion_timestamp, close_outcome are banned.
No zero-variance features — every included feature must have ≥2 distinct values.

Exception: deliberately included leakage traps (e.g., total_touches_all in v4) must be clearly documented in release notes and feature dictionary.

How to Add New Features to the Snapshot

Add a FeatureSpec entry to LEAD_SNAPSHOT_FEATURES in leadforge/schema/features.py.
Compute the feature value in build_snapshot() in leadforge/render/snapshots.py.
If the feature needs new event data, add it to the simulation loop in leadforge/simulation/engine.py.
Update leadforge/schema/dictionaries.py if the feature dictionary format changes.
Run pytest and leadforge validate on a generated bundle.
Update the feature dictionary CSV description.

v4 Dataset Plan

The current focus is producing a v4 lead scoring intro dataset. See docs/v4/ for:

design.md — requirements, contract, engine changes, implementation plan (single source of truth)
validation_spec.md — automated validation checks
planning_pr_review.md — self-review of the planning PR and treatment plan

Reference Docs

LTV workstream (next, active planning): docs/ltv/design.md + docs/ltv/roadmap.md
Design decisions: docs/leadforge_design_doc.md
Architecture/spec: docs/leadforge_architecture_spec.md
Implementation roadmap: docs/leadforge_implementation_plan.md
v4 dataset plan: docs/v4/design.md
v1 dataset release roadmap (active): docs/release/v1_release_roadmap.md
v1 release design: docs/release/v1_release_design.md
v1 acceptance gates: docs/release/v1_acceptance_gates.md
Post-v1 roadmap: docs/release/post_v1_roadmap.md
External review synthesis: docs/external_review/summaries/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLAUDE.md — leadforge

Branch & PR Workflow (mandatory)

Label taxonomy

Milestone map

Project Identity

Tech Stack

CLI Commands

Dev Commands

Architectural Invariants

Generation

Data model

Schema

Hidden world

Truth exposure

Output bundle

LTV

Simulation

Package Layout (canonical)

Public API Contract (high-level)

Config Precedence (highest → lowest)

Commit and PR Conventions

Hard Constraints — Do Not Violate

Repository Map

Related repos

Generation Workflow

Generate a full bundle

Build a simplified CSV (v4 example)

Validate a simplified CSV

Validate a full bundle

student_public Mode Invariants

How to Add New Features to the Snapshot

v4 Dataset Plan

Reference Docs

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

CLAUDE.md — leadforge

Branch & PR Workflow (mandatory)

Label taxonomy

Milestone map

Project Identity

Tech Stack

CLI Commands

Dev Commands

Architectural Invariants

Generation

Data model

Schema

Hidden world

Truth exposure

Output bundle

LTV

Simulation

Package Layout (canonical)

Public API Contract (high-level)

Config Precedence (highest → lowest)

Commit and PR Conventions

Hard Constraints — Do Not Violate

Repository Map

Related repos

Generation Workflow

Generate a full bundle

Build a simplified CSV (v4 example)

Validate a simplified CSV

Validate a full bundle

student_public Mode Invariants

How to Add New Features to the Snapshot

v4 Dataset Plan

Reference Docs