CLAUDE.md — leadforge

Branch & PR Workflow (mandatory)

Never push directly to main. Every piece of work — feature, bugfix, doc update, plan update — follows this sequence:

  1. git checkout main && git pull — ensure main is up to date.
  2. git checkout -b <descriptive-branch-name> — branch from latest main.
  3. Do the work; commit to the branch.
  4. Update .agent-plan.md to reflect project state after the PR merges; commit that update to the same branch (same PR).
  5. Open a PR against main on GitHub with a detailed description.
  6. Apply the appropriate labels to the PR (create new ones if none fit — see label taxonomy below).
  7. Assign the PR to the appropriate milestone (create a new one on GitHub if none fits).

Never use git push origin main, git push --force origin main, or any variant that targets main directly.

Team enforcement: The above is reinforced by GitHub branch protection on main. The local .git/hooks/pre-push hook installed in this repo is a personal convenience only — it is not versioned and will not be present for other contributors.

Label taxonomy

Type (one required): type: feature · type: bugfix · type: docs · type: test · type: refactor · type: ci · type: chore

Layer (one or more, when touching package code): layer: core · layer: narrative · layer: schema · layer: structure · layer: mechanisms · layer: simulation · layer: render · layer: exposure · layer: validation · layer: cli · layer: api · layer: recipes

Status (optional): status: in progress · status: needs review · status: blocked

Existing labels that predate this taxonomy: bug · documentation · enhancement · good first issue · help wanted · foundation — use when appropriate.

Milestone map

| Milestone | Covers | Roadmap |
| --- | --- | --- |
| v0.1.0 — Repo & CLI skeleton | M0 | Foundation, CI, package scaffold |
| v0.2.0 — First end-to-end world | M1–M3 | Config/recipe, narrative, schema |
| v0.3.0 — Motif variability + exposure modes | M4–M6 | Structure, mechanisms, exposure |
| v0.4.0 — Polished relational output + task export | M7–M10 | Simulation, observation, render, task |
| v0.5.0 — CLI-complete release candidate | M11–M13 | CLI, validation harness |
| v1.0.0 — Polished OSS release | M14–M15 | Sample data, notebooks, docs polish |

If work spans multiple milestones, assign to the earliest one it unblocks.


Project Identity

  • Package / repo / CLI: leadforge
  • License: MIT
  • Purpose: opinionated Python framework + CLI for generating synthetic CRM/funnel datasets from simulated commercial worlds
  • v1 vertical: mid-market procurement / AP automation SaaS
  • Primary v1 task: converted_within_90_days

Tech Stack

| Concern | Choice |
| --- | --- |
| Language | Python 3.11+ |
| Linting / formatting | Ruff |
| Type checking | mypy or pyright |
| Testing | pytest |
| Pre-commit | pre-commit hooks |
| CI | GitHub Actions |
| Tabular data | pandas + pyarrow / Parquet |
| Graph internals | networkx.DiGraph |
| Config models | dataclasses or Pydantic |
| CLI | (choose at M0: typer or click) |
| File format (tables) | Parquet (canonical); CSV optional later |
| File format (metadata) | JSON |
| File format (narrative) | Markdown |
| File format (graph) | GraphML + JSON |

CLI Commands

leadforge list-recipes
leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 --mode student_public --difficulty intermediate --n-leads 5000 --out ./out/demo_bundle
leadforge inspect ./out/demo_bundle
leadforge validate ./out/demo_bundle

Dev Commands

pip install -e ".[dev]"          # editable install with dev deps
pytest                            # run all tests
ruff check .                      # lint
ruff format .                     # format
mypy leadforge/                   # type check
pre-commit run --all-files        # pre-commit suite

Architectural Invariants

Generation

  • All generation is deterministic given (recipe, config, seed, version).
  • All stochastic components derive from a single seeded RNG root; substreams must be derived deterministically.
  • External API calls are never required — always optional behind extras.
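One way to satisfy the single-root rule is to derive each substream seed by hashing the root seed together with a stream name. The sketch below assumes that approach; `SketchRngRoot` and the stream names are hypothetical and are not the contents of `core/rng.py`.

```python
import hashlib
import random


class SketchRngRoot:
    """Hypothetical stand-in for a seeded RNG root (not leadforge's actual class)."""

    def __init__(self, seed: int) -> None:
        self.seed = seed

    def substream(self, name: str) -> random.Random:
        # Hash (seed, name) so each named substream gets a stable seed that
        # does not depend on the order in which substreams are requested.
        digest = hashlib.sha256(f"{self.seed}:{name}".encode("utf-8")).digest()
        return random.Random(int.from_bytes(digest[:8], "big"))


root_a = SketchRngRoot(seed=42)
root_b = SketchRngRoot(seed=42)

# Request the substreams in different orders; draws still match name for name.
touches_a = root_a.substream("touches").random()
accounts_a = root_a.substream("accounts").random()
accounts_b = root_b.substream("accounts").random()
touches_b = root_b.substream("touches").random()
```

Because each substream depends only on the seed and its name, adding a new stochastic component later should not shift the draws of existing ones.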

Data model

  • Internal world is relational-first. Flat ML exports are derived products.
  • Use typed dataclasses/models for all config, recipe, world-spec, manifest, and task-manifest objects. No ad hoc dicts at boundaries.
  • Entity IDs are stable, opaque strings (e.g., acct_000001, lead_000001), unique within namespace and deterministic per run.
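As an illustration of the ID rule, a per-namespace counter with zero padding reproduces the `acct_000001` shape. `SketchIdFactory` and the six-digit width are assumptions read off the examples above, not the actual `core/ids.py` implementation.

```python
class SketchIdFactory:
    """Hypothetical ID factory: one zero-padded counter per namespace prefix."""

    def __init__(self, width: int = 6) -> None:
        self._width = width
        self._counters: dict[str, int] = {}

    def next_id(self, prefix: str) -> str:
        # Counters are independent per prefix, so IDs are unique within a
        # namespace and depend only on the order of creation in that namespace.
        self._counters[prefix] = self._counters.get(prefix, 0) + 1
        return f"{prefix}_{self._counters[prefix]:0{self._width}d}"


ids = SketchIdFactory()
first_account = ids.next_id("acct")
second_account = ids.next_id("acct")
first_lead = ids.next_id("lead")
```

IDs stay deterministic per run only if entities are created in a deterministic order, which ties this back to the seeded RNG rule.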

Schema

  • Tables: accounts, contacts, leads, touches, sessions, sales_activities, opportunities, customers, subscriptions.
  • Primary task table rows = one lead snapshot, anchored at snapshot time.
  • No flat feature may use events occurring after the snapshot anchor (leakage rule, non-negotiable).
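The leakage rule can be read as: compute the anchor once, then let every feature see only events at or before it. A minimal sketch follows, assuming the anchor is `lead_created_at + snapshot_day` (as in the hard constraints later in this file); the function name and sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def touches_before_anchor(
    touch_times: list[datetime],
    lead_created_at: datetime,
    snapshot_day: int,
) -> int:
    """Count touches at or before the snapshot anchor; later touches are ignored."""
    anchor = lead_created_at + timedelta(days=snapshot_day)
    return sum(1 for t in touch_times if t <= anchor)


created = datetime(2024, 1, 1)
touches = [
    datetime(2024, 1, 3),
    datetime(2024, 1, 10),
    datetime(2024, 1, 20),  # after the day-14 anchor, must not count
    datetime(2024, 2, 5),   # after the day-14 anchor, must not count
]
n_touches = touches_before_anchor(touches, created, snapshot_day=14)
```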

Hidden world

  • World structure varies via named motif/template families + stochastic rewiring — never a single fixed DGP or unconstrained random graph.
  • Required v1 motif families: fit-dominant, intent-dominant, sales-execution-sensitive, demo/trial-mediated, buying-committee-friction.
  • Graph must be a DAG (acyclic). Validate on construction.
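Validation on construction can be as small as attempting a topological sort and failing loudly on a cycle. The sketch below uses the standard-library `graphlib` instead of `networkx` to stay self-contained; `validate_dag` and the node names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dag(edges: list[tuple[str, str]]) -> list[str]:
    """Return a topological order of the nodes, or raise ValueError on a cycle."""
    predecessors: dict[str, set[str]] = {}
    for parent, child in edges:
        predecessors.setdefault(parent, set())
        predecessors.setdefault(child, set()).add(parent)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        # graphlib stores the offending cycle as the second element of args.
        raise ValueError(f"hidden graph contains a cycle: {exc.args[1]}") from exc


edges = [
    ("account_fit", "intent"),
    ("intent", "demo_booked"),
    ("demo_booked", "converted"),
]
order = validate_dag(edges)
```

With `networkx.DiGraph`, `networkx.is_directed_acyclic_graph` gives the equivalent check.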

Truth exposure

  • Filtering happens during rendering/publication, not during simulation.
  • student_public mode: excludes latent registry, full world spec, mechanism summary, rich hidden graph.
  • research_instructor mode: full truth — hidden graph, world spec, latent registry, mechanism summary, provenance.
  • ExposureMode enum is central, not ad hoc strings scattered through rendering code.
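A sketch of what "central enum, filtering at publication time" might look like. The enum members mirror the two modes named above, but this definition, the `HIDDEN_TRUTH_ARTIFACTS` set, and the artifact keys are illustrative assumptions, not the code in `core/enums.py` or `exposure/`.

```python
from enum import Enum


class ExposureMode(str, Enum):
    STUDENT_PUBLIC = "student_public"
    RESEARCH_INSTRUCTOR = "research_instructor"


# Hypothetical artifact keys for the hidden-truth items listed above.
HIDDEN_TRUTH_ARTIFACTS = frozenset(
    {"latent_registry", "world_spec", "mechanism_summary", "hidden_graph"}
)


def visible_artifacts(artifacts: dict[str, object], mode: ExposureMode) -> dict[str, object]:
    """Filter already-simulated artifacts at publication time."""
    if mode is ExposureMode.RESEARCH_INSTRUCTOR:
        return dict(artifacts)
    return {k: v for k, v in artifacts.items() if k not in HIDDEN_TRUTH_ARTIFACTS}


artifacts = {"dataset_card": "...", "hidden_graph": "...", "latent_registry": "..."}
# Parsing the mode once at the boundary keeps raw strings out of rendering code.
public = visible_artifacts(artifacts, ExposureMode("student_public"))
instructor = visible_artifacts(artifacts, ExposureMode.RESEARCH_INSTRUCTOR)
```

The public result is always a subset of the instructor result, which is the monotonicity property the property tests are meant to check.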

Output bundle

bundle_root/
  manifest.json          # required in all modes
  dataset_card.md        # required in all modes
  feature_dictionary.csv # required in all modes
  tables/                # relational Parquet tables
  tasks/converted_within_90_days/{train,valid,test}.parquet + task_manifest.json
  metadata/              # exposure-mode filtered
  • manifest.json must include: package version, recipe id, seed, generation timestamp, exposure mode, difficulty profile, table inventory with row counts, file hashes, bundle_schema_version.
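To make the required-field list concrete, here is a sketch that assembles a manifest dictionary and checks it for completeness. The key names, the placeholder version strings, and both helper functions are assumptions; only the list of required contents comes from the bullet above.

```python
import hashlib
import json

# Hypothetical key names for the required manifest contents.
REQUIRED_MANIFEST_KEYS = (
    "package_version", "recipe_id", "seed", "generated_at", "exposure_mode",
    "difficulty_profile", "tables", "file_hashes", "bundle_schema_version",
)


def build_manifest_sketch(
    *, package_version: str, recipe_id: str, seed: int, generated_at: str,
    exposure_mode: str, difficulty_profile: str,
    table_row_counts: dict[str, int], file_contents: dict[str, bytes],
    bundle_schema_version: str,
) -> dict:
    return {
        "package_version": package_version,
        "recipe_id": recipe_id,
        "seed": seed,
        "generated_at": generated_at,
        "exposure_mode": exposure_mode,
        "difficulty_profile": difficulty_profile,
        "tables": table_row_counts,
        # One SHA-256 hex digest per file, keyed by bundle-relative path.
        "file_hashes": {
            path: hashlib.sha256(data).hexdigest()
            for path, data in sorted(file_contents.items())
        },
        "bundle_schema_version": bundle_schema_version,
    }


def missing_manifest_keys(manifest: dict) -> list[str]:
    return [key for key in REQUIRED_MANIFEST_KEYS if key not in manifest]


manifest = build_manifest_sketch(
    package_version="0.0.0",  # placeholder value
    recipe_id="b2b_saas_procurement_v1", seed=42,
    generated_at="2024-01-01T00:00:00Z", exposure_mode="student_public",
    difficulty_profile="intermediate", table_row_counts={"leads": 5000},
    file_contents={"tables/leads.parquet": b"abc"},  # stand-in bytes
    bundle_schema_version="1",  # placeholder value
)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```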

LTV

  • Customer/subscription entities exist in v1 internals and may appear in relational outputs.
  • LTV labels are not first-class task outputs in v1.

Simulation

  • v1 uses a hybrid discrete-time simulator (daily steps, 90-day horizon for primary task).
  • converted_within_90_days is event-derived (not a directly sampled Bernoulli).
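"Event-derived" means the label is a function of the simulated event log, not a separate random draw. The sketch covers only that derivation step and assumes a `closed_won` event kind and an inclusive 90-day window; the function name and event shape are hypothetical.

```python
from datetime import date, timedelta


def converted_within_horizon(
    lead_created_at: date,
    events: list[tuple[date, str]],
    horizon_days: int = 90,
) -> bool:
    """True if any closed_won event falls inside the horizon after lead creation."""
    deadline = lead_created_at + timedelta(days=horizon_days)
    return any(
        kind == "closed_won" and lead_created_at <= when <= deadline
        for when, kind in events
    )


created = date(2024, 1, 1)
fast_lead = [(date(2024, 1, 20), "demo_booked"), (date(2024, 3, 15), "closed_won")]
slow_lead = [(date(2024, 2, 10), "demo_booked"), (date(2024, 4, 15), "closed_won")]

fast_label = converted_within_horizon(created, fast_lead)
slow_label = converted_within_horizon(created, slow_lead)
```

The slow lead does convert, but after day 90, so its label for this task is negative.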

Package Layout (canonical)

leadforge/
  api/            generator.py, recipes.py, bundle.py
  cli/            main.py, commands/{generate,list_recipes,inspect,validate}.py
  core/           rng.py, ids.py, time.py, enums.py, models.py, exceptions.py, ...
  narrative/      spec.py, company.py, product.py, personas.py, market.py, funnel.py, dataset_card.py
  schema/         entities.py (EntityRowProtocol, make_empty_dataframe, AccountRow — shared primitives),
                  features.py (FeatureSpec), relationships.py (FKConstraint, validate_fk),
                  tasks.py (SplitSpec, TaskManifest), dictionaries.py, tables.py
  schemes/        base.py (GenerationScheme protocol + SCHEME_REGISTRY);
                  lead_scoring/ — the lead-scoring scheme: __init__.py (build_world/
                  write_bundle) + simulation/, mechanisms/, structure/, render/
                  (moved in LTV-Pf.1/Pf.2);
                  lifecycle/ — the pLTV scheme (stub): entities.py, relationships.py
                  (scaffolded in LTV-Pg.1).  Lead-scoring schema specs migrate
                  under lead_scoring/ in LTV-Pg.2.  See docs/ltv/design.md §2.5.
  render/         relational_io.py (write_relational_tables — shared writer), manifests.py
                  # shared bundle-output envelope
  exposure/       modes.py, filters.py, redaction.py
  validation/     invariants.py, artifact_checks.py, realism.py, difficulty.py, drift.py
  recipes/        registry.py, b2b_saas_procurement_v1/{recipe,narrative,schema,motifs,difficulty_profiles}.yaml
  examples/       notebooks/, configs/
  sample_data/    public/, instructor/

Public API Contract (high-level)

from leadforge.api import Generator, list_recipes

gen = Generator.from_recipe("b2b_saas_procurement_v1", seed=42, exposure_mode="student_public")
bundle = gen.generate(n_accounts=1500, n_contacts=4200, n_leads=5000, difficulty="intermediate")
bundle.save("./out/procurement_world_001")

Key abstractions: Recipe, GenerationConfig, WorldSpec, WorldBundle, ExposureMode.


Config Precedence (highest → lowest)

  1. Explicit function args / CLI flags
  2. User override YAML/JSON file (--override)
  3. Recipe defaults
  4. Package defaults
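The precedence order maps directly onto `collections.ChainMap`, which looks keys up in its mappings from first to last. This is a sketch of the rule only; `resolve_config`, the treatment of `None` as "flag not passed", and the sample values are assumptions.

```python
from collections import ChainMap


def resolve_config(
    cli_args: dict, override_file: dict, recipe_defaults: dict, package_defaults: dict
) -> dict:
    """Merge the four layers; earlier layers win over later ones."""
    # Treat None as "flag not passed" so unset CLI options fall through.
    explicit = {k: v for k, v in cli_args.items() if v is not None}
    return dict(ChainMap(explicit, override_file, recipe_defaults, package_defaults))


resolved = resolve_config(
    cli_args={"seed": 42, "difficulty": None},
    override_file={"difficulty": "advanced"},
    recipe_defaults={"n_leads": 5000, "difficulty": "intermediate"},
    package_defaults={"n_leads": 1000, "difficulty": "intro", "seed": 0},
)
```

Here `seed` comes from the CLI, `difficulty` from the override file, and `n_leads` from the recipe.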

Commit and PR Conventions

  • Small-to-medium PRs: ~300–900 lines of meaningful diff.
  • One logical capability per PR; tests included.
  • PR title describes capability, not file list.
  • Tests required for: config parsing, recipe loading, RNG determinism, graph validation, mechanism behavior, serialization, CLI arg parsing.
  • Property tests required for: graph acyclicity, FK integrity, deterministic output under same seed, exposure filtering monotonicity.

Hard Constraints — Do Not Violate

  • Never use a single fixed hidden world (DGP must vary by motif family + rewiring).
  • Never leak post-snapshot-anchor data into flat task features.
  • Never publish public relational tables that allow label reconstruction via joins. Public relational exports must be snapshot-safe:
    • Every *_timestamp column in event tables (touches.touch_timestamp, sessions.session_timestamp, sales_activities.activity_timestamp) must satisfy <= lead_created_at + snapshot_day.
    • opportunities must be filtered by created_at <= lead_created_at + snapshot_day.
    • No terminal-state fields (close_outcome, closed_at, converted_within_90_days, conversion_timestamp) in public leads/opportunities.
    • No conversion-conditional entities (customers, subscriptions) in public bundles.
  • (lifecycle / b2b_saas_ltv_v1 scheme) The public relational export is snapshot-safe against the absolute observation_date cutoff:
    • Every timestamp column in the public event tables (subscription_events.event_timestamp, health_signals.period_start, invoices.invoice_date) must satisfy <= observation_date.
    • The public subscriptions table drops all stateful/terminal columns (subscription_status, current_mrr, renewal_count, expansion_count, subscription_end_at, churn_at, churn_reason), keeping only the at-signing identity (subscription_id, customer_id, plan_name, subscription_start_at, contract_term_months).
    • No pLTV target (ltv_revenue_*) or churn label appears in any public relational table.
    • Each task split carries only its own target (no cross-target leakage); the mrr_change_full_period trap is deliberately retained in all modes.
    • The early-pLTV (tenure-anchored) task family is omitted from student_public bundles: its forward window precedes observation_date, so its targets would be reconstructible by joining the public event tables. It ships in research_instructor only.
    • The calendar-anchored family is published (its targets fall after observation_date).
  • Never require external APIs for core generation.
  • Never publish hidden truth in student_public mode.
  • Never derive converted_within_90_days as a directly sampled label; it must emerge from simulated events.
  • Never skip schema versioning in manifest.json.
  • Do not add LTV labels as first-class task outputs in v1.
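For the lead-scoring snapshot-safe rule above, the two operations are a per-lead timestamp cutoff and a column drop. The sketch uses plain dict rows and assumes every event row carries a `lead_id`; the function names are hypothetical, and the terminal field names are copied from the constraint.

```python
from datetime import datetime, timedelta

TERMINAL_FIELDS = (
    "close_outcome", "closed_at", "converted_within_90_days", "conversion_timestamp",
)


def snapshot_safe_events(
    rows: list[dict],
    timestamp_column: str,
    lead_created_at: dict[str, datetime],
    snapshot_day: int,
) -> list[dict]:
    """Keep rows whose timestamp is <= that lead's created_at + snapshot_day."""
    window = timedelta(days=snapshot_day)
    return [
        row for row in rows
        if row[timestamp_column] <= lead_created_at[row["lead_id"]] + window
    ]


def drop_terminal_fields(rows: list[dict]) -> list[dict]:
    """Remove terminal-state columns from public leads/opportunities rows."""
    return [{k: v for k, v in row.items() if k not in TERMINAL_FIELDS} for row in rows]


created_at = {"lead_000001": datetime(2024, 1, 1)}
touches = [
    {"lead_id": "lead_000001", "touch_timestamp": datetime(2024, 1, 5)},
    {"lead_id": "lead_000001", "touch_timestamp": datetime(2024, 2, 20)},
]
public_touches = snapshot_safe_events(touches, "touch_timestamp", created_at, snapshot_day=14)
public_leads = drop_terminal_fields(
    [{"lead_id": "lead_000001", "close_outcome": "closed_won", "lead_source": "web"}]
)
```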

Repository Map

leadforge/                    # Python package root
├── api/                      # Public API: Generator, Recipe, Bundle
│   ├── generator.py          # Generator.from_recipe() → .generate() → WorldBundle
│   ├── recipes.py            # Recipe loading, config resolution
│   └── bundle.py             # write_bundle() orchestrator
├── cli/                      # Click CLI
│   ├── main.py               # CLI entry point
│   └── commands/             # generate, inspect, validate, list_recipes
├── core/                     # Foundational utilities
│   ├── rng.py                # RNGRoot with named substreams
│   ├── ids.py                # Deterministic ID generation (acct_000001, etc.)
│   ├── models.py             # GenerationConfig, WorldSpec, WorldBundle
│   ├── enums.py              # ExposureMode, DifficultyProfile
│   └── exceptions.py         # Custom exception hierarchy
├── narrative/                # Vertical narrative (company, market, personas)
│   ├── spec.py               # NarrativeSpec and sub-spec dataclasses
│   └── dataset_card.py       # Markdown dataset card renderer
├── schema/                   # Relational data model
│   ├── entities.py           # 9 entity row dataclasses (AccountRow, LeadRow, etc.)
│   ├── features.py           # LEAD_SNAPSHOT_FEATURES — canonical feature spec
│   ├── relationships.py      # FK constraints (ALL_CONSTRAINTS)
│   ├── tasks.py              # SplitSpec, TaskManifest, CONVERTED_WITHIN_90_DAYS
│   └── dictionaries.py       # Feature dictionary CSV writer
├── schemes/                  # Generation schemes (peer pipelines) + registry
│   ├── base.py               # GenerationScheme protocol + SCHEME_REGISTRY
│   ├── lead_scoring/         # The lead-scoring scheme (LeadScoringScheme)
│   │   ├── __init__.py       # build_world() + write_bundle()
│   │   ├── structure/        # Hidden world graph (WorldGraph, motifs, sampler)
│   │   ├── mechanisms/       # Node/edge behavior (policies, hazards, scores, …)
│   │   ├── simulation/       # World evolution (engine, population, state)
│   │   └── render/           # Lead-scoring render: snapshots, relational
│   │                         #   (to_dataframes), relational_snapshot_safe, tasks
│   └── lifecycle/            # The pLTV scheme (LifecycleScheme — stub until M3–M6)
│       ├── __init__.py       # registers the stub scheme
│       ├── entities.py       # lifecycle rows + LIFECYCLE_ROW_TYPES
│       └── relationships.py  # LIFECYCLE_CONSTRAINTS
│   # NOTE (LTV-M2 reorg in progress): lead-scoring schema specs split in LTV-Pg.2.
│   # See docs/ltv/design.md §2.5 for the target layout.
├── render/                   # Shared bundle-output envelope
│   ├── relational_io.py      # write_relational_tables() — shared table writer
│   └── manifests.py          # build_manifest(), write_manifest()
├── exposure/                 # Truth filtering
│   ├── modes.py              # apply_exposure() dispatch
│   ├── metadata.py           # write_metadata_dir() for instructor mode
│   └── filters.py            # BundleFilter, FILTERS dict
├── validation/               # Bundle quality checks
│   ├── bundle_checks.py      # validate_bundle() orchestrator
│   ├── invariants.py         # Determinism + exposure monotonicity
│   ├── realism.py            # Conversion rates, feature ranges, stage diversity
│   ├── difficulty.py         # Known difficulty profile validation
│   └── drift.py              # Cross-seed stability
└── recipes/                  # Recipe definitions
    └── b2b_saas_procurement_v1/
        ├── recipe.yaml       # Recipe metadata + defaults
        ├── narrative.yaml    # Company, product, market, personas, funnel
        └── difficulty_profiles.yaml  # intro/intermediate/advanced

Related repos

  • leadforge-datasets-private — generated dataset archive
    • b2b_saas_procurement_v1__intro__seed42/ — full relational bundle
    • lead_scoring_intro/ — simplified single-CSV versions (v1–v4)
    • scripts/ — build and validation scripts for simplified CSVs

Generation Workflow

Generate a full bundle

leadforge generate \
  --recipe b2b_saas_procurement_v1 \
  --seed 42 \
  --mode student_public \
  --difficulty intro \
  --n-leads 5000 \
  --out ./out/bundle

Build a simplified CSV (v4 example)

# In leadforge-datasets-private repo:
python scripts/build_v4_snapshot.py /path/to/bundle lead_scoring_intro/lead_scoring_intro_v4.csv

Validate a simplified CSV

python scripts/validate_v4_dataset.py lead_scoring_intro/lead_scoring_intro_v4.csv

Validate a full bundle

leadforge validate ./out/bundle

student_public Mode Invariants

These are non-negotiable for any dataset published in student_public mode:

  1. No post-snapshot features — all features computed from events ≤ snapshot day only.
  2. No outcome-stage columns: current_stage, funnel_stage with closed_won/closed_lost are banned.
  3. No deterministic single-feature mapping — for any feature value with n≥50, conversion rate must be in [2%, 98%].
  4. No hidden truth — latent scores, mechanism parameters, world graph not included.
  5. No direct outcome columns: conversion_timestamp, close_outcome are banned.
  6. No zero-variance features — every included feature must have ≥2 distinct values.

Exception: deliberately included leakage traps (e.g., total_touches_all in v4) must be clearly documented in release notes and feature dictionary.
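Invariants 3 and 6 are simple enough to check mechanically. The sketch below assumes one categorical feature column and a binary label column as plain lists; the function names and sample values are hypothetical and do not reflect the checks in `validation/`.

```python
from collections import defaultdict


def deterministic_feature_values(
    values: list, labels: list[int],
    min_n: int = 50, low: float = 0.02, high: float = 0.98,
) -> list:
    """Return feature values with n >= min_n whose conversion rate leaves [low, high]."""
    counts: dict = defaultdict(lambda: [0, 0])  # value -> [n, conversions]
    for value, label in zip(values, labels):
        counts[value][0] += 1
        counts[value][1] += int(label)
    return [
        value for value, (n, converted) in counts.items()
        if n >= min_n and not (low <= converted / n <= high)
    ]


def is_zero_variance(values: list) -> bool:
    """True when the feature has fewer than 2 distinct values."""
    return len(set(values)) < 2


lead_source = ["web"] * 60 + ["referral"] * 60 + ["event"] * 10
converted = [0] * 60 + [1] * 30 + [0] * 30 + [1] * 10

violations = deterministic_feature_values(lead_source, converted)
```

`web` is flagged because 60 rows convert at 0%. `event` converts at 100% but has only 10 rows, so it falls below the n≥50 threshold and is not flagged.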


How to Add New Features to the Snapshot

  1. Add a FeatureSpec entry to LEAD_SNAPSHOT_FEATURES in leadforge/schema/features.py.
  2. Compute the feature value in build_snapshot() in leadforge/schemes/lead_scoring/render/snapshots.py.
  3. If the feature needs new event data, add it to the simulation loop in leadforge/schemes/lead_scoring/simulation/engine.py.
  4. Update leadforge/schema/dictionaries.py if the feature dictionary format changes.
  5. Run pytest and leadforge validate on a generated bundle.
  6. Update the feature dictionary CSV description.

v4 Dataset Plan

The current focus is producing a v4 lead scoring intro dataset. See docs/v4/ for:

  • design.md — requirements, contract, engine changes, implementation plan (single source of truth)
  • validation_spec.md — automated validation checks
  • planning_pr_review.md — self-review of the planning PR and treatment plan

Reference Docs

  • LTV workstream (next, active planning): docs/ltv/design.md + docs/ltv/roadmap.md
  • Design decisions: docs/leadforge_design_doc.md
  • Architecture/spec: docs/leadforge_architecture_spec.md
  • Implementation roadmap: docs/leadforge_implementation_plan.md
  • v4 dataset plan: docs/v4/design.md
  • v1 dataset release roadmap (active): docs/release/v1_release_roadmap.md
  • v1 release design: docs/release/v1_release_design.md
  • v1 acceptance gates: docs/release/v1_acceptance_gates.md
  • Post-v1 roadmap: docs/release/post_v1_roadmap.md
  • External review synthesis: docs/external_review/summaries/