leadforge-dev · shaypal5 · May 29, 2026 · May 29, 2026
diff --git a/.github/workflows/deploy-docs.yml b/.github/workflows/deploy-docs.yml
@@ -0,0 +1,57 @@
+name: Deploy docs to GitHub Pages
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - "website/**"
+      - ".github/workflows/deploy-docs.yml"
+  workflow_dispatch:
+
+permissions:
+  contents: read
+  pages: write
+  id-token: write
+
+concurrency:
+  group: pages
+  cancel-in-progress: false
+
+jobs:
+  build:
+    name: Build Docusaurus site
+    runs-on: ubuntu-latest
+    defaults:
+      run:
+        working-directory: website
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-node@v4
+        with:
+          node-version: "20"
+          cache: npm
+          cache-dependency-path: website/package-lock.json
+
+      - name: Install dependencies
+        run: npm ci
+
+      - name: Build site
+        run: npm run build
+
+      - name: Upload Pages artifact
+        uses: actions/upload-pages-artifact@v3
+        with:
+          path: website/build
+
+  deploy:
+    name: Deploy to GitHub Pages
+    needs: build
+    runs-on: ubuntu-latest
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    steps:
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v4
diff --git a/website/.gitignore b/website/.gitignore
@@ -0,0 +1,4 @@
+node_modules/
+build/
+.docusaurus/
+.cache-loader/
diff --git a/website/README.md b/website/README.md
@@ -0,0 +1,41 @@
+# Website
+
+This website is built using [Docusaurus](https://docusaurus.io/), a modern static website generator.
+
+## Installation
+
+```bash
+npm ci
+```
+
+## Local Development
+
+```bash
+npm start
+```
+
+This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.
+
+## Build
+
+```bash
+npm run build
+```
+
+This command generates static content into the `build` directory, which can be served by any static content hosting service.
+
+## Deployment
+
+Using SSH:
+
+```bash
+USE_SSH=true npm run deploy
+```
+
+Not using SSH:
+
+```bash
+GIT_USER=<Your GitHub username> npm run deploy
+```
+
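+If you are using GitHub Pages for hosting, this command is a convenient way to build the website and push to the `gh-pages` branch. In this repository, the `.github/workflows/deploy-docs.yml` workflow builds and deploys the site on every push to `main` that touches `website/**`, so a manual deploy is normally unnecessary.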
diff --git a/website/docs/concepts/motif-families.md b/website/docs/concepts/motif-families.md
@@ -0,0 +1,37 @@
+---
+sidebar_position: 3
+title: Motif families
+---
+
+# Motif families
+
+`leadforge` deliberately avoids a single fixed data-generating process (DGP). Instead, the hidden world is sampled from one of five **motif families**, then stochastically rewired. This ensures that:
+
+1. Different dataset instances have genuinely different causal structures.
+2. No single feature engineering recipe is universally optimal.
+3. The true DGP is verifiable via the instructor companion.
+
+## The five families
+
+### `fit_dominant`
+Account and ICP fit traits are the primary path to conversion. High-fit accounts convert at much higher rates regardless of engagement volume. Feature engineering that captures account-level firmographics will dominate.
+
+### `intent_dominant`
+Buying intent signals — session depth, demo requests, content downloads, direct inquiries — are the main driver. Fit alone is insufficient; conversion requires observable interest signals. Engagement-based features and recency weighting matter most.
+
+### `sales_execution_sensitive`
+SDR responsiveness, AE follow-through, and meeting-to-proposal timing are the dominant levers. Two otherwise-identical leads have very different outcomes depending on how quickly and consistently they were worked. Activity cadence features are the key signal.
+
+### `demo_trial_mediated`
+Conversion is causally gated on a demo or trial event. Leads that never reach a demo rarely convert; leads that do have high conversion probability. Models that can identify "reached demo" or "trial active" as a key pathway will perform well.
+
+### `buying_committee_friction`
+Multi-stakeholder dynamics create the primary noise. A lead may have high fit, intent, and SDR attention, but stall because a procurement or finance stakeholder raised objections. Contact-level authority and multi-touch diversity features matter.
+
+## Stochastic rewiring
+
+After sampling a motif family, the graph is subjected to stochastic rewiring: edges are added or removed with small probabilities, and edge weights are perturbed. This means no two generated bundles have exactly the same graph even within the same motif family, while the family-level character is preserved.
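+
+The rewiring step described above can be sketched as follows. This is an illustration under stated assumptions, not `leadforge`'s implementation: the function name `rewire`, the probabilities, the Gaussian weight noise, and the default weight for new edges are all hypothetical. New edges are only added between nodes in topological order, which keeps the graph acyclic.
+
+```python
+import random
+
+
+def rewire(order, edges, seed, p_remove=0.05, p_add=0.02, weight_sigma=0.1):
+    """Rewire a weighted DAG.
+
+    `order` is a topological order of the nodes; `edges` maps
+    (source, target) -> weight and must only contain forward pairs.
+    """
+    rng = random.Random(seed)  # same seed -> same rewired graph
+    rewired = {}
+    for i, source in enumerate(order):
+        for target in order[i + 1:]:  # forward pairs only, so no cycle can form
+            pair = (source, target)
+            if pair in edges:
+                if rng.random() < p_remove:
+                    continue  # drop an existing edge
+                weight = edges[pair]
+            elif rng.random() < p_add:
+                weight = 1.0  # assumed default weight for a new edge
+            else:
+                continue
+            # Perturb the weight with zero-mean Gaussian noise.
+            rewired[pair] = weight + rng.gauss(0.0, weight_sigma)
+    return rewired
+```
+
+With small `p_remove` and `p_add`, most of the family's edges survive, which is one way the family-level character could be preserved while individual graphs differ.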
+
+## Identifying the motif family
+
+The motif family is **not disclosed** in the student bundle. It is recorded in `metadata/world_spec.json` (instructor mode only). Breaking the dataset — inferring the motif family from the student features — is one of the intended challenges.
diff --git a/website/docs/concepts/output-bundle.md b/website/docs/concepts/output-bundle.md
@@ -0,0 +1,77 @@
+---
+sidebar_position: 4
+title: Output bundle
+---
+
+# Output bundle structure
+
+```
+bundle_root/
+├── manifest.json                   ← provenance, row counts, SHA-256 hashes
+├── dataset_card.md                 ← human-readable documentation
+├── feature_dictionary.csv          ← authoritative column spec
+├── tables/                         ← 9 relational Parquet tables (7 in public bundles)
+│   ├── accounts.parquet
+│   ├── contacts.parquet
+│   ├── leads.parquet
+│   ├── touches.parquet
+│   ├── sessions.parquet
+│   ├── sales_activities.parquet
+│   ├── opportunities.parquet
+│   ├── customers.parquet           ← instructor mode only
+│   └── subscriptions.parquet       ← instructor mode only
+├── tasks/
+│   └── converted_within_90_days/
+│       ├── train.parquet           ← 70% split
+│       ├── valid.parquet           ← 15% split
+│       ├── test.parquet            ← 15% split
+│       └── task_manifest.json
+└── metadata/                       ← instructor mode only
+    ├── world_spec.json
+    ├── graph.graphml
+    ├── graph.json
+    ├── latent_registry.json
+    └── mechanism_summary.json
+```
+
+## `manifest.json`
+
+Records everything needed to reproduce or verify the bundle:
+
+```json
+{
+  "bundle_schema_version": 5,
+  "package_version": "1.0.0",
+  "recipe_id": "b2b_saas_procurement_v1",
+  "seed": 42,
+  "generation_timestamp": "...",
+  "exposure_mode": "student_public",
+  "difficulty_profile": "intermediate",
+  "table_inventory": { "leads": 5000, "accounts": 1500, ... },
+  "file_hashes": { "tables/leads.parquet": "sha256:..." }
+}
+```
+
+## Task splits
+
+Splits are stratified by `converted_within_90_days` and fixed by seed:
+
+| Split | Share | Rows (default 5,000 leads) |
+|---|---|---|
+| `train` | 70% | 3,500 |
+| `valid` | 15% | 750 |
+| `test` | 15% | 750 |
+
+The split spec is recorded in `tasks/converted_within_90_days/task_manifest.json`.
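+
+A seeded, stratified 70/15/15 split can be sketched in plain Python as shown here. This is an illustration only, not `leadforge`'s implementation: the function name `stratified_split` and its rounding rule are assumptions.
+
+```python
+import random
+from collections import defaultdict
+
+
+def stratified_split(labels, seed, train_share=0.70, valid_share=0.15):
+    """Return {"train": [...], "valid": [...], "test": [...]} of row indices."""
+    by_label = defaultdict(list)
+    for idx, label in enumerate(labels):
+        by_label[label].append(idx)
+
+    rng = random.Random(seed)  # fixed seed -> identical split on every run
+    splits = {"train": [], "valid": [], "test": []}
+    for label in sorted(by_label):  # sorted so iteration order is stable
+        rows = by_label[label]
+        rng.shuffle(rows)
+        n_train = round(len(rows) * train_share)
+        n_valid = round(len(rows) * valid_share)
+        splits["train"] += rows[:n_train]
+        splits["valid"] += rows[n_train:n_train + n_valid]
+        splits["test"] += rows[n_train + n_valid:]  # remainder goes to test
+    return splits
+```
+
+Splitting each label group separately keeps the conversion rate roughly equal across `train`, `valid`, and `test`.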
+
+## Validating a bundle
+
+```bash
+leadforge validate ./out/bundle
+```
+
+This checks:
+- SHA-256 hashes in `manifest.json` match all files
+- FK integrity across all relational tables
+- No post-snapshot-anchor timestamps in public tables
+- Conversion rate within declared tier bands
+- No zero-variance features in task splits
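+
+The first check can be approximated outside the CLI with a few lines of Python. This is a sketch, not the code behind `leadforge validate`: the helper name `find_hash_mismatches` is hypothetical, and it assumes each `file_hashes` value has the form `sha256:<hex digest>`.
+
+```python
+import hashlib
+import json
+from pathlib import Path
+
+
+def find_hash_mismatches(bundle_root):
+    """Return relative paths whose SHA-256 differs from manifest.json."""
+    root = Path(bundle_root)
+    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
+    mismatches = []
+    for rel_path, recorded in manifest["file_hashes"].items():
+        # Assumption: values look like "sha256:<hex digest>".
+        expected = recorded.removeprefix("sha256:")
+        file_path = root / rel_path
+        if not file_path.is_file():
+            mismatches.append(rel_path)  # a missing file counts as a mismatch
+            continue
+        actual = hashlib.sha256(file_path.read_bytes()).hexdigest()
+        if actual != expected:
+            mismatches.append(rel_path)
+    return mismatches
+```
+
+An empty list means every file listed in the manifest is present and unmodified. `str.removeprefix` requires Python 3.9 or later.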
diff --git a/website/docs/concepts/overview.md b/website/docs/concepts/overview.md
@@ -0,0 +1,75 @@
+---
+sidebar_position: 1
+title: Overview
+---
+
+# How leadforge works
+
+`leadforge` generates datasets by **simulating a commercial world**, not by sampling rows from a distribution. This distinction matters:
+
+- A distribution-sampler can reproduce the statistical shape of a CRM dataset.
+- A world-simulator produces rows that have *reasons* — leads convert because they fit the ICP, have high urgency, and were engaged by a persistent SDR; leads don't convert because they stalled in technical review, or because the champion left the company.
+
+That structure is what makes the data useful for teaching: there is something real to find, and it can be found with the right feature engineering and model choices.
+
+## The generation pipeline
+
+Generation runs in five sequential layers, each deterministic given the same seed:
+
+```
+1. Hidden world structure   ← sample motif family, rewire DAG
+         ↓
+2. Mechanism layer          ← assign mechanisms to every node
+         ↓
+3. Population layer         ← create accounts, contacts, leads with latent traits
+         ↓
+4. Simulation               ← run 90-day daily event loop
+         ↓
+5. Rendering                ← snapshot-safe feature extraction + relational export
+```
+
+### 1. Hidden world structure
+
+A directed acyclic graph (DAG) of latent traits, pipeline states, and the conversion outcome is sampled from one of five **motif families** and then stochastically rewired. The motif families are:
+
+| Family | What drives conversion |
+|---|---|
+| `fit_dominant` | Account/ICP fit is the primary signal |
+| `intent_dominant` | Buying intent signals (sessions, demo requests) dominate |
+| `sales_execution_sensitive` | SDR and AE behaviour is the strongest lever |
+| `demo_trial_mediated` | Conversion is gated on a demo or trial event |
+| `buying_committee_friction` | Multi-stakeholder dynamics create the main noise |
+
+### 2. Mechanism layer
+
+Every node in the sampled graph gets a concrete mechanism — a logistic latent score, Poisson intensity, recency-decayed engagement intensity, categorical channel influence, stage transition hazard, or conversion hazard. Parameters are calibrated per difficulty tier.
+
+### 3. Population layer
+
+Accounts (1,500), contacts (4,200), and leads (5,000) are instantiated with deterministic IDs (`acct_000001`, `lead_000001`) and latent trait vectors drawn from the world graph.
+
+### 4. Simulation
+
+A hybrid discrete-time simulator runs a 90-day daily loop. Each day, each active lead may:
+
+- receive a touch (email, call, demo, etc.)
+- generate a session
+- receive a sales activity
+- advance or stall in the pipeline stage sequence
+- convert (via a calibrated hazard function)
+
+Everything is event-derived — the `converted_within_90_days` label emerges from simulated events, not from a directly sampled Bernoulli.
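+
+A toy version of the loop shows how a label can be read off simulated events. This is a deliberately simplified sketch, not the simulator itself: it models only the conversion event, assumes a constant daily hazard, and uses the hypothetical names `simulate_lead` and `converted_within_90_days`.
+
+```python
+import random
+
+
+def simulate_lead(daily_hazard, seed, horizon_days=90):
+    """Return the conversion day (1-based), or None if the lead never converts."""
+    rng = random.Random(seed)
+    for day in range(1, horizon_days + 1):
+        # One draw per day: the lead converts today with probability
+        # `daily_hazard`, given that it has not converted yet.
+        if rng.random() < daily_hazard:
+            return day
+    return None
+
+
+def converted_within_90_days(daily_hazard, seed):
+    # The label is derived from the simulated event, not sampled on its own.
+    return simulate_lead(daily_hazard, seed) is not None
+```
+
+In the real pipeline the hazard would depend on latent traits and on the lead's touches, sessions, and pipeline stage; a constant is used here only to keep the sketch short.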
+
+### 5. Rendering
+
+The simulation state is projected into:
+
+- Relational tables: all 9 in instructor mode; 7 in public bundles, snapshot-filtered to events on or before the anchor date
+- A flat ML-ready task table (the train/valid/test splits)
+- Metadata files (manifest, feature dictionary, dataset card)
+
+The **exposure mode** controls what gets written.
+
+## Reproducibility
+
+All generation is deterministic given `(recipe, config, seed, package version)`. The recipe ID, seed, and package version are recorded in `manifest.json`, so any bundle can be regenerated exactly from the same recipe and config.
diff --git a/website/docs/concepts/world-simulation.md b/website/docs/concepts/world-simulation.md
@@ -0,0 +1,47 @@
+---
+sidebar_position: 2
+title: World simulation
+---
+
+# World simulation
+
+## The fictional world
+
+Every `leadforge` dataset is grounded in a fictional but internally consistent commercial world. For v1, that world is:
+
+> **Veridian Technologies**, a mid-market B2B SaaS company selling procurement and AP workflow automation software ("Veridian Procure") to 200–2,000 employee firms in the US and UK, through a mixed inbound, SDR-assisted, and partner-driven go-to-market motion.
+
+The company narrative, product details, buyer personas, and funnel structure are all declared in a **recipe YAML** and rendered into the dataset card, feature descriptions, and metadata.
+
+## Entities
+
+The simulation tracks 9 entity types, mirroring a real CRM:
+
+| Table | What it represents |
+|---|---|
+| `accounts` | Companies (the buying org) |
+| `contacts` | People at each account |
+| `leads` | A contact at a specific account entering the funnel |
+| `touches` | Outbound and inbound engagement events |
+| `sessions` | Website/product sessions |
+| `sales_activities` | SDR/AE-logged activities (calls, emails, meetings) |
+| `opportunities` | Formal pipeline records |
+| `customers` | Post-conversion account status (instructor only) |
+| `subscriptions` | Subscription records (instructor only) |
+
+## The latent trait system
+
+Each entity carries a vector of latent traits that are **not directly observable** in the student dataset. Examples:
+
+- `account_fit_score` — how well the account matches the ICP
+- `contact_authority` — decision-making authority of the contact
+- `problem_awareness` — how aware the buyer is of the problem being solved
+- `urgency_score` — time pressure at the account
+
+These traits modulate the probabilities of events and transitions during simulation. The instructor companion exposes them; the student dataset does not.
+
+## Snapshot safety
+
+The primary task is predicting conversion *within 90 days from a snapshot anchor date*. The anchor date is per-lead. All features in the public dataset are computed from events **on or before** the anchor date — this is enforced at rendering time, not by convention.
+
+Columns and tables that would allow label reconstruction via joins (e.g., `customers`, `subscriptions`, terminal-stage opportunity fields) are excluded from the public bundle entirely.
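+
+The per-lead cutoff can be illustrated with a small filter. This is a sketch of the rule only, not the renderer's code: the function name `snapshot_safe` and the field names `lead_id` and `event_date` are assumptions.
+
+```python
+from datetime import date
+
+
+def snapshot_safe(events, anchor_by_lead):
+    """Keep only events dated on or before their lead's anchor date."""
+    return [
+        event
+        for event in events
+        if event["event_date"] <= anchor_by_lead[event["lead_id"]]
+    ]
+
+
+anchors = {"lead_000001": date(2026, 3, 1)}
+events = [
+    {"lead_id": "lead_000001", "event_date": date(2026, 2, 27)},  # kept
+    {"lead_id": "lead_000001", "event_date": date(2026, 3, 1)},   # kept (on the anchor)
+    {"lead_id": "lead_000001", "event_date": date(2026, 3, 2)},   # dropped
+]
+public_events = snapshot_safe(events, anchors)
+```
+
+Because the anchor is looked up per lead, two leads can have different cutoffs within the same table.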