57 changes: 57 additions & 0 deletions .github/workflows/deploy-docs.yml
@@ -0,0 +1,57 @@
name: Deploy docs to GitHub Pages

on:
  push:
    branches: [main]
    paths:
      - "website/**"
      - ".github/workflows/deploy-docs.yml"
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: pages
  cancel-in-progress: false

jobs:
  build:
    name: Build Docusaurus site
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: website
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: npm
          cache-dependency-path: website/package-lock.json

      - name: Install dependencies
        run: npm ci

      - name: Build site
        run: npm run build

      - name: Upload Pages artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: website/build

  deploy:
    name: Deploy to GitHub Pages
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
4 changes: 4 additions & 0 deletions website/.gitignore
@@ -0,0 +1,4 @@
node_modules/
build/
.docusaurus/
.cache-loader/
41 changes: 41 additions & 0 deletions website/README.md
@@ -0,0 +1,41 @@
# Website

This website is built using [Docusaurus](https://docusaurus.io/), a modern static website generator.

## Installation

```bash
npm install
```

## Local Development

```bash
npm run start
```

This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.

## Build

```bash
npm run build
```

This command generates static content into the `build` directory, which can be served by any static hosting service.

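## Deployment

Pushes to `main` that change `website/**` are built and deployed to GitHub Pages by `.github/workflows/deploy-docs.yml`. To deploy manually instead:

Using SSH:

```bash
USE_SSH=true npm run deploy
```

Not using SSH:

```bash
GIT_USER=<Your GitHub username> npm run deploy
```

If you are using GitHub Pages for hosting, this command is a convenient way to build the website and push to the `gh-pages` branch.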
37 changes: 37 additions & 0 deletions website/docs/concepts/motif-families.md
@@ -0,0 +1,37 @@
---
sidebar_position: 3
title: Motif families
---

# Motif families

`leadforge` deliberately avoids a single fixed data-generating process (DGP). Instead, the hidden world is sampled from one of five **motif families**, then stochastically rewired. This ensures that:

1. Different dataset instances have genuinely different causal structures.
2. No single feature engineering recipe is universally optimal.
3. The true DGP is verifiable via the instructor companion.

## The five families

### `fit_dominant`
Account and ICP fit traits are the primary path to conversion. High-fit accounts convert at much higher rates regardless of engagement volume. Feature engineering that captures account-level firmographics will dominate.

### `intent_dominant`
Buying intent signals — session depth, demo requests, content downloads, direct inquiries — are the main driver. Fit alone is insufficient; conversion requires observable interest signals. Engagement-based features and recency weighting matter most.

### `sales_execution_sensitive`
SDR responsiveness, AE follow-through, and meeting-to-proposal timing are the dominant levers. Two otherwise-identical leads have very different outcomes depending on how quickly and consistently they were worked. Activity cadence features are the key signal.

### `demo_trial_mediated`
Conversion is causally gated on a demo or trial event. Leads that never reach a demo rarely convert; leads that do reach one convert with high probability. Models that can identify "reached demo" or "trial active" as a key pathway will perform well.

### `buying_committee_friction`
Multi-stakeholder dynamics create the primary noise. A lead may have high fit, intent, and SDR attention, but stall because a procurement or finance stakeholder raised objections. Contact-level authority and multi-touch diversity features matter.

## Stochastic rewiring

After sampling a motif family, the graph is subjected to stochastic rewiring: edges are added or removed with small probabilities, and edge weights are perturbed. This means no two generated bundles have exactly the same graph even within the same motif family, while the family-level character is preserved.
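
The rewiring step described above can be sketched as follows. This is a minimal illustration, not the `leadforge` implementation: the function names, the probabilities `p_add` and `p_drop`, the Gaussian weight perturbation, and the three-node template are all assumptions chosen for the example. The one property it is meant to show is that edges can be dropped, added, and perturbed while the graph stays acyclic and the result stays reproducible from the seed.

```python
import random


def creates_cycle(edges, src, dst):
    """Return True if dst can already reach src, so adding src -> dst would close a cycle."""
    stack, seen = [dst], set()
    while stack:
        node = stack.pop()
        if node == src:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(v for (u, v) in edges if u == node)
    return False


def rewire(edges, nodes, seed, p_add=0.05, p_drop=0.05, weight_sd=0.1):
    """Hypothetical rewiring pass over a weighted DAG given as {(src, dst): weight}."""
    rng = random.Random(seed)
    out = {}
    # Drop each existing edge with small probability; perturb the weights of survivors.
    for edge, weight in sorted(edges.items()):
        if rng.random() < p_drop:
            continue
        out[edge] = weight + rng.gauss(0.0, weight_sd)
    # Add each absent edge with small probability, skipping any that would create a cycle.
    for src in nodes:
        for dst in nodes:
            if src == dst or (src, dst) in out:
                continue
            if rng.random() < p_add and not creates_cycle(out, src, dst):
                out[(src, dst)] = rng.gauss(0.0, weight_sd)
    return out


# Illustrative template only; real motif graphs are larger.
template = {("fit", "intent"): 0.4, ("fit", "convert"): 1.2, ("intent", "convert"): 0.6}
nodes = ["fit", "intent", "convert"]
a = rewire(template, nodes, seed=1)
b = rewire(template, nodes, seed=2)
```

Because every edge weight is perturbed, `a` and `b` differ even when no edge is added or removed, which matches the claim that no two bundles share exactly the same graph.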

## Identifying the motif family

The motif family is **not disclosed** in the student bundle. It is recorded in `metadata/world_spec.json` (instructor mode only). Breaking the dataset — inferring the motif family from the student features — is one of the intended challenges.
77 changes: 77 additions & 0 deletions website/docs/concepts/output-bundle.md
@@ -0,0 +1,77 @@
---
sidebar_position: 4
title: Output bundle
---

# Output bundle structure

```
bundle_root/
├── manifest.json              ← provenance, row counts, SHA-256 hashes
├── dataset_card.md            ← human-readable documentation
├── feature_dictionary.csv     ← authoritative column spec
├── tables/                    ← 9 relational Parquet tables
│   ├── accounts.parquet
│   ├── contacts.parquet
│   ├── leads.parquet
│   ├── touches.parquet
│   ├── sessions.parquet
│   ├── sales_activities.parquet
│   └── opportunities.parquet
│       (customers.parquet, subscriptions.parquet: instructor mode only)
├── tasks/
│   └── converted_within_90_days/
│       ├── train.parquet      ← 70% split
│       ├── valid.parquet      ← 15% split
│       ├── test.parquet       ← 15% split
│       └── task_manifest.json
└── metadata/                  ← instructor mode only
    ├── world_spec.json
    ├── graph.graphml
    ├── graph.json
    ├── latent_registry.json
    └── mechanism_summary.json
```

## `manifest.json`

Records everything needed to reproduce or verify the bundle:

```json
{
  "bundle_schema_version": 5,
  "package_version": "1.0.0",
  "recipe_id": "b2b_saas_procurement_v1",
  "seed": 42,
  "generation_timestamp": "...",
  "exposure_mode": "student_public",
  "difficulty_profile": "intermediate",
  "table_inventory": { "leads": 5000, "accounts": 1500, ... },
  "file_hashes": { "tables/leads.parquet": "sha256:..." }
}
```

## Task splits

Splits are stratified by `converted_within_90_days` and fixed by seed:

| Split | Share | Rows (default 5,000 leads) |
|---|---|---|
| `train` | 70% | 3,500 |
| `valid` | 15% | 750 |
| `test` | 15% | 750 |

The split spec is recorded in `tasks/converted_within_90_days/task_manifest.json`.
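
To make "stratified and fixed by seed" concrete, here is a small standard-library sketch of a 70/15/15 stratified split. It is an illustration under assumptions, not the code `leadforge` uses: `stratified_split` is a hypothetical name, and the 10% conversion rate in the example labels is invented.

```python
import random
from collections import defaultdict


def stratified_split(labels, seed, shares=(("train", 0.70), ("valid", 0.15), ("test", 0.15))):
    """Assign each row index to a split so every split keeps the overall label ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    assignment = {}
    for label in sorted(by_label):
        rows = by_label[label]
        rng.shuffle(rows)  # seeded shuffle, so the split is reproducible
        start, cumulative = 0, 0.0
        for i, (name, share) in enumerate(shares):
            cumulative += share
            # The last split takes whatever remains, so no row is left unassigned.
            end = len(rows) if i == len(shares) - 1 else round(cumulative * len(rows))
            for idx in rows[start:end]:
                assignment[idx] = name
            start = end
    return assignment


labels = [1] * 500 + [0] * 4500  # hypothetical 10% conversion rate over 5,000 leads
split = stratified_split(labels, seed=42)
```

Splitting each label group separately is what keeps the positive rate identical across `train`, `valid`, and `test`; reusing the same seed reproduces the same assignment.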

## Validating a bundle

```bash
leadforge validate ./out/bundle
```

This checks:
- SHA-256 hashes in `manifest.json` match all files
- FK integrity across all relational tables
- No post-snapshot-anchor timestamps in public tables
- Conversion rate within declared tier bands
- No zero-variance features in task splits
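
The first check in the list above can be approximated outside the CLI. The sketch below assumes that `file_hashes` maps bundle-relative paths to strings of the form `sha256:<hex digest>`, which is inferred from the manifest example rather than confirmed; `verify_hashes` is a hypothetical helper, and the demo bundle is a throwaway directory with fake file contents.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def verify_hashes(bundle_root):
    """Return the relative paths whose SHA-256 is missing or differs from manifest.json."""
    root = Path(bundle_root)
    manifest = json.loads((root / "manifest.json").read_text())
    mismatched = []
    for rel_path, expected in manifest["file_hashes"].items():
        path = root / rel_path
        if not path.is_file():
            mismatched.append(rel_path)
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if expected != f"sha256:{digest}":  # assumed "sha256:<hex>" format
            mismatched.append(rel_path)
    return mismatched


# Demo on a throwaway bundle containing one fake table file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    table = root / "tables" / "leads.parquet"
    table.parent.mkdir()
    table.write_bytes(b"not real parquet, just bytes to hash")
    digest = hashlib.sha256(table.read_bytes()).hexdigest()
    manifest = {"file_hashes": {"tables/leads.parquet": f"sha256:{digest}"}}
    (root / "manifest.json").write_text(json.dumps(manifest))

    clean = verify_hashes(root)      # untouched bundle
    table.write_bytes(b"tampered")
    tampered = verify_hashes(root)   # same bundle after editing the file
```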
75 changes: 75 additions & 0 deletions website/docs/concepts/overview.md
@@ -0,0 +1,75 @@
---
sidebar_position: 1
title: Overview
---

# How leadforge works

`leadforge` generates datasets by **simulating a commercial world**, not by sampling rows from a distribution. This distinction matters:

- A distribution-sampler can reproduce the statistical shape of a CRM dataset.
- A world-simulator produces rows that have *reasons* — leads convert because they fit the ICP, have high urgency, and were engaged by a persistent SDR; leads don't convert because they stalled in technical review, or because the champion left the company.

That structure is what makes the data useful for teaching: there is something real to find, and it can be found with the right feature engineering and model choices.

## The generation pipeline

Generation runs in five sequential layers, each deterministic given the same seed:

```
1. Hidden world structure ← sample motif family, rewire DAG
2. Mechanism layer ← assign mechanisms to every node
3. Population layer ← create accounts, contacts, leads with latent traits
4. Simulation ← run 90-day daily event loop
5. Rendering ← snapshot-safe feature extraction + relational export
```

### 1. Hidden world structure

A directed acyclic graph (DAG) of latent traits, pipeline states, and the conversion outcome is sampled from one of five **motif families** and then stochastically rewired. The motif families are:

| Family | What drives conversion |
|---|---|
| `fit_dominant` | Account/ICP fit is the primary signal |
| `intent_dominant` | Buying intent signals (sessions, demo requests) dominate |
| `sales_execution_sensitive` | SDR and AE behaviour is the strongest lever |
| `demo_trial_mediated` | Conversion is gated on a demo or trial event |
| `buying_committee_friction` | Multi-stakeholder dynamics create the main noise |

### 2. Mechanism layer

Every node in the sampled graph gets a concrete mechanism — a logistic latent score, Poisson intensity, recency-decayed engagement intensity, categorical channel influence, stage transition hazard, or conversion hazard. Parameters are calibrated per difficulty tier.
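
Two of the mechanism types named above can be written in a few lines. The functional forms here are common textbook choices and are assumptions for illustration: the document does not state the exact formulas `leadforge` uses, and the parameter names (`weights`, `bias`, `half_life_days`) are hypothetical.

```python
import math


def logistic_score(traits, weights, bias=0.0):
    """Logistic latent score: squash a weighted sum of parent traits into (0, 1)."""
    z = bias + sum(weights[name] * traits[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


def recency_decayed_intensity(event_days, today, half_life_days=14.0):
    """Engagement intensity: each past event counts for half as much per half-life of age."""
    return sum(
        0.5 ** ((today - day) / half_life_days)
        for day in event_days
        if day <= today  # events in the future contribute nothing
    )


fit = logistic_score({"account_fit_score": 1.5}, {"account_fit_score": 2.0}, bias=-1.0)
engagement = recency_decayed_intensity(event_days=[0, 14, 28], today=28)
```

Under this reading, calibrating "per difficulty tier" would mean choosing different values for parameters such as `weights`, `bias`, and `half_life_days` in each tier.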

### 3. Population layer

Accounts (1,500), contacts (4,200), and leads (5,000) are instantiated with deterministic IDs (`acct_000001`, `lead_000001`) and latent trait vectors drawn from the world graph.
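
A short sketch of the ID scheme and of seed-stable trait draws follows. Only the ID format (`acct_000001`, `lead_000001`) comes from the text above; the helper names are hypothetical, and drawing independent uniform values is a deliberate simplification of "latent trait vectors drawn from the world graph".

```python
import random


def make_ids(prefix, count):
    """Zero-padded, 1-based IDs such as acct_000001."""
    return [f"{prefix}_{i:06d}" for i in range(1, count + 1)]


def sample_traits(entity_id, seed, trait_names):
    """Seed a private RNG from (seed, entity_id) so each entity's draws are stable."""
    rng = random.Random(f"{seed}:{entity_id}")
    return {name: rng.random() for name in trait_names}


account_ids = make_ids("acct", 1500)
lead_ids = make_ids("lead", 5000)
traits = sample_traits(account_ids[0], seed=42, trait_names=["account_fit_score", "urgency_score"])
```

Giving every entity its own seeded generator means its traits do not change when the population size or iteration order changes, which is one simple way to keep generation deterministic.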

### 4. Simulation

A hybrid discrete-time simulator runs a 90-day daily loop. Each day, each active lead may:

- receive a touch (email, call, demo, etc.)
- generate a session
- receive a sales activity
- advance or stall in the pipeline stage sequence
- convert (via a calibrated hazard function)

Everything is event-derived — the `converted_within_90_days` label emerges from simulated events, not from a directly sampled Bernoulli.
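
The difference between an event-derived label and a sampled one can be shown with a toy daily loop. Everything numeric here is an assumption made for the example (`base_hazard`, `touch_prob`, `lift_per_touch`), and the loop models only touches and conversion rather than the full set of daily events listed above.

```python
import random


def simulate_lead(seed, days=90, base_hazard=0.002, touch_prob=0.15, lift_per_touch=0.001):
    """Toy daily loop: touches accumulate and raise the daily conversion hazard."""
    rng = random.Random(seed)
    events = []
    touches = 0
    for day in range(days):
        if rng.random() < touch_prob:
            touches += 1
            events.append((day, "touch"))
        hazard = min(1.0, base_hazard + lift_per_touch * touches)
        if rng.random() < hazard:
            events.append((day, "converted"))
            break  # a converted lead leaves the active set
    return events


events = simulate_lead(seed=7)
# The label is read off the event log; it is never drawn directly.
converted_within_90_days = any(kind == "converted" for _, kind in events)
```

Since the label is a function of the event log, every positive row has a dated conversion event and a history of touches behind it, which is what gives downstream features something real to explain.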

### 5. Rendering

The simulation state is projected into:

- 9 relational tables (7 in public bundles, where rows are also snapshot-filtered to events on or before the anchor day)
- A flat ML-ready task table (the train/valid/test splits)
- Metadata files (manifest, feature dictionary, dataset card)

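The **exposure mode** (for example `student_public`) controls what gets written: public bundles omit the instructor-only tables and the `metadata/` directory.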

## Reproducibility

All generation is deterministic given `(recipe, config, seed, package version)`. The seed is recorded in `manifest.json` along with the package version, so any bundle can be exactly reproduced.
47 changes: 47 additions & 0 deletions website/docs/concepts/world-simulation.md
@@ -0,0 +1,47 @@
---
sidebar_position: 2
title: World simulation
---

# World simulation

## The fictional world

Every `leadforge` dataset is grounded in a fictional but internally consistent commercial world. For v1, that world is:

> **Veridian Technologies**, a mid-market B2B SaaS company selling procurement and AP workflow automation software ("Veridian Procure") to 200–2,000 employee firms in the US and UK, through a mixed inbound, SDR-assisted, and partner-driven go-to-market motion.

The company narrative, product details, buyer personas, and funnel structure are all declared in a **recipe YAML** and rendered into the dataset card, feature descriptions, and metadata.

## Entities

The simulation tracks 9 entity types, mirroring a real CRM:

| Table | What it represents |
|---|---|
| `accounts` | Companies (the buying org) |
| `contacts` | People at each account |
| `leads` | A contact at a specific account entering the funnel |
| `touches` | Outbound and inbound engagement events |
| `sessions` | Website/product sessions |
| `sales_activities` | SDR/AE-logged activities (calls, emails, meetings) |
| `opportunities` | Formal pipeline records |
| `customers` | Post-conversion account status (instructor only) |
| `subscriptions` | Subscription records (instructor only) |

## The latent trait system

Each entity carries a vector of latent traits that are **not directly observable** in the student dataset. Examples:

- `account_fit_score` — how well the account matches the ICP
- `contact_authority` — decision-making authority of the contact
- `problem_awareness` — how aware the buyer is of the problem being solved
- `urgency_score` — time pressure at the account

These traits modulate the probabilities of events and transitions during simulation. The instructor companion exposes them; the student dataset does not.

## Snapshot safety

The primary task is predicting conversion *within 90 days of a snapshot anchor date*, and each lead has its own anchor date. All features in the public dataset are computed from events **on or before** that date; this is enforced at rendering time, not by convention.

Columns and tables that would allow label reconstruction via joins (e.g., `customers`, `subscriptions`, terminal-stage opportunity fields) are excluded from the public bundle entirely.
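
The per-lead cutoff amounts to a filter applied before any feature is computed. The sketch below is an illustration with hypothetical field names (`lead_id`, `event_date`, `kind`) and invented dates; it does not reflect the actual `leadforge` schema.

```python
from datetime import date


def snapshot_filter(events, anchor_by_lead):
    """Keep only events dated on or before the anchor date of the lead they belong to."""
    return [e for e in events if e["event_date"] <= anchor_by_lead[e["lead_id"]]]


anchors = {"lead_000001": date(2024, 3, 1), "lead_000002": date(2024, 3, 15)}
events = [
    {"lead_id": "lead_000001", "event_date": date(2024, 2, 20), "kind": "session"},
    {"lead_id": "lead_000001", "event_date": date(2024, 3, 1), "kind": "touch"},
    {"lead_id": "lead_000001", "event_date": date(2024, 3, 2), "kind": "demo"},  # after anchor
    {"lead_id": "lead_000002", "event_date": date(2024, 3, 10), "kind": "touch"},
]
public_events = snapshot_filter(events, anchors)
```

Because the cutoff is looked up per lead, the same calendar date can be visible for one lead and hidden for another, which is why a single global date filter would not be enough.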