Karpathy's autoresearch loop, adapted for SaaS scoring models.
Karpathy's autoresearch runs an AI agent that edits training code, measures validation loss, keeps improvements, and repeats - hundreds of experiments overnight on a single GPU. We took the same loop and pointed it at GTM scoring models: health scores, churn risk, ICP fit. Instead of val_bpb, we optimize for revenue retention by tier. Instead of 5-minute GPU runs, we score 5,000 accounts in 56ms. The constraints are the same: one mutable artifact, one metric, keep-or-revert, log everything.
SaaS teams change their scoring models constantly — but almost nobody logs these changes as experiments. In ML that's malpractice. In GTM scoring it's standard practice.
A typical workflow:
- VP of CS says "we should weight NPS higher"
- Someone updates a config in Gainsight or a spreadsheet
- Nobody records what the old weights were
- Nobody measures if the change improved prediction
- Three months later, another tweak on top
- Nobody can tell which version was best
This is the equivalent of editing train.py, never committing, and deleting your terminal history.
Most teams already experiment with their scoring models — they build geo-specific variants, test different feature sets, validate against revenue. The problem isn't a lack of experimentation. It's a lack of memory. Every weight change is a lost experiment because nobody logs what was tried, what it replaced, or whether it actually worked.
gtm-autoresearch adds the memory layer. Every weight change becomes a logged, comparable, reversible experiment. The engine finds better weights and proves they work — and the journal means the next experiment learns from all the previous ones instead of starting blind.
┌─────────────────────────────────────────────┐
│ Channel Execution Layer │
│ Campaign variants, HOTL governance, │
│ multi-agent orchestration, API execution │
│ → Described in the playbook (see Related) │
├─────────────────────────────────────────────┤
│ Experiment Engine (THIS REPO) │
│ Weight optimization, experiment journal, │
│ 13 GTM-native metrics, keep-or-revert │
├─────────────────────────────────────────────┤
│ Your ICP Foundation │
│ Scoring model, enrichment pipeline, │
│ segmentation, ground truth data │
│ → You build this (or already have it) │
└─────────────────────────────────────────────┘
The bottom layer is the hardest to build and the most defensible - your ICP model, your enrichment pipeline, your ground truth. This repo sits in the middle: it takes your foundation and adds experiment discipline. The channel execution layer (running campaign variants, A/B tests on email/ads/pages) sits above and is described in the playbook, not this repo.
What this repo does not do:
- Send emails or run ad campaigns. This optimizes scoring models, not campaign assets.
- Replace your ICP model. You need a scoring foundation with features and known outcomes first. This makes it better.
- Require an LLM. Mutations are random weight perturbations - cheap, fast, and sufficient for the weight-space search.
- Touch your production systems. The engine reads data, runs experiments in memory, and writes to a journal file. Nothing else.
The original autoresearch is a 630-line Python harness built around three constraints: one mutable file (train.py), one evaluation metric (val_bpb), and a keep-or-revert rule enforced by git. The agent runs hundreds of experiments overnight, each bounded to 5 minutes on a single GPU.
We preserved the constraints. We changed what they operate on.
| Karpathy's autoresearch | gtm-autoresearch |
|---|---|
| Mutable artifact: train.py (GPT training code) | Mutable artifact: scoring spec (feature weights) |
| Experiment: 5-min GPU training run | Experiment: score all accounts (56ms for 5K) |
| Metric: val_bpb (bits per byte) | Metric: AUC-ROC, Revenue Capture, NRR by tier |
| Mutation: AI edits Python code | Mutation: random weight perturbation |
| Rollback: git reset | Rollback: keep-or-revert (journal preserves all) |
| Budget: 1 GPU, 5 min/experiment | Budget: 1 CPU, <100ms/experiment |
| Scale: ~100 experiments/day | Scale: ~500 experiments/second |
| Evaluation: single scalar (val_bpb) | Evaluation: 13 GTM-native metrics (NRR, GRR, HHI...) |
- The evaluation harness is read-only. The loop can change weights. It cannot change how results are measured. This is the core governance constraint that prevents metric gaming.
- Every experiment is logged. Append-only journal. Full weight snapshots. Reproducible.
- Keep-or-revert discipline. Only improvements survive.
- Speed enables exploration. Karpathy's insight: the loop must be fast enough to run hundreds of experiments, not dozens. We vectorized scoring in NumPy to reach 300-500 experiments per second.
- Features are frozen during runs. Adding features is a human decision outside the loop. The agent optimizes weights on a fixed set.
- Multi-metric evaluation. Karpathy uses one scalar. GTM scoring needs NRR, churn prediction, expansion rate, and fairness across segments - simultaneously. We optimize one primary metric but report all 13 so humans can catch proxy drift.
- GTM-native metrics. Net Revenue Retention by tier, winback rate by former tier, revenue concentration (HHI). These are what RevOps teams use to make resource decisions. MLflow doesn't know what NRR is.
- No LLM required. Karpathy's agents use LLMs to edit code. Our mutations are random weight perturbations - cheap, fast, and sufficient for the weight-space search. The search space (15-25 features, integer weights 0-25) is small enough that random hill-climbing explores it thoroughly in minutes.
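The loop described above can be sketched in a few lines. This is a minimal illustration, not the code in engine/scoring_engine.py: the function names `evaluate` and `hill_climb`, the one-weight-at-a-time perturbation of up to ±3, and the in-memory journal are all assumptions made for the example. The AUC here is a simple rank-based version that does not average tied scores.

```python
import json
import numpy as np

def evaluate(weights, X, y):
    """Read-only harness: rank-based AUC of the weighted score against y (0/1)."""
    scores = X @ weights
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = lowest score
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def hill_climb(X, y, n_experiments=500, w_min=0, w_max=25, seed=0):
    """Random hill-climbing over integer weights with keep-or-revert."""
    rng = np.random.default_rng(seed)
    weights = np.full(X.shape[1], 5)        # hypothetical starting weights
    best = evaluate(weights, X, y)
    journal = []                            # one JSON line per experiment, never edited
    for i in range(n_experiments):
        candidate = weights.copy()
        j = rng.integers(X.shape[1])        # pick one feature to perturb
        candidate[j] = np.clip(candidate[j] + rng.integers(-3, 4), w_min, w_max)
        value = evaluate(candidate, X, y)
        outcome = "win" if value > best else "revert"
        journal.append(json.dumps({
            "experiment_id": f"exp_{i:05d}",
            "baseline_value": round(float(best), 4),
            "variant_value": round(float(value), 4),
            "outcome": outcome,
            "weights_snapshot": candidate.tolist(),
        }))
        if outcome == "win":                # keep; otherwise the candidate is dropped
            weights, best = candidate, value
    return weights, best, journal
```

Note that `hill_climb` can only propose weights; it never touches `evaluate`, which mirrors the read-only harness constraint. Reverted experiments still land in the journal, so nothing that was tried is lost.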
We generated 5,000 accounts with known planted signals (some features predict retention, others are noise) and ran 2,000 experiments in 6.8 seconds. The engine recovered all planted signals in the correct rank order:
| Signal | Planted strength | Engine found | Recovered? |
|---|---|---|---|
| alert_created | Strongest | Weight 25 (highest) | Yes |
| session_depth | 2nd | Weight 23 | Yes |
| activation_speed | 3rd | Weight 20 | Yes |
| has_exported | Moderate | Weight 10 | Yes |
| credit_limit_hits | Weak positive | Weight 6 (positive) | Yes |
| noise_feature_1 | Zero | Weight 1 (near zero) | Yes |
| noise_feature_2 | Zero | Weight 1 (near zero) | Yes |
AUC-ROC: 0.63 -> 0.75 | Tier separation: 1.38 -> 1.85 | Time: 6.8 seconds
# Reproduce this yourself:
python3 examples/health_score/generate_data.py
python3 engine/scoring_engine.py --n 2000 --metric auc_roc_retain
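To make "planted signals" concrete, here is one way such a dataset could be built. This is a sketch under assumptions, not the contents of examples/health_score/generate_data.py: the function name `generate_planted`, the logistic model, and the effect sizes are hypothetical. Only the feature names are taken from the table above.

```python
import numpy as np

def generate_planted(n=5000, seed=0):
    """Synthetic accounts where some features drive retention and one is pure noise."""
    rng = np.random.default_rng(seed)
    names = ["alert_created", "session_depth", "activation_speed", "noise_feature_1"]
    strengths = np.array([1.5, 1.0, 0.6, 0.0])   # hypothetical planted effects
    X = rng.random((n, len(names)))              # features uniform in [0, 1)
    logit = (X - 0.5) @ strengths                # noise feature contributes nothing
    p_retain = 1.0 / (1.0 + np.exp(-logit))
    retained = (rng.random(n) < p_retain).astype(int)
    return names, X, retained

def feature_outcome_correlations(X, retained):
    """Pearson correlation of each feature column with the 0/1 outcome."""
    return np.array([np.corrcoef(X[:, j], retained)[0, 1] for j in range(X.shape[1])])
```

Because the planted strengths are known, a weight search can be judged on whether it ranks the features in the planted order and leaves the noise feature near zero.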
Also tested against a production B2B SaaS dataset (tens of thousands of accounts, multiple data sources, known retention and revenue outcomes). The engine found signal hierarchies that manual weight-tuning missed - including cases where signals assumed to be negative (user friction events) turned out to be strong positive engagement indicators. Details available on request.
If you already have account data with known outcomes (retention, churn, revenue):
# 1. Prepare a parquet file with columns: your scoring features + outcomes
# Required outcome columns (at least one): retained_6mo, churned, actual_ltv, mrr
# Feature columns: everything else (the engine auto-detects them)
# 2. Run 2,000 experiments (~7 seconds)
python3 engine/scoring_engine.py --n 2000 --metric revenue_capture_at_20 --data your_ground_truth.parquet
# 3. Review what the engine found
python3 analysis/review.py experiment_journal.jsonl
No data yet? Generate a test dataset with planted signals to see how the engine works:
# 1. Generate synthetic data (5K accounts, planted signals)
python3 examples/health_score/generate_data.py
# 2. Run 2,000 experiments (~7 seconds)
python3 engine/scoring_engine.py --n 2000 --metric auc_roc_retain
# 3. Review results - the engine should recover all planted signals
python3 analysis/review.py experiment_journal.jsonl
# 4. Run validation tests
python3 tests/test_engine.py
model: health_score
max_score: 100
features:
alert_created:
weight: 5 # Starting weight (the engine will optimize this)
min: 0
max: 25
session_depth:
weight: 5
min: 0
max: 25
# ... your features
tiers:
A: {min: 60, max: 101}
B: {min: 35, max: 60}
C: {min: 15, max: 35}
D: {min: 0, max: 15}
Or skip the spec - the engine auto-detects features from your parquet columns.
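As a rough illustration of how a spec like the one above could turn into scores and tiers, the sketch below assumes features are scaled to [0, 1], that the score is the weighted sum normalized to `max_score`, and that tier bands are half-open intervals `[min, max)` (which would explain why tier A ends at 101). These are assumptions for the example; the engine's actual scoring may differ, and `score_accounts` and `assign_tiers` are hypothetical names.

```python
import numpy as np

# Tier bands copied from the spec above, read as half-open [min, max).
TIERS = {"A": (60, 101), "B": (35, 60), "C": (15, 35), "D": (0, 15)}

def score_accounts(X, weights, max_score=100):
    """Weighted sum of [0, 1] features, rescaled to 0..max_score.

    Assumes at least one weight is non-zero.
    """
    weights = np.asarray(weights, dtype=float)
    return max_score * (X @ weights) / weights.sum()

def assign_tiers(scores, tiers=TIERS):
    """Map each score to the tier whose [min, max) band contains it."""
    scores = np.asarray(scores, dtype=float)
    labels = np.empty(len(scores), dtype="<U1")
    for name, (lo, hi) in tiers.items():
        labels[(scores >= lo) & (scores < hi)] = name
    return labels
```

With half-open bands, a score of exactly 60 lands in tier A and a score of exactly 15 lands in tier C, so no account falls between tiers.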
A parquet file with one row per account. Columns: features + outcomes (retained_6mo, churned, actual_ltv, mrr).
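The column convention can be checked before a run. The helper below is a hypothetical pre-flight check, not part of the repo: it treats the four documented outcome names as outcomes and everything else as a feature. The `id_columns` parameter is an assumption added here so an identifier column is not mistaken for a feature.

```python
OUTCOME_COLUMNS = ("retained_6mo", "churned", "actual_ltv", "mrr")

def split_columns(columns, id_columns=("account_id",)):
    """Split column names into (features, outcomes) using the documented convention."""
    outcomes = [c for c in columns if c in OUTCOME_COLUMNS]
    if not outcomes:
        raise ValueError(
            "need at least one outcome column: " + ", ".join(OUTCOME_COLUMNS)
        )
    features = [
        c for c in columns
        if c not in OUTCOME_COLUMNS and c not in id_columns
    ]
    return features, outcomes

# Typical use with pandas (requires pandas and pyarrow):
#   features, outcomes = split_columns(list(df.columns))
#   df.to_parquet("your_ground_truth.parquet")
```

Running this on your column list first makes it obvious which columns the search will treat as tunable features.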
python3 engine/scoring_engine.py \
--n 2000 \
--metric auc_roc_retain \
--data your_ground_truth.parquet
Available metrics: auc_roc_retain, auc_roc_churn, revenue_capture_at_20, tier_separation
Every experiment produces a JSONL entry:
{
"experiment_id": "exp_00034",
"metric": "auc_roc_retain",
"baseline_value": 0.6492,
"variant_value": 0.6610,
"outcome": "win",
"changes": ["alert_created: 5 -> 12", "session_depth: 5 -> 8"],
"weights_snapshot": {"alert_created": 12, "session_depth": 8},
"metrics": {
"auc_roc_retain": 0.6610,
"nrr_by_tier": {"A": 1.0, "B": 0.92, "C": 0.59, "D": 0.32},
"revenue_concentration_hhi": 3953
}
}
The 13 GTM-native metrics are what separate gtm-autoresearch from MLflow or generic experiment tracking. Every metric here is something a RevOps or CS team uses to make resource allocation decisions.
| Metric | What it answers |
|---|---|
| NRR by tier | Do higher-scored accounts retain more revenue? |
| GRR by tier | Pure retention without expansion masking it |
| Logo retention by tier | Account count retention (not revenue-weighted) |
| Expansion rate by tier | Do higher-scored accounts upsell? |
| Contraction rate by tier | Who's downgrading? (catches the "retain logo, lose revenue" pattern) |
| Quick ratio by tier | Growth efficiency per segment |
| Winback rate by former tier | Do historically high-scored churned accounts return? |
| Activation rate by tier | Do higher tiers activate faster? |
| ARPA by tier | Are tiers just proxies for plan price? |
| Revenue concentration (HHI) | Is too much MRR in one tier? (guardrail against gaming) |
| Revenue Capture @20% | Does the top 20% of scored accounts contain the most revenue? |
| AUC-ROC | Binary classification accuracy for retention/churn |
| Tier separation | LTV ratio between top and bottom quartiles |
See docs/metrics.md for detailed definitions and benchmarks.
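For orientation, three of the metrics in the table can be computed as follows. These use common textbook definitions and are assumptions, not the repo's implementations; docs/metrics.md remains the authoritative source. The function names and the `mrr_start` / `mrr_end` inputs are hypothetical.

```python
import numpy as np

def nrr_by_tier(tiers, mrr_start, mrr_end):
    """Net revenue retention per tier: ending MRR / starting MRR of the same accounts."""
    return {
        str(t): float(mrr_end[tiers == t].sum() / mrr_start[tiers == t].sum())
        for t in np.unique(tiers)
    }

def revenue_concentration_hhi(tiers, mrr):
    """Herfindahl-Hirschman index over tier MRR shares, on the 0..10000 scale."""
    shares = np.array([mrr[tiers == t].sum() for t in np.unique(tiers)]) / mrr.sum()
    return float(((100 * shares) ** 2).sum())

def revenue_capture_at(scores, revenue, top_frac=0.20):
    """Share of total revenue held by the top `top_frac` of accounts ranked by score."""
    k = max(1, int(round(top_frac * len(scores))))
    top = np.argsort(-scores)[:k]           # indices of the k highest scores
    return float(revenue[top].sum() / revenue.sum())
```

On this scale an HHI of 10000 means all MRR sits in a single tier, and four equal tiers give 2500, which is why it works as a guardrail against a scoring model that piles everything into one tier.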
gtm-autoresearch/
├── README.md
├── LICENSE
├── engine/
│ └── scoring_engine.py # Core: score + evaluate + experiment loop
├── analysis/
│ └── review.py # Journal analysis + comparison
├── examples/
│ ├── health_score/
│ │ └── generate_data.py # Synthetic data: retention signals
│ └── churn_risk/
│ └── generate_data.py # Synthetic data: churn decay signals
├── spec/
│ └── example_spec.json # Example scoring spec (output of a run)
├── docs/
│ ├── methodology.md # Full Karpathy adaptation explained
│ └── metrics.md # All 13 evaluation metrics defined
└── tests/
└── test_engine.py # 4 validation tests including signal recovery
Requirements:
- Python 3.9+
- numpy, pandas, pyarrow
- scikit-learn (for AUC-ROC)
License: MIT
Related:
- Autonomous GTM Experimentation - the Mazorda playbook that explains when and how to apply this pattern
- Karpathy's autoresearch - the original ML experimentation loop this project is based on
Acknowledgments:
- Andrej Karpathy - for autoresearch, the autonomous experiment loop that this project adapts for GTM
- Anthropic - for Claude Code, used throughout development
Yaniv Mazor - Founder, Mazorda. GTM engineering for B2B SaaS.
Built from production work with real SaaS account data. The pattern is reusable for any SaaS company with behavioral data and known outcomes.