
Commit f057c9d

Author: Henry Wallace

docs: update DATA_AND_MODELS with en eval metrics; add eval-en targets; calibration workflow

- DATA_AND_MODELS: multitask_en 61.5% Jabberwocky, 0.12 MAE, 0.18 Spearman; best_spearman 76.9% Jabberwocky
- Justfile: eval-en, eval-en-spearman (eval with calibration, max-samples 5000)
- Docs: calibration section references just fit-calibration and just eval-en

1 parent 555dcb6

4 files changed (13 additions, 20 deletions)

File tree

docs/guides/DATA_AND_MODELS.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -62,12 +62,12 @@ tiny-icf/
 | `multitask_all_fronts_v3.pt` | 46% | 0.26 | 0.14 | OOV calibration, pseudo-words |
 | `multitask_all_fronts_v3b.pt` | 31% | 0.09 | 0.18–0.29 | Dataset fit, ranking (best ckpt ep28). With calibration: MAE 0.078, Spearman 0.29 |
 | `multitask_all_fronts_v4.pt` | 31% | 0.28 | 0.07 | Better "the"/common-word calibration (124K params) |
-| `multitask_en.pt` | 62% | 0.12 | 0.08 | English-only; freq-weighted sampling (best ep by val_loss). With calibration: MAE 0.12, Jabberwocky 62%. Head words in band. Use for low MAE. |
-| `multitask_en_best_spearman.pt` | | | | Same training run, checkpoint with **best val_spearman_corr** (not best val_loss). Use when ranking/ordering matters more than MAE. Created when using `--export-best-by-spearman`. |
+| `multitask_en.pt` | 61.5% | 0.12 | 0.18 | English-only; freq-weighted sampling (best ep by val_loss). With calibration: MAE 0.12, Spearman 0.18. Head words in band. Use for low MAE. |
+| `multitask_en_best_spearman.pt` | 76.9% | 0.12 | 0.15 | Same run, checkpoint with **best val_spearman_corr**. With calibration: better Jabberwocky (77%), similar MAE. Use when OOV/gibberish discrimination matters. Created with `--export-best-by-spearman`. |
 
 Download from S3: `aws s3 cp s3://arclabs-backups/tiny-icf/models/<name>.pt models/`
 
-**Calibration:** Fit affine calibration for better MAE: `uv run python scripts/fit_calibration.py --model models/<name>.pt --data data/word_frequency.csv` → writes `<name>.pt.cal.json`. Use `--calibration models/<name>.pt.cal.json` with predict or evaluate_model. Pre-fit calibration for v3b is on S3: `aws s3 cp s3://arclabs-backups/tiny-icf/models/multitask_all_fronts_v3b.pt.cal.json models/`.
+**Calibration:** Fit affine calibration for better MAE: `just fit-calibration MODEL=models/<name>.pt DATA=data/word_frequency.csv` (or `uv run python scripts/fit_calibration.py ...`) → writes `<name>.pt.cal.json`. Eval with calibration: `just eval-en` / `just eval-en-spearman`, or `uv run python scripts/evaluate_model.py --model ... --data ... --calibration <name>.pt.cal.json`. Pre-fit calibration for v3b is on S3: `aws s3 cp s3://arclabs-backups/tiny-icf/models/multitask_all_fronts_v3b.pt.cal.json models/`.
 
 Sync to S3: `just sync-s3` (or `aws s3 sync models/ s3://arclabs-backups/tiny-icf/models/ --exclude "*" --include "multitask_*.pt" --include "v3_base*.pt" --include "*.pt.cal.json"`). After training export and optional `just fit-calibration`, run sync to upload.
```
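The affine calibration itself is not shown in this diff. As a rough illustration of the idea only (a closed-form least-squares fit of `target ≈ a * pred + b`, with `a` and `b` later applied to raw predictions; the actual `scripts/fit_calibration.py` and `tiny_icf.calibration` internals may differ, and the data below is made up):

```python
def fit_affine(pred, target):
    """Closed-form least-squares fit of target ≈ a * pred + b."""
    n = len(pred)
    mx = sum(pred) / n
    my = sum(target) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(pred, target))
    var = sum((x - mx) ** 2 for x in pred)
    a = cov / var
    b = my - a * mx
    return a, b

def apply_affine(preds, a, b):
    """Apply the fitted affine map to raw model outputs."""
    return [a * x + b for x in preds]

# Hypothetical raw outputs with a constant bias against the targets.
pred = [1.0, 2.0, 3.0, 4.0]
target = [1.5, 2.5, 3.5, 4.5]
a, b = fit_affine(pred, target)          # a == 1.0, b == 0.5
calibrated = apply_affine(pred, a, b)
mae = sum(abs(p - t) for p, t in zip(calibrated, target)) / len(target)  # ~0.0
```

A two-parameter map like this can only shift and scale predictions, which is consistent with the diff above: calibration improves MAE but leaves Jabberwocky/ranking behavior largely as-is.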

docs/guides/MAKING_IT_GOOD_MINIMAL_HEURISTICS.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -35,7 +35,7 @@ Research-backed, low-heuristic improvements. No hand-picked anchor words or ad-h
 - `tiny_icf.calibration`: `load_calibration`, `save_calibration`, `apply_affine`.
 - `tiny_icf-predict --calibration <path>` and `evaluate_model.py --calibration <path>` apply the affine map to predictions.
 
-Usage: `uv run python scripts/fit_calibration.py --model models/multitask_all_fronts_v3b.pt --data data/word_frequency.csv` then pass `--calibration models/multitask_all_fronts_v3b.pt.cal.json` to predict or evaluate_model.
+Usage: `just fit-calibration MODEL=models/<name>.pt DATA=data/word_frequency.csv` then `just eval-en` / `just eval-en-spearman`, or `evaluate_model.py --calibration <name>.pt.cal.json`.
 
 ---
```

justfile

Lines changed: 6 additions & 0 deletions
```diff
@@ -55,6 +55,12 @@ fit-calibration MODEL="models/multitask_all_fronts_v3b.pt" DATA="data/word_frequ
 eval-v3b-cal MODEL="models/multitask_all_fronts_v3b.pt" DATA="data/word_frequency.csv" CAL="models/multitask_all_fronts_v3b.pt.cal.json":
     uv run python scripts/evaluate_model.py --model {{MODEL}} --data {{DATA}} --calibration {{CAL}}
 
+# Eval English-only models (with calibration if .cal.json exists)
+eval-en MODEL="models/multitask_en.pt" DATA="data/word_frequency.csv" CAL="models/multitask_en.pt.cal.json":
+    uv run python scripts/evaluate_model.py --model {{MODEL}} --data {{DATA}} --calibration {{CAL}} --max-samples 5000
+eval-en-spearman MODEL="models/multitask_en_best_spearman.pt" DATA="data/word_frequency.csv" CAL="models/multitask_en_best_spearman.pt.cal.json":
+    uv run python scripts/evaluate_model.py --model {{MODEL}} --data {{DATA}} --calibration {{CAL}} --max-samples 5000
+
 # Debug head-word predictions (base vs lang correction; --data shows target ICF)
 debug-the MODEL="models/multitask_all_fronts_v3b.pt" DATA="data/word_frequency.csv" *args:
     uv run python scripts/debug_the_prediction.py --model {{MODEL}} --data {{DATA}} {{args}}
```
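The two eval targets exist because the two checkpoints optimize different things: MAE rewards predicted values being close to the targets, while Spearman rewards getting the ordering right, and a model can be good at one and poor at the other. A minimal sketch with made-up numbers (not the repo's evaluation code) showing how the metrics diverge:

```python
def ranks(xs):
    """Rank positions of each value (no ties assumed)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

target = [0.1, 0.2, 0.3, 0.4, 0.5]          # hypothetical ICF targets
pred_offset = [t + 0.3 for t in target]     # perfect ordering, constant bias
pred_noisy = [0.2, 0.1, 0.4, 0.3, 0.5]      # close values, scrambled order

mae = lambda p: sum(abs(a - b) for a, b in zip(p, target)) / len(target)
mae_offset, rho_offset = mae(pred_offset), spearman(pred_offset, target)  # 0.30, 1.0
mae_noisy, rho_noisy = mae(pred_noisy), spearman(pred_noisy, target)      # 0.08, 0.8
```

Affine calibration can remove the constant bias (fixing the first model's MAE) but cannot repair ordering, which is presumably why the best-Spearman checkpoint is exported and evaluated separately.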

uv.lock

Lines changed: 3 additions & 16 deletions
Some generated files are not rendered by default.
