
Commit f057c9d

Author: Henry Wallace

docs: update DATA_AND_MODELS with en eval metrics; add eval-en targets; calibration workflow

- DATA_AND_MODELS: multitask_en 61.5% Jabberwocky, 0.12 MAE, 0.18 Spearman; best_spearman 76.9% Jabberwocky
- Justfile: eval-en, eval-en-spearman (eval with calibration, max-samples 5000)
- Docs: calibration section references just fit-calibration and just eval-en

1 parent 555dcb6

4 files changed (13 additions, 20 deletions)

File tree

docs/guides/DATA_AND_MODELS.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -62,12 +62,12 @@ tiny-icf/
 | `multitask_all_fronts_v3.pt` | 46% | 0.26 | 0.14 | OOV calibration, pseudo-words |
 | `multitask_all_fronts_v3b.pt` | 31% | 0.09 | 0.18–0.29 | Dataset fit, ranking (best ckpt ep28). With calibration: MAE 0.078, Spearman 0.29 |
 | `multitask_all_fronts_v4.pt` | 31% | 0.28 | 0.07 | Better "the"/common-word calibration (124K params) |
-| `multitask_en.pt` | 62% | 0.12 | 0.08 | English-only; freq-weighted sampling (best ep by val_loss). With calibration: MAE 0.12, Jabberwocky 62%. Head words in band. Use for low MAE. |
-| `multitask_en_best_spearman.pt` | | | | Same training run, checkpoint with **best val_spearman_corr** (not best val_loss). Use when ranking/ordering matters more than MAE. Created when using `--export-best-by-spearman`. |
+| `multitask_en.pt` | 61.5% | 0.12 | 0.18 | English-only; freq-weighted sampling (best ep by val_loss). With calibration: MAE 0.12, Spearman 0.18. Head words in band. Use for low MAE. |
+| `multitask_en_best_spearman.pt` | 76.9% | 0.12 | 0.15 | Same run, checkpoint with **best val_spearman_corr**. With calibration: better Jabberwocky (77%), similar MAE. Use when OOV/gibberish discrimination matters. Created with `--export-best-by-spearman`. |
 
 Download from S3: `aws s3 cp s3://arclabs-backups/tiny-icf/models/<name>.pt models/`
 
-**Calibration:** Fit affine calibration for better MAE: `uv run python scripts/fit_calibration.py --model models/<name>.pt --data data/word_frequency.csv` → writes `<name>.pt.cal.json`. Use `--calibration models/<name>.pt.cal.json` with predict or evaluate_model. Pre-fit calibration for v3b is on S3: `aws s3 cp s3://arclabs-backups/tiny-icf/models/multitask_all_fronts_v3b.pt.cal.json models/`.
+**Calibration:** Fit affine calibration for better MAE: `just fit-calibration MODEL=models/<name>.pt DATA=data/word_frequency.csv` (or `uv run python scripts/fit_calibration.py ...`) → writes `<name>.pt.cal.json`. Eval with calibration: `just eval-en` / `just eval-en-spearman`, or `uv run python scripts/evaluate_model.py --model ... --data ... --calibration <name>.pt.cal.json`. Pre-fit calibration for v3b is on S3: `aws s3 cp s3://arclabs-backups/tiny-icf/models/multitask_all_fronts_v3b.pt.cal.json models/`.
 
 Sync to S3: `just sync-s3` (or `aws s3 sync models/ s3://arclabs-backups/tiny-icf/models/ --exclude "*" --include "multitask_*.pt" --include "v3_base*.pt" --include "*.pt.cal.json"`). After training export and optional `just fit-calibration`, run sync to upload.
```
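The affine calibration itself is not shown in this diff. As a rough illustration of the idea only (a closed-form least-squares fit of `target ≈ a * pred + b`, with `a` and `b` later applied to raw predictions; the actual `scripts/fit_calibration.py` and `tiny_icf.calibration` internals may differ, and the data below is made up):

```python
def fit_affine(pred, target):
    """Closed-form least-squares fit of target ≈ a * pred + b."""
    n = len(pred)
    mx = sum(pred) / n
    my = sum(target) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(pred, target))
    var = sum((x - mx) ** 2 for x in pred)
    a = cov / var
    b = my - a * mx
    return a, b

def apply_affine(preds, a, b):
    """Apply the fitted affine map to raw model outputs."""
    return [a * x + b for x in preds]

# Hypothetical raw outputs with a constant bias against the targets.
pred = [1.0, 2.0, 3.0, 4.0]
target = [1.5, 2.5, 3.5, 4.5]
a, b = fit_affine(pred, target)          # a == 1.0, b == 0.5
calibrated = apply_affine(pred, a, b)
mae = sum(abs(p - t) for p, t in zip(calibrated, target)) / len(target)  # ~0.0
```

A two-parameter map like this can only shift and scale predictions, which is consistent with the diff above: calibration improves MAE but leaves Jabberwocky/ranking behavior largely as-is.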

docs/guides/MAKING_IT_GOOD_MINIMAL_HEURISTICS.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -35,7 +35,7 @@ Research-backed, low-heuristic improvements. No hand-picked anchor words or ad-h
 - `tiny_icf.calibration`: `load_calibration`, `save_calibration`, `apply_affine`.
 - `tiny_icf-predict --calibration <path>` and `evaluate_model.py --calibration <path>` apply the affine map to predictions.
 
-Usage: `uv run python scripts/fit_calibration.py --model models/multitask_all_fronts_v3b.pt --data data/word_frequency.csv` then pass `--calibration models/multitask_all_fronts_v3b.pt.cal.json` to predict or evaluate_model.
+Usage: `just fit-calibration MODEL=models/<name>.pt DATA=data/word_frequency.csv` then `just eval-en` / `just eval-en-spearman`, or `evaluate_model.py --calibration <name>.pt.cal.json`.
 
 ---
```

justfile

Lines changed: 6 additions & 0 deletions
```diff
@@ -55,6 +55,12 @@ fit-calibration MODEL="models/multitask_all_fronts_v3b.pt" DATA="data/word_frequ
 eval-v3b-cal MODEL="models/multitask_all_fronts_v3b.pt" DATA="data/word_frequency.csv" CAL="models/multitask_all_fronts_v3b.pt.cal.json":
     uv run python scripts/evaluate_model.py --model {{MODEL}} --data {{DATA}} --calibration {{CAL}}
 
+# Eval English-only models (with calibration if .cal.json exists)
+eval-en MODEL="models/multitask_en.pt" DATA="data/word_frequency.csv" CAL="models/multitask_en.pt.cal.json":
+    uv run python scripts/evaluate_model.py --model {{MODEL}} --data {{DATA}} --calibration {{CAL}} --max-samples 5000
+eval-en-spearman MODEL="models/multitask_en_best_spearman.pt" DATA="data/word_frequency.csv" CAL="models/multitask_en_best_spearman.pt.cal.json":
+    uv run python scripts/evaluate_model.py --model {{MODEL}} --data {{DATA}} --calibration {{CAL}} --max-samples 5000
+
 # Debug head-word predictions (base vs lang correction; --data shows target ICF)
 debug-the MODEL="models/multitask_all_fronts_v3b.pt" DATA="data/word_frequency.csv" *args:
     uv run python scripts/debug_the_prediction.py --model {{MODEL}} --data {{DATA}} {{args}}
```
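The two eval targets exist because the two checkpoints optimize different things: MAE rewards predicted values being close to the targets, while Spearman rewards getting the ordering right, and a model can be good at one and poor at the other. A minimal sketch with made-up numbers (not the repo's evaluation code) showing how the metrics diverge:

```python
def ranks(xs):
    """Rank positions of each value (no ties assumed)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

target = [0.1, 0.2, 0.3, 0.4, 0.5]          # hypothetical ICF targets
pred_offset = [t + 0.3 for t in target]     # perfect ordering, constant bias
pred_noisy = [0.2, 0.1, 0.4, 0.3, 0.5]      # close values, scrambled order

mae = lambda p: sum(abs(a - b) for a, b in zip(p, target)) / len(target)
mae_offset, rho_offset = mae(pred_offset), spearman(pred_offset, target)  # 0.30, 1.0
mae_noisy, rho_noisy = mae(pred_noisy), spearman(pred_noisy, target)      # 0.08, 0.8
```

Affine calibration can remove the constant bias (fixing the first model's MAE) but cannot repair ordering, which is presumably why the best-Spearman checkpoint is exported and evaluated separately.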

uv.lock

Lines changed: 3 additions & 16 deletions
Some generated files are not rendered by default.
