
Commit 5ed1926

Author: Henry Wallace (committed)
docs: rename MAKING_IT_GOOD to CALIBRATION_AND_RANKING_GUIDE
- Replace MAKING_IT_GOOD_MINIMAL_HEURISTICS.md with CALIBRATION_AND_RANKING_GUIDE.md (neutral title)
- README and DOCS_SUMMARY point to new guide; mention `just fit-calibration` / eval-en
1 parent f057c9d commit 5ed1926

4 files changed, 66 additions & 66 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -172,4 +172,4 @@ Start with:
- `docs/PROJECT_OVERVIEW.md`
- `docs/guides/QUICK_START.md`
- `docs/guides/TRAINING_GUIDE.md`
-- `docs/guides/MAKING_IT_GOOD_MINIMAL_HEURISTICS.md` — improve calibration and ranking with minimal heuristics (frequency sampling, differentiable Spearman, learned calibration). **Calibration:** `uv run python scripts/fit_calibration.py --model models/<name>.pt --data data/word_frequency.csv` then `--calibration <name>.pt.cal.json` in predict or evaluate_model.
+- `docs/guides/CALIBRATION_AND_RANKING_GUIDE.md` — calibration and ranking: frequency-weighted sampling, differentiable Spearman (soft ranking), learned affine calibration. Use `just fit-calibration` then `just eval-en` or `evaluate_model.py --calibration <name>.pt.cal.json`.
docs/guides/CALIBRATION_AND_RANKING_GUIDE.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
# Calibration and ranking guide
Research-backed improvements for ICF calibration and ranking quality, with minimal heuristics (no hand-picked anchor words or ad-hoc sample weights).
---
## 1. Sample from the data distribution
**Problem:** The head of the distribution ("the", "and") is under-represented in training. Stratified sampling draws **uniformly within each stratum**, so "the" appears no more often than any other head word.
**Fix:** Sample **weighted by token frequency** within strata, with **replacement** so high-count words can appear many times per epoch. The model then sees head words in proportion to the data; gradients come from the real distribution.
**Where:** `data.stratified_sample` with `use_token_frequency=True` (and `replace=True`); `lightning_data_multi_task.py` passes `word_counts` and `use_token_frequency=True`. No hand-picked word list.
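As a minimal sketch (assuming NumPy; the repo's actual implementation lives in `data.stratified_sample`, and this standalone helper is only illustrative), frequency-weighted sampling with replacement can look like:

```python
import numpy as np

def frequency_weighted_sample(words, counts, n, seed=0):
    """Sample n words with replacement, weighted by token frequency.

    Head words ("the", "and") appear roughly in proportion to their
    counts, instead of once per stratum as with uniform sampling.
    """
    counts = np.asarray(counts, dtype=np.float64)
    probs = counts / counts.sum()           # normalize counts to probabilities
    rng = np.random.default_rng(seed)
    return rng.choice(words, size=n, replace=True, p=probs)

words = ["the", "and", "zygote", "quixotic"]
counts = [50_000, 30_000, 5, 2]
sample = frequency_weighted_sample(words, counts, n=1000)
# "the" dominates the sample, matching the data distribution.
```

Within each stratum the same idea applies: pass the stratum's `word_counts` as `counts` so gradients come from the real distribution rather than a uniform one.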
---
## 2. Optimize Spearman directly (differentiable soft ranking)
**Problem:** Spearman is reported as a metric but the loss uses proxies (pairwise or sigmoid-based), so we do not directly optimize what we report.
**Fix:** **Differentiable Spearman** via soft sorting: Blondel et al., "Fast Differentiable Sorting and Ranking", ICML 2020 ([arXiv:2002.08871](https://arxiv.org/abs/2002.08871)). Loss \( \frac{1}{2}\|r - r_\Psi(\theta)\|^2 \) with soft ranks \( r_\Psi \). Implementations: **torchsort** (O(n log n)), **diffsort** (O(n²(log n)²)).
**Implemented:** `loss_unified.spearman_loss_tensor` with `spearman_method="auto"`: torchsort if available, else diffsort (default dependency), else built-in soft_rank. Training logs `Spearman loss backend: <torchsort|diffsort|built-in>`. CLI: `--spearman-reg-strength`, `--spearman-method auto|torchsort|diffsort|sigmoid`. See `docs/SPEARMAN_LOSS_BACKENDS.md` for backend details.
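The torchsort/diffsort backends are faster and more robust; purely to illustrate the idea behind the built-in sigmoid-style fallback, here is a NumPy sketch (helper names are hypothetical, not the repo's API):

```python
import numpy as np

def soft_rank(x, tau=0.1):
    """Pairwise-sigmoid soft ranks: a differentiable surrogate for rank(x).

    rank_i ≈ 0.5 + sum_j sigmoid((x_i - x_j) / tau); the diagonal term
    contributes 0.5, so as tau -> 0 this approaches hard 1-based ranks
    (ties aside).
    """
    diff = x[:, None] - x[None, :]          # (n, n) pairwise differences
    sig = 1.0 / (1.0 + np.exp(-diff / tau))
    return 0.5 + sig.sum(axis=1)

def spearman_loss(pred, target, tau=0.1):
    """0.5 * ||soft_rank(pred) - rank(target)||^2, the loss form above."""
    hard = np.argsort(np.argsort(target)) + 1.0   # 1-based hard ranks
    return 0.5 * np.sum((soft_rank(pred, tau) - hard) ** 2)

x = np.array([0.1, 0.9, 0.4])
print(soft_rank(x, tau=0.01))   # close to [1., 3., 2.]
```

In the real loss the soft-rank step would run under autograd (PyTorch), so gradients flow through `tau`-smoothed comparisons; the O(n²) pairwise matrix is what the sorting-network and projection backends avoid.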
---
## 3. Calibration learned from validation
**Problem:** Common words (e.g. "the") can be over-predicted (too rare). Anchor-word fixes are heuristic.
**Fix:** **Affine calibration** on a held-out set: fit \( \hat{y} = a + b \cdot \text{pred} \) to minimize MSE; apply at inference.
**Implemented:** `scripts/fit_calibration.py` fits (a, b) on a fraction of data (default 20%), writes `<model>.pt.cal.json`. `tiny_icf.calibration`: `load_calibration`, `save_calibration`, `apply_affine`. Use `--calibration <path>` in `tiny_icf-predict` and `evaluate_model.py`.
**Usage:** `just fit-calibration MODEL=models/<name>.pt DATA=data/word_frequency.csv` then `just eval-en` / `just eval-en-spearman` or `evaluate_model.py --calibration <name>.pt.cal.json`.
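A sketch of the affine fit and its application (toy data; the JSON field names are illustrative, not necessarily the repo's `.cal.json` schema):

```python
import json
import numpy as np

def fit_affine(pred, target):
    """Least-squares fit of target ≈ a + b * pred (affine calibration)."""
    b, a = np.polyfit(pred, target, deg=1)   # polyfit returns slope first
    return float(a), float(b)

def apply_affine(pred, a, b):
    return a + b * np.asarray(pred)

# Held-out predictions vs. true ICF values (toy numbers).
val_pred = np.array([0.10, 0.35, 0.60, 0.90])
val_true = np.array([0.05, 0.30, 0.58, 0.95])
a, b = fit_affine(val_pred, val_true)

# Persist alongside the checkpoint, then reload at inference time.
blob = json.dumps({"a": a, "b": b})
params = json.loads(blob)
calibrated = apply_affine(val_pred, params["a"], params["b"])
```

Since the fit has an intercept, the calibrated residuals sum to zero on the held-out set, which is what pulls over-predicted head words back toward their true ICF.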
---
## 4. Optional: listwise ranking in the loss
**Problem:** Pairwise ranking does not enforce global order. Listwise losses (e.g. soft-rank MSE over a batch) align better with Spearman.
**Fix:** Use listwise options (`use_listwise_ranking` in the criterion); ensure batch size is large enough so that within-batch order is meaningful. Complements (2); no extra heuristics.
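One classic listwise objective is ListNet's top-one cross-entropy; a generic sketch (assuming NumPy, and not necessarily what `use_listwise_ranking` enables in the criterion):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())   # shift for numerical stability
    return z / z.sum()

def listnet_top_one_loss(scores, targets):
    """ListNet top-one loss: cross-entropy between the softmax
    distribution induced by the targets and the one induced by the
    predicted scores, over the whole batch at once. Unlike a pairwise
    loss on sampled pairs, it is minimized only when predictions
    reproduce the target distribution across the entire list."""
    p_true = softmax(targets)
    p_pred = softmax(scores)
    return -np.sum(p_true * np.log(p_pred + 1e-12))
```

Because the loss compares whole-batch distributions, batch size directly controls how much global order each gradient step sees, which is why larger batches help here.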
---
## 5. What to avoid
- **Anchor-word losses:** Prefer (1) + (3) over a fixed list of words and target ICF.
- **Hand-designed sample weights** (e.g. \( 1/\sqrt{\text{count}} \)): Prefer (1) — sample by actual frequency.
- **Synthetic OOV with a single hand-set target (e.g. 0.95):** Use a discriminative objective or derive targets from data; avoid one magic number for all OOV.
---
## Implementation order
1. **Frequency-weighted sampling** (1): datamodule change; no new deps; immediate effect on head calibration.
2. **Spearman backend** (2): ensure backend is torchsort or diffsort and `spearman_weight` is non-trivial; tune regularization if needed.
3. **Learned calibration** (3): fit (a, b) on val; apply at inference via `--calibration`.
4. **Listwise / larger batches** (4): if (1)–(3) are insufficient, enable listwise and/or increase batch size.
---
Evidence: Blondel et al., arXiv:2002.08871; codebase: `stratified_sample`, `use_token_frequency`, `SpearmanLoss` backends in `loss_unified` and `loss.py`.

docs/guides/DOCS_SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@
- `../docs/guides/DATA_AND_MODELS.md` - Where data/models come from (repo is intentionally lean)
- `../docs/guides/QUICK_START.md` - Quick start workflow
- `../docs/guides/TRAINING_GUIDE.md` - Detailed training guide
+- `../docs/guides/CALIBRATION_AND_RANKING_GUIDE.md` - Calibration and ranking (frequency sampling, differentiable Spearman, learned calibration)
- `../docs/guides/EPHEMERAL_TRAINING.md` - Training on ephemeral environments (RunPod)
- `../docs/results/EXPERIMENTS.md` - Experiment history and results

docs/guides/MAKING_IT_GOOD_MINIMAL_HEURISTICS.md

Lines changed: 0 additions & 65 deletions
This file was deleted.
