# Calibration and ranking guide

Research-backed improvements for ICF calibration and ranking quality, with minimal heuristics (no hand-picked anchor words or ad-hoc sample weights).

---

## 1. Sample from the data distribution

**Problem:** The head of the distribution ("the", "and") is under-represented in training. Stratified sampling draws **uniformly within each stratum**, so "the" appears no more often than other head words.

**Fix:** Sample **weighted by token frequency** within strata, with **replacement**, so high-count words can appear many times per epoch. The model then sees head words in proportion to the data; gradients come from the real distribution.

**Where:** `data.stratified_sample` with `use_token_frequency=True` (and `replace=True`); `lightning_data_multi_task.py` passes `word_counts` and `use_token_frequency=True`. No hand-picked word list.
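
A minimal sketch of the idea (illustrative only; the real entry point is `data.stratified_sample`, and the helper below is hypothetical):

```python
import numpy as np

def sample_by_frequency(words: list[str], counts: np.ndarray,
                        n_samples: int, rng: np.random.Generator) -> list[str]:
    """Draw words in proportion to their corpus counts, with replacement,
    so head words like "the" dominate the epoch as they do the data."""
    probs = counts / counts.sum()
    idx = rng.choice(len(words), size=n_samples, replace=True, p=probs)
    return [words[i] for i in idx]
```

Within each stratum this replaces the uniform draw; the stratification across strata is unchanged.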

---

## 2. Optimize Spearman directly (differentiable soft ranking)

**Problem:** Spearman is reported as a metric, but the loss uses proxies (pairwise or sigmoid-based), so we do not directly optimize what we report.

**Fix:** **Differentiable Spearman** via soft sorting: Blondel et al., "Fast Differentiable Sorting and Ranking", ICML 2020 ([arXiv:2002.08871](https://arxiv.org/abs/2002.08871)). Loss \( \frac{1}{2}\|r - r_\Psi(\theta)\|^2 \) with soft ranks \( r_\Psi \). Implementations: **torchsort** (O(n log n)), **diffsort** (O(n²(log n)²)).

**Implemented:** `loss_unified.spearman_loss_tensor` with `spearman_method="auto"`: torchsort if available, else diffsort (the default dependency), else the built-in soft_rank. Training logs `Spearman loss backend: <torchsort|diffsort|built-in>`. CLI: `--spearman-reg-strength`, `--spearman-method auto|torchsort|diffsort|sigmoid`. See `docs/SPEARMAN_LOSS_BACKENDS.md` for backend details.
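
A minimal sketch of the torchsort-backed loss (illustrative; `spearman_loss_tensor` is the repo's actual API, and the function below is hypothetical):

```python
import torch
import torchsort  # https://github.com/teddykoker/torchsort

def soft_spearman_loss(pred: torch.Tensor, target: torch.Tensor,
                       regularization_strength: float = 0.1) -> torch.Tensor:
    """1/2 * ||r - r_Psi(theta)||^2: squared distance between the
    differentiable soft ranks of the predictions and the exact ranks
    of the targets. `pred` and `target` are 1-D of equal length."""
    n = pred.numel()
    # torchsort operates over the last dim of a 2-D (batch, n) tensor.
    soft_ranks = torchsort.soft_rank(
        pred.unsqueeze(0), regularization_strength=regularization_strength
    ).squeeze(0)
    # Exact (hard) ranks of the targets, 1..n; no gradient needed here.
    hard_ranks = (target.argsort().argsort() + 1).float()
    # Normalize both to [0, 1] so the loss scale is independent of n.
    return 0.5 * ((soft_ranks / n - hard_ranks / n) ** 2).sum()
```

A smaller `regularization_strength` keeps the soft ranks closer to the exact ranks at the cost of sharper gradients; presumably this is what `--spearman-reg-strength` tunes.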

---

## 3. Calibration learned from validation

**Problem:** Common words (e.g. "the") can be over-predicted, i.e. assigned an ICF that makes them look rarer than they are. Anchor-word fixes for this are heuristic.

**Fix:** **Affine calibration** on a held-out set: fit \( \hat{y} = a + b \cdot \text{pred} \) to minimize MSE; apply it at inference.

**Implemented:** `scripts/fit_calibration.py` fits (a, b) on a fraction of the data (default 20%) and writes `<model>.pt.cal.json`. `tiny_icf.calibration` provides `load_calibration`, `save_calibration`, `apply_affine`. Use `--calibration <path>` in `tiny_icf-predict` and `evaluate_model.py`.

**Usage:** `just fit-calibration MODEL=models/<name>.pt DATA=data/word_frequency.csv`, then `just eval-en` / `just eval-en-spearman` or `evaluate_model.py --calibration <name>.pt.cal.json`.
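
The fit itself is ordinary least squares; a minimal sketch (hypothetical helpers; the real entry point is `scripts/fit_calibration.py`):

```python
import numpy as np

def fit_affine(pred: np.ndarray, true: np.ndarray) -> tuple[float, float]:
    """Least-squares fit of true ~= a + b * pred (minimizes MSE)."""
    b, a = np.polyfit(pred, true, deg=1)  # polyfit returns [slope, intercept]
    return float(a), float(b)

def apply_affine(pred: np.ndarray, a: float, b: float) -> np.ndarray:
    """Apply the learned calibration at inference."""
    return a + b * pred
```

Two parameters fit on held-out data are hard to overfit, which is the appeal over per-word corrections.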

---

## 4. Optional: listwise ranking in the loss

**Problem:** Pairwise ranking does not enforce a global order. Listwise losses (e.g. soft-rank MSE over a batch) align better with Spearman.

**Fix:** Use the listwise options (`use_listwise_ranking` in the criterion); ensure the batch size is large enough that within-batch order is meaningful. This complements (2); no extra heuristics.
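
For reference, a classic listwise objective is ListMLE (Xia et al., ICML 2008): the negative log-likelihood of the target ordering under a Plackett-Luce model. A minimal sketch (illustrative; `use_listwise_ranking` may implement a different listwise loss):

```python
import torch

def listmle_loss(scores: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the targets' descending order under a
    Plackett-Luce model over the batch scores (both 1-D tensors)."""
    order = targets.argsort(descending=True)
    s = scores[order]
    # -log P(order) = sum_i [ logsumexp(s_i, ..., s_n) - s_i ];
    # the reversed cumulative logsumexp yields the suffix terms.
    suffix_lse = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    return (suffix_lse - s).sum()
```

Like the soft-rank loss in (2), this only sees order within a batch, hence the batch-size caveat above.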

---

## 5. What to avoid

- **Anchor-word losses:** Prefer (1) + (3) over a fixed list of words with target ICFs.
- **Hand-designed sample weights** (e.g. \( 1/\sqrt{\text{count}} \)): Prefer (1): sample by actual frequency.
- **Synthetic OOV with a single hand-set target (e.g. 0.95):** Use a discriminative objective or derive targets from data; avoid one magic number for all OOV.

---

## Implementation order

1. **Frequency-weighted sampling** (1): a datamodule change; no new deps; immediate effect on head calibration.
2. **Spearman backend** (2): ensure the backend is torchsort or diffsort and `spearman_weight` is non-trivial; tune the regularization if needed.
3. **Learned calibration** (3): fit (a, b) on validation; apply at inference via `--calibration`.
4. **Listwise / larger batches** (4): if (1)–(3) are insufficient, enable listwise and/or increase the batch size.

---

Evidence: Blondel et al., arXiv:2002.08871; codebase: `stratified_sample`, `use_token_frequency`, `SpearmanLoss` backends in `loss_unified` and `loss.py`.