Ali Uyar, Independent Researcher
Paper title: ProbeRoute: Probes as Routing Priors for Frozen-Backbone Multi-Token Prediction
This repository accompanies a methods paper on adapting a frozen autoregressive language model for explicit multi-token prediction. The central question is sharper than a generic "better adapter" claim: does probe-initialized sparse routing improve frozen-backbone multi-token prediction relative to the strongest dense baseline selected under the same screening and final-rerun protocol? ProbeRoute probes every layer for future-token predictability, uses the resulting scores to initialize a sparse per-horizon router, and trains lightweight multi-token heads on top of the frozen backbone.
Frozen autoregressive language models are trained for next-token prediction, but their hidden states can still encode information about several future tokens. ProbeRoute tests whether that latent structure can be converted into a better explicit multi-token adapter without unfreezing the backbone. The method first runs a one-time offline probe stage across depth, then uses probe-derived top-5 scores to initialize a sparse top-m router over frozen hidden states. Under a stage-gated protocol with mandatory probes, screening baselines, a finalist-selection step, and a final 1B rerun, the resulting sparse adapter beats the strongest selected dense frozen-backbone baseline on the paper's two headline held-out metrics. On the final 1B comparison, ProbeRoute improves test top-1 (h2-4) from 0.1162 to 0.1172 and the speculative draft acceptance proxy from 1.1110 to 1.1188, while changing test NLL (h1-4) only from 5.3769 to 5.3780. Probe heatmaps reveal horizon-dependent depth structure at both 410M and 1B, the learned sparse router concentrates mass on the same depth bands, and the random-initialization ablation is weaker than probe-initialized sparse routing. By contrast, loss warmup and deeper far-horizon heads do not improve over the base sparse configuration at the tested budget. In this frozen-backbone setting, future-token probes are not merely descriptive diagnostics; they are a useful routing prior for explicit multi-token prediction.
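The probe-to-router handoff described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the top-m masking scheme, and the softmax temperature parameterization are all assumptions.

```python
import numpy as np

def init_router_from_probes(probe_scores, m=5, temperature=1.0):
    """Initialize per-horizon router weights from offline probe scores.

    probe_scores: (H, L) array, one row per prediction horizon, one
    column per frozen layer; entries are probe accuracies. Only the
    top-m layers per horizon are retained; the rest are masked out.
    Hypothetical sketch -- parameterization is an assumption.
    """
    probe_scores = np.asarray(probe_scores, dtype=np.float64)
    H, L = probe_scores.shape
    logits = np.full((H, L), -np.inf)
    for h in range(H):
        top = np.argsort(probe_scores[h])[-m:]      # top-m layers by probe score
        logits[h, top] = probe_scores[h, top] / temperature
    # Mixture weights: softmax over the m retained layers only
    # (masked layers get exactly zero weight).
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w  # (H, L), exactly m nonzero entries per row
```

Initializing from probe scores rather than uniformly means the router starts its training already concentrated on the depth bands the probes found predictive, which is the "routing prior" in the title.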
The final 1B comparison at the 50M-token budget — both finalists using probe-derived initialization — is the paper's central quantitative result:
| Model | Test top-1 (h2-4) | Test accept len. | Test NLL (h1-4) |
|---|---|---|---|
| Dense finalist | 0.1162 | 1.1110 | 5.3769 |
| ProbeRoute (ours) | 0.1172 | 1.1188 | 5.3780 |
| Delta (sparse − dense) | +0.00099 | +0.00781 | +0.00110 |
The raw deltas are small but directionally consistent: top-1 improves at every horizon, the speculative draft acceptance proxy improves, and aggregate NLL changes only minimally (about 0.02% relative). The practical point is stronger than the raw magnitude — the gain comes from a more selective routing interface, not from a heavier backbone adaptation recipe. The random-init ablation weakens the sparse model; loss warmup and deeper far-horizon heads do not improve over the base sparse configuration at the tested budget.
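For intuition on why the acceptance numbers sit just above 1: a common way to define an acceptance-length proxy for speculative drafting is the expected number of tokens accepted before the first mismatch between draft heads and the verifier. The sketch below uses that definition with per-horizon independence as a simplifying assumption; it is not necessarily the paper's exact metric.

```python
def expected_accept_length(match_probs):
    """Expected draft tokens accepted per step under greedy matching.

    match_probs[k] is the probability that the head for horizon k+2
    matches the verifier's token, treated as independent across
    horizons (an assumption; the paper's proxy may differ).
    Acceptance stops at the first mismatch, so the expectation is
    1 + p2 + p2*p3 + ... (the next token itself is always "accepted").
    """
    expected, run = 1.0, 1.0
    for p in match_probs:
        run *= p
        expected += run
    return expected
```

With per-horizon top-1 rates near 0.12, this formula lands near 1.13, the same regime as the 1.11-1.12 values in the table above, which is why small top-1 gains translate almost one-for-one into acceptance-length gains.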
- We show that future-token probe heatmaps reveal horizon-dependent depth structure at both 410M and 1B, with shorter horizons peaking later and farther horizons shifting earlier in depth.
- We turn those probe measurements into a concrete frozen-backbone adapter — a sparse top-m layer router initialized from probe top-5 scores — and evaluate it against screening-selected last-layer and dense weighted-hidden-state baselines.
- We find that the final 1B sparse run beats the strongest selected dense baseline on held-out future-token top-1 and speculative-draft acceptance metrics while relying on a more selective routing interface, and the random-init ablation is the one that meaningfully weakens the result.
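The routing interface named in the second contribution can be sketched as a per-horizon mixture of frozen hidden states feeding lightweight linear heads. All shapes, names, and the einsum formulation below are illustrative assumptions.

```python
import numpy as np

def routed_head_forward(hidden_states, router_weights, head_W):
    """Forward pass of a sparse-routed multi-token-prediction adapter.

    hidden_states:  (L, T, D) per-layer states from the frozen backbone.
    router_weights: (H, L) per-horizon mixture weights (rows sum to 1;
                    sparse, e.g. from a probe-initialized top-m router).
    head_W:         (H, D, V) one lightweight linear head per horizon.
    Returns (H, T, V) logits. Hypothetical sketch, not the repo's API.
    """
    # Mix frozen layers per horizon, then apply that horizon's head.
    mixed = np.einsum('hl,ltd->htd', router_weights, hidden_states)
    return np.einsum('htd,hdv->htv', mixed, head_W)
```

Because the backbone never changes, all trainable capacity lives in `router_weights` and `head_W`; the sparse rows are what makes the interface "more selective" than a dense weighted-hidden-state baseline.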
This study is deliberately focused.
- one frozen-backbone setting; the backbone is never unfrozen
- two scales: 410M for probes and screening diagnostics, 1B for the final comparison
- one final-budget comparison against a screening-selected dense finalist, not a broad sweep
- explicit multi-token prediction with held-out top-1 and speculative-draft acceptance as the headline metrics
It does not claim that sparse routing universally dominates dense routing, that frozen adapters beat full finetuning, or that the acceptance proxy is a throughput benchmark.
- Compiled PDF: `ProbeRoute_paper.pdf`
- LaTeX source: `paper/incoming_latex/main.tex`
- Figures and tables: `paper/figures/`, `paper/tables/`, `paper/appendix/`
- TMLR submission archive: `paper/layermix_mtp_paper_tmlr_source.zip`
- `src/` — implementation code: probes, sparse routing, MTP heads, training, evaluation
- `configs/` — experiment entrypoints (smoke, screening, final-budget runs)
- `schemas/` — result and manifest schemas
- `fixtures/` — offline smoke-test data
- `tests/` — unit and integration tests
- `paper/` — manuscript source, figures, tables, and compiled PDF
- `outputs/` — run outputs (gitignored; local only)
- `ARCHITECTURE.md` — system architecture and data flow
- `DATA_SPEC.md` — data formats and preparation contracts
- `ENVIRONMENT.md` — pinned runtime environment
- `RESULTS_SCHEMA.md` — structure of run outputs and registries
- `MODULE_SPECS/` — contributor-facing module contracts
- `CITATION.cff` — citation metadata
```bibtex
@software{uyar2026proberoute,
  author = {Uyar, Ali},
  title  = {ProbeRoute: Probes as Routing Priors for Frozen-Backbone Multi-Token Prediction},
  year   = {2026},
  doi    = {10.5281/zenodo.19022709},
  url    = {https://github.com/aliuyar1234/proberoute}
}
```