Ali Uyar, Independent Researcher
Paper title: ProbeRoute: Probes as Routing Priors for Frozen-Backbone Multi-Token Prediction
This repository accompanies a methods paper on adapting a frozen autoregressive language model for explicit multi-token prediction. The central question is sharper than a generic "better adapter" claim: does probe-initialized sparse routing improve frozen-backbone multi-token prediction relative to the strongest dense baseline selected under the same screening and final-rerun protocol? ProbeRoute probes every layer for future-token predictability, uses the resulting scores to initialize a sparse per-horizon router, and trains lightweight multi-token heads on top of the frozen backbone.
Frozen autoregressive language models are trained for next-token prediction, but their hidden states can still encode information about several future tokens. ProbeRoute tests whether that latent structure can be converted into a better explicit multi-token adapter without unfreezing the backbone. The method first runs a one-time offline probe stage across depth, then uses probe-derived top-5 scores to initialize a sparse top-m router over frozen hidden states. Under a stage-gated protocol with mandatory probes, screening baselines, a finalist-selection step, and a final 1B rerun, the resulting sparse adapter beats the strongest selected dense frozen-backbone baseline on the paper's two headline held-out metrics. On the final 1B comparison, ProbeRoute improves test top-1 (h2-4) from 0.1162 to 0.1172 and the speculative draft acceptance proxy from 1.1110 to 1.1188, while changing test NLL (h1-4) only from 5.3769 to 5.3780. Probe heatmaps reveal horizon-dependent depth structure at both 410M and 1B, the learned sparse router concentrates mass on the same depth bands, and the random-initialization ablation is weaker than probe-initialized sparse routing. By contrast, loss warmup and deeper far-horizon heads do not improve over the base sparse configuration at the tested budget. In this frozen-backbone setting, future-token probes are not merely descriptive diagnostics; they are a useful routing prior for explicit multi-token prediction.
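The probe-to-router handoff described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the top-m masking scheme, and the softmax temperature parameterization are all assumptions.

```python
import numpy as np

def init_router_from_probes(probe_scores, m=5, temperature=1.0):
    """Initialize per-horizon router weights from offline probe scores.

    probe_scores: (H, L) array, one row per prediction horizon, one
    column per frozen layer; entries are probe accuracies. Only the
    top-m layers per horizon are retained; the rest are masked out.
    Hypothetical sketch -- parameterization is an assumption.
    """
    probe_scores = np.asarray(probe_scores, dtype=np.float64)
    H, L = probe_scores.shape
    logits = np.full((H, L), -np.inf)
    for h in range(H):
        top = np.argsort(probe_scores[h])[-m:]      # top-m layers by probe score
        logits[h, top] = probe_scores[h, top] / temperature
    # Mixture weights: softmax over the m retained layers only
    # (masked layers get exactly zero weight).
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w  # (H, L), exactly m nonzero entries per row
```

Initializing from probe scores rather than uniformly means the router starts its training already concentrated on the depth bands the probes found predictive, which is the "routing prior" in the title.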
The final 1B comparison at the 50M-token budget — both finalists using probe-derived initialization — is the paper's central quantitative result:
| Model | Test top-1 (h2-4) | Test accept len. | Test NLL (h1-4) |
|---|---|---|---|
| Dense finalist | 0.1162 | 1.1110 | 5.3769 |
| ProbeRoute (ours) | 0.1172 | 1.1188 | 5.3780 |
| Delta (sparse − dense) | +0.00099 | +0.00781 | +0.00110 |
The raw deltas are small but directionally consistent: top-1 improves at every horizon, the speculative draft acceptance proxy improves, and aggregate NLL changes only minimally (about 0.02% relative). The practical point is stronger than the raw magnitude — the gain comes from a more selective routing interface, not from a heavier backbone adaptation recipe. The random-init ablation weakens the sparse model; loss warmup and deeper far-horizon heads do not improve over the base sparse configuration at the tested budget.
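For intuition on why the acceptance numbers sit just above 1: a common way to define an acceptance-length proxy for speculative drafting is the expected number of tokens accepted before the first mismatch between draft heads and the verifier. The sketch below uses that definition with per-horizon independence as a simplifying assumption; it is not necessarily the paper's exact metric.

```python
def expected_accept_length(match_probs):
    """Expected draft tokens accepted per step under greedy matching.

    match_probs[k] is the probability that the head for horizon k+2
    matches the verifier's token, treated as independent across
    horizons (an assumption; the paper's proxy may differ).
    Acceptance stops at the first mismatch, so the expectation is
    1 + p2 + p2*p3 + ... (the next token itself is always "accepted").
    """
    expected, run = 1.0, 1.0
    for p in match_probs:
        run *= p
        expected += run
    return expected
```

With per-horizon top-1 rates near 0.12, this formula lands near 1.13, the same regime as the 1.11-1.12 values in the table above, which is why small top-1 gains translate almost one-for-one into acceptance-length gains.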
- We show that future-token probe heatmaps reveal horizon-dependent depth structure at both 410M and 1B, with shorter horizons peaking later and farther horizons shifting earlier in depth.
- We turn those probe measurements into a concrete frozen-backbone adapter — a sparse top-m layer router initialized from probe top-5 scores — and evaluate it against screening-selected last-layer and dense weighted-hidden-state baselines.
- We find that the final 1B sparse run beats the strongest selected dense baseline on held-out future-token top-1 and speculative-draft acceptance metrics while relying on a more selective routing interface, and the random-init ablation is the one that meaningfully weakens the result.
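The routing interface named in the second contribution can be sketched as a per-horizon mixture of frozen hidden states feeding lightweight linear heads. All shapes, names, and the einsum formulation below are illustrative assumptions.

```python
import numpy as np

def routed_head_forward(hidden_states, router_weights, head_W):
    """Forward pass of a sparse-routed multi-token-prediction adapter.

    hidden_states:  (L, T, D) per-layer states from the frozen backbone.
    router_weights: (H, L) per-horizon mixture weights (rows sum to 1;
                    sparse, e.g. from a probe-initialized top-m router).
    head_W:         (H, D, V) one lightweight linear head per horizon.
    Returns (H, T, V) logits. Hypothetical sketch, not the repo's API.
    """
    # Mix frozen layers per horizon, then apply that horizon's head.
    mixed = np.einsum('hl,ltd->htd', router_weights, hidden_states)
    return np.einsum('htd,hdv->htv', mixed, head_W)
```

Because the backbone never changes, all trainable capacity lives in `router_weights` and `head_W`; the sparse rows are what makes the interface "more selective" than a dense weighted-hidden-state baseline.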
This study is deliberately focused.
- one frozen-backbone setting; the backbone is never unfrozen
- two scales: 410M for probes and screening diagnostics, 1B for the final comparison
- one final-budget comparison against a screening-selected dense finalist, not a broad sweep
- explicit multi-token prediction with held-out top-1 and speculative-draft acceptance as the headline metrics
It does not claim that sparse routing universally dominates dense routing, that frozen adapters beat full finetuning, or that the acceptance proxy is a throughput benchmark.
- Compiled PDF: `ProbeRoute_paper.pdf`
- LaTeX source: `paper/incoming_latex/main.tex`
- Figures and tables: `paper/figures/`, `paper/tables/`, `paper/appendix/`
- TMLR submission archive: `paper/layermix_mtp_paper_tmlr_source.zip`
- `src/` — implementation code: probes, sparse routing, MTP heads, training, evaluation
- `configs/` — experiment entrypoints (smoke, screening, final-budget runs)
- `schemas/` — result and manifest schemas
- `fixtures/` — offline smoke-test data
- `tests/` — unit and integration tests
- `paper/` — manuscript source, figures, tables, and compiled PDF
- `outputs/` — run outputs (gitignored; local only)
- `ARCHITECTURE.md` — system architecture and data flow
- `DATA_SPEC.md` — data formats and preparation contracts
- `ENVIRONMENT.md` — pinned runtime environment
- `RESULTS_SCHEMA.md` — structure of run outputs and registries
- `MODULE_SPECS/` — contributor-facing module contracts
- `CITATION.cff` — citation metadata
```bibtex
@software{uyar2026proberoute,
  author = {Uyar, Ali},
  title  = {ProbeRoute: Probes as Routing Priors for Frozen-Backbone Multi-Token Prediction},
  year   = {2026},
  doi    = {10.5281/zenodo.19022709},
  url    = {https://github.com/aliuyar1234/proberoute}
}
```