
Llama phase split quantisation support #320

Open: YuxuanHan0326 wants to merge 6 commits into main from yx/phase-split-quantisation-mainline

@YuxuanHan0326 (Collaborator)

PR Summary (vs main)

  1. Phase-split config for the Llama quantized path
  • Quant configs can be defined separately for prefill and decode.
  • The runtime phase is inferred on each decoder-layer forward pass and propagated via context, so modules run with the correct phase config.
  2. Decode behavior is explicitly controlled
  • decode_policy="fp_only" keeps decode in FP-compatible mode.
  • decode_policy="quantized" enables decode quantization using the decode sub-config.
  • A fail-fast guard prevents mixed decode policies across quantized Llama modules.
  3. Core quantized Llama modules are phase-aware
  • The attention, mlp, rms_norm, and MX linear modules now consume the runtime phase and the per-phase config at inference time.
  • Supports practical modes: prefill-only quant, decode-only quant, and different prefill/decode configs.
  4. MX linear decode bank path is integrated
  • LinearMXFP/LinearMXInt select decode-path banks based on phase/policy.
  • Decode bank refresh remains compatible with the GPTQ + module replacement flow.
  5. Compatibility and stability
  • Legacy flat quant configs are still accepted (internally normalized to the phase structure).
  • Rotate symbol exports were restored to preserve existing import/registry contracts.
  • Targeted split-phase tests pass, and real Llama+GPTQ chain checks produced readable outputs across all three phase combinations.
  6. Scope boundary (not changed in this PR)
  • Llama-focused only (no extension to other model families).
  • No runtime re-quantization feature, so this currently works only for small models.
  • No BFCL/evaluation pipeline changes.
  • Existing non-phase quant workflows are intentionally left unchanged.
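
For reviewers, the phase-split behavior described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: every function name (`normalize_quant_config`, `select_phase_config`, `check_uniform_decode_policy`, `phase_scope`, `infer_phase`) and the `w_bits` key are hypothetical. It also assumes that a legacy flat config is reused for both phases, that `decode_policy` defaults to `"fp_only"`, and that a single-token forward with a KV cache counts as decode; the PR may do any of these differently.

```python
import contextvars
from contextlib import contextmanager
from copy import deepcopy

PHASES = ("prefill", "decode")
DECODE_POLICIES = ("fp_only", "quantized")

# Hypothetical context variable carrying the current runtime phase.
_PHASE = contextvars.ContextVar("quant_phase", default="prefill")


@contextmanager
def phase_scope(phase):
    """Set the runtime phase for the duration of one forward call."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase!r}")
    token = _PHASE.set(phase)
    try:
        yield
    finally:
        _PHASE.reset(token)


def current_phase():
    return _PHASE.get()


def infer_phase(q_len, has_kv_cache):
    """Assumed heuristic: one new token on top of a KV cache means decode."""
    return "decode" if has_kv_cache and q_len == 1 else "prefill"


def normalize_quant_config(cfg):
    """Turn a flat (legacy) or phase-structured config into the phase structure."""
    policy = cfg.get("decode_policy", "fp_only")  # assumed default
    if policy not in DECODE_POLICIES:
        raise ValueError(f"unknown decode_policy: {policy!r}")
    if "prefill" in cfg or "decode" in cfg:
        # Phase-structured: a missing phase means "no quantization" there.
        prefill, decode = cfg.get("prefill"), cfg.get("decode")
    else:
        # Legacy flat config: assume the same settings apply to both phases.
        flat = {k: v for k, v in cfg.items() if k != "decode_policy"}
        prefill, decode = flat, flat
    return {
        "decode_policy": policy,
        "prefill": deepcopy(prefill),
        "decode": deepcopy(decode),
    }


def select_phase_config(cfg, phase):
    """Return the sub-config for this phase, or None to stay in floating point."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase!r}")
    if phase == "decode" and cfg["decode_policy"] == "fp_only":
        return None
    return cfg[phase]


def check_uniform_decode_policy(configs):
    """Fail fast if normalized module configs disagree on decode_policy."""
    policies = {c["decode_policy"] for c in configs}
    if len(policies) > 1:
        raise ValueError(f"mixed decode policies: {sorted(policies)}")
    return policies.pop() if policies else None
```

A module would then call `select_phase_config(cfg, current_phase())` inside its forward, with the decoder layer wrapping its own forward in `phase_scope(infer_phase(...))`.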
