Zhao Chenran, The Chinese University of Hong Kong, Shenzhen
This repository documents my ongoing research on how FP8 activation quantization changes training dynamics in a sparse large language model, with a focus on DeepSeek-MoE-2B under controlled checkpoint-based BF16/FP8 comparison.
The current March update is not centered on benchmark accuracy or downstream evaluation. Instead, it studies how reduced precision changes:
- loss-level alignment,
- gradient perturbation magnitude,
- relative error over training,
- module-dependent sensitivity,
- and the redistribution of perturbation across parameters.
This phase of the project is organized around four empirical questions:
- Module sensitivity: Do MoE and Attention respond differently to the same FP8 activation quantization strategy?
- Depth sensitivity: How different are single-layer quantization and continuous multi-layer quantization (layers 0–8)?
- Format sensitivity: Does E4M3 stay better aligned with BF16 than E5M2?
- Perturbation redistribution: If the final loss remains well aligned, does the perturbation still migrate across parameters and modules during training?
All current results are based on DeepSeek-MoE-2B.
- Model: DeepSeek-MoE-2B
- Training precision: BF16
- Optimizer: AdamW
- Hardware: 4 × A100 80GB
- Parallelism: DDP
- Dataset: C4 English
- Training horizon shown here: checkpoints from iter 0 to iter 10000
- Quantization target: activations only
- Weights: kept in BF16
- Optimizer state: kept in FP32
- Formats: FP8 E4M3, FP8 E5M2
- Locations: MoE final outputs, Attention final outputs
- Depth settings: layer-0 only, or continuous quantization across layers 0–8
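For context on the two formats, the sketch below simulates FP8 round-to-nearest quantization in pure Python. It is a simplification for illustration only: it rounds onto the E4M3/E5M2 mantissa grid but ignores saturation to each format's maximum value and NaN handling, whereas the experiments quantize activation tensors inside the training stack.

```python
import math

def quantize_fp8(x: float, exp_bits: int, man_bits: int) -> float:
    """Round x to the nearest value on the FP8 grid defined by the given
    exponent/mantissa widths (simplified: no overflow or NaN handling)."""
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    # frexp gives |x| = m * 2**e with m in [0.5, 1); the FP8 exponent is e - 1.
    _, e = math.frexp(abs(x))
    exp = max(e - 1, 1 - bias)       # clamp into the subnormal range
    step = 2.0 ** (exp - man_bits)   # grid spacing at this magnitude
    return round(x / step) * step

def quantize_e4m3(x: float) -> float:  # 4 exponent bits, 3 mantissa bits
    return quantize_fp8(x, exp_bits=4, man_bits=3)

def quantize_e5m2(x: float) -> float:  # 5 exponent bits, 2 mantissa bits
    return quantize_fp8(x, exp_bits=5, man_bits=2)
```

On a value like 0.1, E4M3 lands closer (0.1015625) than E5M2 (0.09375), matching the intuition that E4M3 trades dynamic range for mantissa precision.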
For each baseline checkpoint, BF16 and FP8 are evaluated on the same input batch with matched forward/backward comparison.
The main metrics are:
- Loss
- Gradient difference norm: ||grad_fp8 - grad_bf16||
- Relative error
- Weight norm (baseline reference)
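The two gradient-space metrics can be sketched as follows over flattened gradient vectors (a minimal numpy illustration; the actual pipeline works per parameter tensor):

```python
import numpy as np

def gradient_metrics(grad_fp8: np.ndarray, grad_bf16: np.ndarray):
    """Return (gradient difference norm, relative gradient error)
    for two flattened gradient vectors from the same checkpoint/batch."""
    diff_norm = float(np.linalg.norm(grad_fp8 - grad_bf16))
    rel_error = diff_norm / float(np.linalg.norm(grad_bf16))
    return diff_norm, rel_error
```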
For sparse MoE comparisons, the global gradient difference norm uses a union-based construction over parameter sets, with non-activated parameters zero-filled before the global norm is computed.
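A sketch of that union-based construction, assuming a hypothetical dict-of-arrays gradient representation keyed by parameter name:

```python
import numpy as np

def union_grad_diff_norm(grads_fp8: dict, grads_bf16: dict) -> float:
    """Global ||grad_fp8 - grad_bf16|| over the union of parameter names,
    zero-filling parameters that only one run activated."""
    total_sq = 0.0
    for name in set(grads_fp8) | set(grads_bf16):
        ref = grads_fp8.get(name, grads_bf16.get(name))
        g_a = grads_fp8.get(name, np.zeros_like(ref))  # zero-fill if missing
        g_b = grads_bf16.get(name, np.zeros_like(ref))
        total_sq += float(np.sum((g_a - g_b) ** 2))
    return total_sq ** 0.5
```

With disjoint activated experts, e.g. `{"expert0.w": [1, 2]}` versus `{"expert1.w": [2, 0]}`, every activated gradient contributes in full against a zero-filled counterpart, giving a global norm of 3.0 in this toy case.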
The March results suggest that Attention and MoE do not exhibit the same perturbation profile under FP8 activation quantization.
- Attention is closer to an early-strong / later-decaying pattern.
- MoE is more consistent with a bounded but persistent nonzero perturbation pattern.
This means the same low-precision rule should not be summarized as a single universal behavior across modules.
Within the current setup, E4M3 consistently appears more stable than E5M2.
A cautious reading of the current evidence is:
- E4M3: preserves stronger BF16 alignment while keeping perturbations bounded.
- E5M2: often still preserves convergence, but under a noisier optimization regime.
Depth sensitivity is visible, but the visibility depends on the module.
- On Attention, the difference between layer-0-only and layers 0–8 quantization is easier to observe.
- On MoE, single-layer and multi-layer curves can look close under the current global metric, but they are not identical. The gap depends on checkpoint and data window.
So the correct interpretation is "close but not equal", not "the same result" and not "later layers do nothing."
A recurring pattern in these experiments is that loss can remain strongly aligned even when gradient-space perturbation is clearly nonzero.
This is one of the main motivations for tracking:
- gradient difference norm,
- relative gradient error,
- and parameter-level perturbation ranking.
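A toy one-parameter example makes this decoupling concrete. In the sketch below (assumed setup: activation a = w*x, squared loss, a coarse round-to-grid step standing in for FP8, gradient taken straight-through the quantization), the two paths produce essentially the same loss value but opposite-sign gradients:

```python
def forward_backward(w, x, target, quant=lambda a: a):
    """One evaluation of a 1-parameter model: activation a = w*x,
    squared loss; `quant` optionally quantizes the activation, and the
    gradient is taken straight-through the quantization step."""
    a = quant(w * x)
    loss = (a - target) ** 2
    grad_w = 2.0 * (a - target) * x
    return loss, grad_w

# Toy stand-in for FP8: round the activation onto a coarse uniform grid.
round_to_grid = lambda a, step=0.25: round(a / step) * step

loss_ref, grad_ref = forward_backward(0.5, 1.3, 0.7)
loss_q, grad_q = forward_backward(0.5, 1.3, 0.7, quant=round_to_grid)
```

Here `loss_ref` and `loss_q` are both about 0.0025, while `grad_ref` is about -0.13 and `grad_q` about +0.13: the scalar loss alone would call this run perfectly aligned.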
Top-ranked grad_diff_norm parameters do not remain permanently localized at the directly quantized output.
Instead, the dominant contributors shift during training, which is consistent with a cross-module redistribution view of precision-induced perturbation.
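The parameter-level ranking behind this observation can be sketched as below, again assuming a hypothetical dict-of-arrays gradient representation keyed by parameter name:

```python
import numpy as np

def top_perturbed_params(grads_fp8: dict, grads_bf16: dict, k: int = 3):
    """Rank parameters by per-parameter grad_diff_norm, largest first,
    over the names present in both runs."""
    scores = {
        name: float(np.linalg.norm(grads_fp8[name] - grads_bf16[name]))
        for name in grads_fp8.keys() & grads_bf16.keys()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Tracking this top-k list checkpoint by checkpoint is what reveals whether the dominant contributors stay pinned to the quantized output or migrate across modules.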
This repository therefore treats low-precision effects as a training-dynamics problem, not just a final-loss problem.
Interpretation.
This figure summarizes the most representative multi-layer MoE E4M3 setting in one panel.
The loss remains close to BF16, while the perturbation in gradient space stays nonzero and structurally visible across training.
Interpretation.
Single-layer MoE quantization does not materially break loss alignment in the current setting.
This makes it a useful low-perturbation reference when compared with deeper quantization.
Interpretation.
The perturbation is clearly nonzero, but it remains bounded rather than explosively unstable.
This is consistent with the broader March observation that MoE perturbations can persist even when the scalar loss stays aligned.
Interpretation.
Loss relative error remains small, while gradient relative error stays visibly nonzero.
Again, the key message is that loss alone under-describes the precision effect.
Interpretation.
This figure illustrates a noisier format/module combination.
Even when the loss is still convergent and close to BF16 in absolute scale, the deviation profile is visibly less well aligned than in the more stable E4M3 settings.
Interpretation.
The relative-error view makes the format sensitivity more explicit.
Compared with E4M3-based settings in the March study, E5M2 is more consistent with a noisier and less BF16-aligned optimization process.
Interpretation.
This figure provides the BF16 reference trajectory for parameter-scale evolution over training.
It is useful as a baseline context when interpreting perturbation experiments, even though the checkpoint-based quantized evaluations here do not themselves update weights.
This March update intentionally shifts the repository away from an earlier emphasis on gradient-angle narratives and toward a more stable summary based on:
- loss-level alignment,
- gradient difference norm,
- relative error,
- module sensitivity,
- depth sensitivity,
- format sensitivity,
- and parameter-level perturbation redistribution.
The current wording is deliberately more conservative:
the repository does not claim that FP8 directly breaks convergence in this setup.
Instead, it shows that FP8 can preserve convergence while still changing optimization behavior in structured, measurable ways.
The current repository should still be read with the following limits in mind:
- Results are based on fixed checkpoints and fixed evaluation windows, not full end-to-end retraining under every quantized setting.
- Some conclusions are metric-dependent, especially for sparse MoE under union-based global gradient norms.
- Parameter-level perturbation migration is currently an empirical pattern, not a formal causal proof.
- The present public repository does not include internal infrastructure, full training code, or unreleased project components.
Planned next steps include:
- adding more checkpoint windows and data offsets,
- extending parameter-level analysis into layer/module-level summaries,
- testing whether these perturbations correlate with longer-horizon quality or generalization changes,
- and expanding the study of module-dependent precision sensitivity in sparse architectures.
This repository focuses on empirical training-dynamics evidence rather than polished benchmark reporting.
The goal is to make precision-induced optimization effects visible, measurable, and interpretable in a controlled systems setting.