Smooth-SwiGLU: Stable FP8 Activation for LLMs
- Smooth-SwiGLU is a modified SwiGLU activation that uses per-channel scaling to restrict dynamic range, ensuring stable FP8 quantized training for large language models.
- It mitigates outlier amplification by rescaling the linear branch before quantization, thereby maintaining convergence and performance over ultra-long training runs.
- Experiments on Llama2-7B demonstrate up to a 34% throughput boost with comparable accuracy to BF16, highlighting its efficiency and practical impact.
Smooth-SwiGLU is a per-channel-scaled modification of the SwiGLU activation, introduced to stabilize FP8 quantized training of LLMs over extended durations and trillion-token-scale datasets. It systematically restricts activation dynamic range prior to quantization, ensuring convergence and unlocking the full computational speed advantage of FP8, with no loss in function capacity or final accuracy compared to higher-precision baselines (Fishman et al., 2024).
1. Mathematical Formulation
The standard SwiGLU activation for a channel, parameterized by weight vectors $w_1, w_2$ and input $x$, is defined as

$$\mathrm{SwiGLU}(x) = (w_1^\top x) \cdot \mathrm{Swish}(w_2^\top x), \qquad \mathrm{Swish}(z) = z\,\sigma(z),$$

where $\sigma$ denotes the sigmoid nonlinearity.
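In NumPy terms, the per-channel computation reads as follows (a minimal sketch; `w1` and `w2` are hypothetical weight vectors for a single channel):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    # Swish (SiLU): z * sigmoid(z)
    return z * sigmoid(z)

def swiglu(x, w1, w2):
    # Linear branch (w1^T x) gated by the Swish of a second projection.
    return (w1 @ x) * swish(w2 @ x)
```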
Smooth-SwiGLU modifies this computation in the quantized setting, applying a per-channel scaling factor $s_i > 0$ for channel $i$, quantizing both weights and inputs, and rewriting the activation as

$$\mathrm{Smooth\text{-}SwiGLU}(x)_i = s_i^{-1}\, Q\!\left(s_i\,(\hat{w}_1^\top x)_i\right) \cdot Q\!\left(\mathrm{Swish}\!\left((\hat{w}_2^\top x)_i\right)\right),$$

where $Q(\cdot)$ is the FP8 quantization operator and $\hat{w}$ denotes quantized weights. The three critical steps are:
- Scale the linear branch output for each channel by $s_i$ before quantization,
- Quantize the scaled product in FP8,
- Divide by $s_i$ post-quantization to restore the original scaling.

This mechanism bounds the pre-quantization value $|s_i (\hat{w}_1^\top x)_i|$ for each channel, directly controlling the quantizer's dynamic range.
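The three steps can be sketched with a stand-in quantizer. Note that `quantize_fp8` below only models the saturation of E4M3 (±448) and ignores rounding; it is an illustration, not a real FP8 cast:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value

def quantize_fp8(v):
    # Stand-in for a real FP8 cast: values beyond the representable
    # range saturate; rounding error is ignored in this sketch.
    return np.clip(v, -FP8_E4M3_MAX, FP8_E4M3_MAX)

def smooth_swiglu(u, v, s):
    # u: per-channel linear-branch outputs (w1^T x)_i
    # v: per-channel gated-branch outputs Swish((w2^T x)_i)
    # s: per-channel scales chosen so |s_i * u_i| stays in range
    scaled = quantize_fp8(s * u)   # scale, then quantize
    gated = quantize_fp8(v)        # gate branch quantized as-is
    return (scaled * gated) / s    # divide out the scale afterwards
```

With an outlier channel (say $u_i = 10^4$), the naive path saturates at 448 while the smoothed path recovers the true value, which is exactly the dynamic-range control described above.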
2. Theoretical Motivation and Outlier Control
The necessity for Smooth-SwiGLU arises from an analytic phenomenon in prolonged training: under typical regularization and data statistics, the two SwiGLU weight vectors $w_1$ and $w_2$ can spontaneously align or anti-align at stationary points for large-argument activations (where $\sigma(w_2^\top x) \approx 1$). This alignment produces

$$\mathrm{SwiGLU}(x) \approx (w_1^\top x)(w_2^\top x) \propto \|x\|^2,$$

sharply amplifying the activations as the input norm grows. Over hundreds of billions of tokens, such amplification manifests as rare but extremely large outliers in activation distributions for certain channels. These spikes break the assumptions of delayed scaling and FP8 quantizer dynamic range, leading to divergent, unstable training only apparent in ultra-long runs (≫200B tokens).
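A quick numeric check of this quadratic-growth claim, assuming fully aligned weights so that both branches see the same pre-activation $t$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swiglu_aligned(t):
    # SwiGLU when both weight vectors align: each branch sees the
    # same pre-activation t, so the output is t * t * sigmoid(t).
    return t * t * sigmoid(t)

# For large t, sigmoid(t) -> 1 and the output grows like t**2:
ratios = [swiglu_aligned(t) / t**2 for t in (1.0, 10.0, 100.0)]
```

The ratio to $t^2$ approaches 1 as $t$ grows, confirming that an aligned channel behaves quadratically in the input norm.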
Smooth-SwiGLU preempts this instability by rescaling the linear branch: scaling the activation before quantization ensures no value exceeds the FP8 representable range per channel, with $s_i$ selected to match the empirical maximum activation observed in each channel over the mini-batch or a running window. This step reinstates statistical consistency in the observed quantizer input distribution, thus restoring stable FP8 training throughout ultra-long training horizons (Fishman et al., 2024).
3. Experimental Outcomes and Training Regimes
The behavior and efficacy of Smooth-SwiGLU are established via extensive training on Llama2-7B architectures (decoder-only Transformer, RMSNorm, SwiGLU activations, rotary embeddings) using the RedPajama corpus up to 2T tokens. The setup employs FP8 for activations (E4M3) and gradients (E5M2), delayed per-tensor scaling, and runs on 256 Intel Gaudi2 accelerators for large-scale experiments.
Key empirical findings include:
- Standard FP8-trained SwiGLU diverges after ~200B tokens, as evidenced in loss curves and activation distribution visualizations (Fig. 1a, 1b, 4).
- Smooth-SwiGLU maintains stability throughout, closely matching BF16 in both training and loss metrics (Fig. 6).
- Outlier formation in activation histograms appears exclusively in channels where $w_1$ and $w_2$ become highly correlated; these spikes are suppressed by Smooth-SwiGLU's per-channel scaling.
- No significant underflow/overflow statistics are reported, but loss and activation histograms serve as effective proxies for numerics.
4. Performance and Model Quality Implications
Smooth-SwiGLU delivers up to a ~34% throughput improvement versus BF16, with negligible impact on accuracy:
| Setting | Throughput (samples/s, 8 Gaudi2) | Lambada Acc. | HellaSwag | Perplexity (Wiki) |
|---|---|---|---|---|
| BF16 | 12.65 | 61.98 | 68.30 | 5.59 |
| FP8+Smooth-SwiGLU | 16.89 (+33.5%) | 61.73 | 68.03 | 5.56 |
| FP8+disable SwiGLU quant | 16.07 (+27.0%) | – | – | – |
| Standard SwiGLU (FP8, unstable) | 17.34 (+37.1%) | diverges | diverges | diverges |
Final model quality (Table 3) matches the higher-precision baseline across zero-shot tasks, with accuracy and perplexity differences within statistical noise. The cost of per-channel scaling is minimal and fully parallelizable, negligible compared to the matrix-multiplication work. At inference, the scalers are fused into adjacent weight matrices, producing no runtime impact.
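The fusion claim can be verified numerically. In the sketch below (hypothetical shapes and names; quantization elided), folding $s_i$ into the rows of the first projection and $s_i^{-1}$ into the columns of the following one reproduces the unfused computation exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 3, 2
x = rng.normal(size=d_in)
w1 = rng.normal(size=(d_hid, d_in))   # linear branch
w2 = rng.normal(size=(d_hid, d_in))   # gate branch
w3 = rng.normal(size=(d_out, d_hid))  # following projection
s = np.array([0.5, 2.0, 4.0])         # hypothetical per-channel scales

def swish(z):
    return z / (1.0 + np.exp(-z))

# Unfused: scale, (quantization elided), unscale, then project.
h_ref = w3 @ (((s * (w1 @ x)) / s) * swish(w2 @ x))

# Fused: fold s into w1's rows and 1/s into w3's columns.
w1_fused = w1 * s[:, None]
w3_fused = w3 * (1.0 / s)[None, :]
h_fused = w3_fused @ ((w1_fused @ x) * swish(w2 @ x))
```

Since the scale and its inverse cancel channel-by-channel through the gating product, the fused weights leave no extra multiply in the deployed graph.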
5. Practical Implementation
Smooth-SwiGLU is architected as a drop-in for the SwiGLU+linear block in FP8 pipelines. The principal implementation steps are:
- Per-channel scale computation: For hidden dimension $d$ sliced into channels $i = 1, \dots, d$, compute each $s_i$ from the maximum absolute activation observed over a mini-batch (optionally using a running max for smoothing), e.g. as the reciprocal of that maximum so the scaled branch fills the FP8 range.
- Forward pass integration:
  - Compute the branch outputs $u_i = (\hat{w}_1^\top x)_i$ and $v_i = \mathrm{Swish}((\hat{w}_2^\top x)_i)$.
  - Pre-quantization scaling: $\tilde{u}_i = s_i u_i$.
  - Form the product of the quantized branches, $Q(\tilde{u}_i) \cdot Q(v_i)$.
  - Post-quantization rescale: multiply by $s_i^{-1}$ to restore the original magnitude.
- Autograd considerations: Treat the division by $s_i$ as a fixed per-channel scaling of gradients (no gradient flows into $s_i$ itself).
- Optimizer compatibility: Smooth-SwiGLU integrates directly with the recently demonstrated FP8-quantized Adam (first moment in E4M3, second in E5M2).
- Edge handling: Clamp $s_i$ to a small $\epsilon > 0$ if a channel's activations are all zero. Apply momentum updates to $s_i$ to suppress jitter.
- Inference fusion: At deployment, $s_i$ and $s_i^{-1}$ are folded into the adjacent linear layer weights, removing any extra compute.
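The scale bookkeeping described above (mini-batch max, running-max smoothing, small-$\epsilon$ clamp) might look like the following sketch; `ChannelScaler`, `beta`, and the $s_i = \mathrm{FP8\_MAX}/\max$ convention are illustrative assumptions, not taken from the reference implementation:

```python
import numpy as np

class ChannelScaler:
    """Tracks per-channel scales s_i from observed activation maxima.

    beta is a decay for the running max; eps floors the maximum so a
    dead (all-zero) channel never causes division by zero.
    """
    FP8_MAX = 448.0  # largest finite E4M3 value

    def __init__(self, n_channels, beta=0.9, eps=1e-6):
        self.running_max = np.zeros(n_channels)
        self.beta = beta
        self.eps = eps

    def update(self, u):
        # u: (batch, channels) linear-branch activations for one step.
        batch_max = np.abs(u).max(axis=0)
        # Decayed running max: old peaks fade, new peaks register at once.
        self.running_max = np.maximum(self.beta * self.running_max,
                                      batch_max)

    def scales(self):
        # s_i maps the observed per-channel max onto the FP8 range.
        return self.FP8_MAX / np.maximum(self.running_max, self.eps)
```

A `scales()` call after each `update` yields the $s_i$ applied before quantization; the decayed maximum plays the role of the momentum update that suppresses jitter.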
6. Significance in Large-Scale FP8 Training
Smooth-SwiGLU represents a minimal, analytically motivated stabilization of activation quantization critical for reliable large-scale LLM training with FP8 arithmetic. Its design both counteracts outlier amplification characteristic of SwiGLU in long training regimes and preserves the functional equivalence, throughput, and accuracy of FP8 LLM training. A reference implementation is maintained at https://github.com/Anonymous1252022/Megatron-DeepSpeed. The approach’s seamless integration with FP8 Adam optimizers and negligible runtime/inference overhead position it as an essential technique for efficient, stable billion-parameter model training at scale (Fishman et al., 2024).