Smooth-SwiGLU: Stable FP8 Activation for LLMs
- Smooth-SwiGLU is a modified SwiGLU activation that uses per-channel scaling to restrict dynamic range, ensuring stable FP8 quantized training for large language models.
- It mitigates outlier amplification by rescaling the linear branch before quantization, thereby maintaining convergence and performance over ultra-long training runs.
- Experiments on Llama2-7B demonstrate up to a 34% throughput boost with comparable accuracy to BF16, highlighting its efficiency and practical impact.
Smooth-SwiGLU is a per-channel-scaled modification of the SwiGLU activation, introduced to stabilize FP8 quantized training of LLMs over extended durations and trillion-token-scale datasets. It systematically restricts activation dynamic range prior to quantization, ensuring convergence and unlocking the full computational speed advantage of FP8, with no loss in function capacity or final accuracy compared to higher-precision baselines (Fishman et al., 2024).
1. Mathematical Formulation
The standard SwiGLU activation for a channel, parameterized by weight vectors $w_1, w_2$ and input $x$, is defined as

$$\mathrm{SwiGLU}(x) = (w_1^\top x) \cdot \mathrm{Swish}(w_2^\top x), \qquad \mathrm{Swish}(z) = z\,\sigma(z),$$

where $\sigma$ denotes the sigmoid nonlinearity.
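In NumPy terms, the per-channel computation reads as follows (a minimal sketch; `w1` and `w2` are hypothetical weight vectors for a single channel):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    # Swish (SiLU): z * sigmoid(z)
    return z * sigmoid(z)

def swiglu(x, w1, w2):
    # Linear branch (w1^T x) gated by the Swish of a second projection.
    return (w1 @ x) * swish(w2 @ x)
```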
Smooth-SwiGLU modifies this computation in the quantized setting, applying a per-channel scaling factor $s_i > 0$ for channel $i$, quantizing both weights and inputs, and rewriting the activation as

$$\mathrm{Smooth\text{-}SwiGLU}(x)_i = s_i^{-1}\, Q\!\left(s_i\,(\hat{w}_1^\top x)_i\right) \cdot Q\!\left(\mathrm{Swish}\!\left((\hat{w}_2^\top x)_i\right)\right),$$

where $Q(\cdot)$ is the FP8 quantization operator and $\hat{w}$ denotes quantized weights. The three critical steps are:
- Scale the linear branch output for each channel by $s_i$ before quantization,
- Quantize the scaled product in FP8,
- Divide by $s_i$ post-quantization to restore the original scaling.

This mechanism bounds the pre-quantization value $|s_i (\hat{w}_1^\top x)_i|$ for each channel, directly controlling the quantizer's dynamic range.
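The three steps can be sketched with a stand-in quantizer. Note that `quantize_fp8` below only models the saturation of E4M3 (±448) and ignores rounding; it is an illustration, not a real FP8 cast:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value

def quantize_fp8(v):
    # Stand-in for a real FP8 cast: values beyond the representable
    # range saturate; rounding error is ignored in this sketch.
    return np.clip(v, -FP8_E4M3_MAX, FP8_E4M3_MAX)

def smooth_swiglu(u, v, s):
    # u: per-channel linear-branch outputs (w1^T x)_i
    # v: per-channel gated-branch outputs Swish((w2^T x)_i)
    # s: per-channel scales chosen so |s_i * u_i| stays in range
    scaled = quantize_fp8(s * u)   # scale, then quantize
    gated = quantize_fp8(v)        # gate branch quantized as-is
    return (scaled * gated) / s    # divide out the scale afterwards
```

With an outlier channel (say $u_i = 10^4$), the naive path saturates at 448 while the smoothed path recovers the true value, which is exactly the dynamic-range control described above.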
2. Theoretical Motivation and Outlier Control
The necessity for Smooth-SwiGLU arises from an analytic phenomenon in prolonged training: under typical regularization and data statistics, the two SwiGLU weight vectors $w_1$ and $w_2$ can spontaneously align or anti-align at stationary points for large-argument activations (where $\sigma(w_2^\top x) \approx 1$). This alignment produces

$$\mathrm{SwiGLU}(x) \approx (w_1^\top x)(w_2^\top x) \propto \|x\|^2,$$

sharply amplifying the activations as the input norm grows. Over hundreds of billions of tokens, such amplification manifests as rare but extremely large outliers in activation distributions for certain channels. These spikes break the assumptions of delayed scaling and FP8 quantizer dynamic range, leading to divergent, unstable training only apparent in ultra-long runs (≫200B tokens).
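A quick numeric check of this quadratic-growth claim, assuming fully aligned weights so that both branches see the same pre-activation $t$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swiglu_aligned(t):
    # SwiGLU when both weight vectors align: each branch sees the
    # same pre-activation t, so the output is t * t * sigmoid(t).
    return t * t * sigmoid(t)

# For large t, sigmoid(t) -> 1 and the output grows like t**2:
ratios = [swiglu_aligned(t) / t**2 for t in (1.0, 10.0, 100.0)]
```

The ratio to $t^2$ approaches 1 as $t$ grows, confirming that an aligned channel behaves quadratically in the input norm.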
Smooth-SwiGLU preempts this instability by rescaling the linear branch: scaling the activation before quantization ensures no value exceeds the FP8 representable range per channel, with $s_i$ selected to match the empirical maximum activation observed in each channel over the mini-batch or a running window. This step reinstates statistical consistency in the observed quantizer input distribution, thus restoring stable FP8 training throughout ultra-long training horizons (Fishman et al., 2024).
3. Experimental Outcomes and Training Regimes
The behavior and efficacy of Smooth-SwiGLU are established via extensive training on Llama2-7B architectures (decoder-only Transformer, RMSNorm, SwiGLU activations, rotary embeddings) using the RedPajama corpus up to 2T tokens. The setup employs FP8 for activations (E4M3) and gradients (E5M2), delayed per-tensor scaling, and runs on 256 Intel Gaudi2 accelerators for large-scale experiments.
Key empirical findings include:
- Standard FP8-trained SwiGLU diverges after ~200B tokens, as evidenced in loss curves and activation distribution visualizations (Fig. 1a, 1b, 4).
- Smooth-SwiGLU maintains stability throughout, closely matching BF16 in both training and loss metrics (Fig. 6).
- Outlier formation in activation histograms appears exclusively in channels where $w_1$ and $w_2$ become highly correlated; these spikes are suppressed by Smooth-SwiGLU's per-channel scaling.
- No significant underflow/overflow statistics are reported, but loss and activation histograms serve as effective proxies for numerics.
4. Performance and Model Quality Implications
Smooth-SwiGLU delivers up to a ~34% throughput improvement versus BF16, with negligible impact on accuracy:
| Setting | Throughput (samples/s, 8 Gaudi2) | Lambada Acc. | HellaSwag | Perplexity (Wiki) |
|---|---|---|---|---|
| BF16 | 12.65 | 61.98 | 68.30 | 5.59 |
| FP8+Smooth-SwiGLU | 16.89 (+33.5%) | 61.73 | 68.03 | 5.56 |
| FP8+disable SwiGLU quant | 16.07 (+27.0%) | – | – | – |
| Standard SwiGLU (FP8, unstable) | 17.34 (+37.1%) | diverges | diverges | diverges |
Final model quality (Table 3) matches the higher-precision baseline across zero-shot tasks, with accuracy and perplexity differences within statistical noise. The cost of per-channel scaling is minimal and fully parallelizable, negligible compared to the matrix-multiplication work. At inference, the scalers are fused into adjacent weight matrices, producing no runtime impact.
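The fusion claim can be verified numerically. In the sketch below (hypothetical shapes and names; quantization elided), folding $s_i$ into the rows of the first projection and $s_i^{-1}$ into the columns of the following one reproduces the unfused computation exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 3, 2
x = rng.normal(size=d_in)
w1 = rng.normal(size=(d_hid, d_in))   # linear branch
w2 = rng.normal(size=(d_hid, d_in))   # gate branch
w3 = rng.normal(size=(d_out, d_hid))  # following projection
s = np.array([0.5, 2.0, 4.0])         # hypothetical per-channel scales

def swish(z):
    return z / (1.0 + np.exp(-z))

# Unfused: scale, (quantization elided), unscale, then project.
h_ref = w3 @ (((s * (w1 @ x)) / s) * swish(w2 @ x))

# Fused: fold s into w1's rows and 1/s into w3's columns.
w1_fused = w1 * s[:, None]
w3_fused = w3 * (1.0 / s)[None, :]
h_fused = w3_fused @ ((w1_fused @ x) * swish(w2 @ x))
```

Since the scale and its inverse cancel channel-by-channel through the gating product, the fused weights leave no extra multiply in the deployed graph.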
5. Practical Implementation
Smooth-SwiGLU is architected as a drop-in for the SwiGLU+linear block in FP8 pipelines. The principal implementation steps are:
- Per-channel scale computation: For hidden dimension $d$ sliced into channels $i = 1, \dots, d$, compute each $s_i$ from the maximum absolute activation observed over a mini-batch (optionally using a running max for smoothing), e.g. as the reciprocal of that maximum so the scaled branch fills the FP8 range.
- Forward pass integration:
  - Compute the branch outputs $u_i = (\hat{w}_1^\top x)_i$ and $v_i = \mathrm{Swish}((\hat{w}_2^\top x)_i)$.
  - Pre-quantization scaling: $\tilde{u}_i = s_i u_i$.
  - Form the product of the quantized branches, $Q(\tilde{u}_i) \cdot Q(v_i)$.
  - Post-quantization rescale: multiply by $s_i^{-1}$ to restore the original magnitude.
- Autograd considerations: Treat the division by $s_i$ as a fixed per-channel scaling of gradients (no gradient flows into $s_i$ itself).
- Optimizer compatibility: Smooth-SwiGLU integrates directly with the recently demonstrated FP8-quantized Adam (first moment in E4M3, second in E5M2).
- Edge handling: Clamp $s_i$ to a small $\epsilon > 0$ if a channel's activations are all zero. Apply momentum updates to $s_i$ to suppress jitter.
- Inference fusion: At deployment, $s_i$ and $s_i^{-1}$ are folded into the adjacent linear layer weights, removing any extra compute.
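The scale bookkeeping described above (mini-batch max, running-max smoothing, small-$\epsilon$ clamp) might look like the following sketch; `ChannelScaler`, `beta`, and the $s_i = \mathrm{FP8\_MAX}/\max$ convention are illustrative assumptions, not taken from the reference implementation:

```python
import numpy as np

class ChannelScaler:
    """Tracks per-channel scales s_i from observed activation maxima.

    beta is a decay for the running max; eps floors the maximum so a
    dead (all-zero) channel never causes division by zero.
    """
    FP8_MAX = 448.0  # largest finite E4M3 value

    def __init__(self, n_channels, beta=0.9, eps=1e-6):
        self.running_max = np.zeros(n_channels)
        self.beta = beta
        self.eps = eps

    def update(self, u):
        # u: (batch, channels) linear-branch activations for one step.
        batch_max = np.abs(u).max(axis=0)
        # Decayed running max: old peaks fade, new peaks register at once.
        self.running_max = np.maximum(self.beta * self.running_max,
                                      batch_max)

    def scales(self):
        # s_i maps the observed per-channel max onto the FP8 range.
        return self.FP8_MAX / np.maximum(self.running_max, self.eps)
```

A `scales()` call after each `update` yields the $s_i$ applied before quantization; the decayed maximum plays the role of the momentum update that suppresses jitter.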
6. Significance in Large-Scale FP8 Training
Smooth-SwiGLU represents a minimal, analytically motivated stabilization of activation quantization critical for reliable large-scale LLM training with FP8 arithmetic. Its design both counteracts outlier amplification characteristic of SwiGLU in long training regimes and preserves the functional equivalence, throughput, and accuracy of FP8 LLM training. A reference implementation is maintained at https://github.com/Anonymous1252022/Megatron-DeepSpeed. The approach’s seamless integration with FP8 Adam optimizers and negligible runtime/inference overhead position it as an essential technique for efficient, stable billion-parameter model training at scale (Fishman et al., 2024).