
Smooth-SwiGLU: Stable FP8 Activation for LLMs

Updated 15 December 2025
  • Smooth-SwiGLU is a modified SwiGLU activation that uses per-channel scaling to restrict dynamic range, ensuring stable FP8 quantized training for large language models.
  • It mitigates outlier amplification by clipping the quadratic branch, thereby maintaining convergence and performance over ultra-long training runs.
  • Experiments on Llama2-7B demonstrate up to a 34% throughput boost with comparable accuracy to BF16, highlighting its efficiency and practical impact.

Smooth-SwiGLU is a per-channel, affine-scaled modification of the SwiGLU activation, introduced to stabilize FP8 quantized training of LLMs over extended durations and trillion-token-scale datasets. It systematically restricts activation dynamic range prior to quantization, ensuring convergence and unlocking the full computational speed advantage of FP8, with no loss in function capacity or final accuracy compared to higher-precision baselines (Fishman et al., 2024).

1. Mathematical Formulation

The standard SwiGLU activation for a channel parameterized by weight vectors $w_1, w_2 \in \mathbb{R}^d$ and input $x \in \mathbb{R}^d$ is defined as:

$$\mathrm{SwiGLU}_{w_1, w_2}(x) = (x^\top w_1) \cdot \mathrm{Swish}(x^\top w_2) = (x^\top w_1)(x^\top w_2)\,\sigma(x^\top w_2),$$

where $\sigma(z) = 1 / (1 + e^{-z})$ denotes the sigmoid nonlinearity.
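The definition above can be written directly as a minimal sketch for a single channel (plain Python, illustrative names):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def swish(z: float) -> float:
    """Swish(z) = z * sigma(z)."""
    return z * sigmoid(z)

def swiglu(x, w1, w2) -> float:
    """SwiGLU for one channel: (x^T w1) * Swish(x^T w2)."""
    u = sum(a * b for a, b in zip(x, w1))  # linear branch x^T w1
    v = sum(a * b for a, b in zip(x, w2))  # gated branch x^T w2
    return u * swish(v)
```

Note the product structure: when $w_1$ and $w_2$ align, both dot products grow together, which is the quadratic behavior analyzed in Section 2.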

Smooth-SwiGLU modifies this computation in the quantized setting, applying a per-channel scaling $s_i$ for channel $i$, quantizing both weights and inputs, and rewriting the activation as

$$\mathrm{Smooth\text{-}SwiGLU}_{\hat w_{1,i}, \hat w_{2,i}}(x) = s_{i}^{-1}\, Q\big( s_{i}\, (\hat w_{1,i}^\top Q(x))\, \mathrm{Swish}(\hat w_{2,i}^\top Q(x))\big),$$

where $Q$ is the FP8 quantization operator and $\hat w$ denotes quantized weights. The three critical steps are:

  • Scale the linear-branch output for each channel by $s_i$ before quantization,
  • Quantize the scaled product in FP8,
  • Divide by $s_i$ post-quantization to restore the original scaling.

This mechanism bounds the pre-quantization value for each channel, directly controlling the quantizer’s dynamic range.
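The scale-quantize-rescale sandwich can be illustrated with a toy quantizer (a saturating clamp standing in for real E4M3 rounding; a power-of-two scale keeps the arithmetic exact in this sketch):

```python
def fake_quant_e4m3(v: float, max_repr: float = 448.0) -> float:
    """Toy stand-in for FP8 E4M3 quantization: saturate to the
    representable range (real E4M3 also rounds the mantissa)."""
    return max(-max_repr, min(max_repr, v))

def scale_quant_rescale(p: float, s: float) -> float:
    """The three steps per channel: scale by s, quantize, divide by s."""
    return fake_quant_e4m3(s * p) / s
```

Without scaling, an outlier such as 2000 saturates to 448; with a scale chosen so the scaled value fits the representable range, the quantize-rescale round trip preserves it.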

2. Theoretical Motivation and Outlier Control

The necessity for Smooth-SwiGLU arises from an analytic phenomenon in prolonged training: under typical $\ell_2$ regularization and data statistics, the two SwiGLU weight vectors $w_1$ and $w_2$ can spontaneously align or anti-align at stationary points for large-argument activations (where $\sigma'(w_2^\top x_n) \to 0$). This alignment produces

$$\mathrm{SwiGLU}(c x) \approx (c\, x^\top w)^2 \propto c^2,$$

sharply amplifying the activations as input norm grows. Over hundreds of billions of tokens, such amplification manifests as rare but extremely large outliers in activation distributions for certain channels. These spikes break the assumptions of delayed scaling and FP8 quantizer dynamic range, leading to divergent, unstable training only apparent in ultra-long runs (≫200B tokens).

Smooth-SwiGLU preempts this instability by clipping the quadratic branch: scaling the activation before quantization ensures no value exceeds $O(s_i)$ per channel, with $s_i$ selected from the empirical maximum activation observed in each channel over the mini-batch or a running window. This step reinstates statistical consistency in the quantizer's input distribution, restoring stable FP8 training throughout ultra-long training horizons (Fishman et al., 2024).
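One plausible recipe for the per-channel scales is sketched below (an assumption for illustration, not necessarily the paper's exact formula): choose $s_i$ so the largest observed activation in channel $i$ maps to the top of the FP8 E4M3 range.

```python
def per_channel_scales(batch_acts, fp8_max=448.0, eps=1e-12):
    """Hypothetical scale selection: s_i = fp8_max / max|activation|
    per channel, so the scaled product fills the FP8 range without
    saturating; eps guards all-zero channels."""
    scales = []
    for channel in batch_acts:  # batch_acts: one list of values per channel
        m = max((abs(a) for a in channel), default=0.0)
        scales.append(fp8_max / max(m, eps))
    return scales
```

In practice a running maximum over recent steps (as the text suggests) would smooth out batch-to-batch jitter in these statistics.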

3. Experimental Outcomes and Training Regimes

The behavior and efficacy of Smooth-SwiGLU are established via extensive training on Llama2-7B architectures (decoder-only Transformer, RMSNorm, SwiGLU activations, rotary embeddings) using the RedPajama corpus up to 2T tokens. The setup employs FP8 for activations (E4M3) and gradients (E5M2), delayed per-tensor scaling, and runs on 256 Intel Gaudi2 accelerators for large-scale experiments.

Key empirical findings include:

  • Standard FP8-trained SwiGLU diverges after ~200B tokens, as evidenced in loss curves and activation distribution visualizations (Fig. 1a, 1b, 4).
  • Smooth-SwiGLU maintains stability throughout, closely matching BF16 in both training and loss metrics (Fig. 6).
  • Outlier formation in activation histograms appears exclusively in channels where $w_1$ and $w_2$ become highly correlated—these spikes are suppressed by Smooth-SwiGLU's per-channel scaling.
  • Detailed underflow/overflow statistics are not reported; loss curves and activation histograms serve as effective proxies for numerical behavior.

4. Performance and Model Quality Implications

Smooth-SwiGLU delivers up to a $\sim 34\%$ throughput improvement versus BF16, with negligible impact on accuracy:

| Setting | Throughput (samples/s, 8 Gaudi2) | Lambada Acc. | HellaSwag Acc. | Perplexity (Wiki) |
|---|---|---|---|---|
| BF16 | 12.65 | 61.98 | 68.30 | 5.59 |
| FP8 + Smooth-SwiGLU | 16.89 (+33.5%) | 61.73 | 68.03 | 5.56 |
| FP8, SwiGLU quantization disabled | 16.07 (+27.0%) | n/a | n/a | n/a |
| Standard SwiGLU (FP8, unstable) | 17.34 (+37.1%) | diverges | diverges | diverges |

Final model quality (Table 3) matches the higher-precision baseline across zero-shot tasks, with accuracy and perplexity differences within statistical noise. The cost of per-channel scaling is minimal and parallelized, negligible compared to matrix multiplication overhead. At inference, scalers are fused into adjacent weight matrices, producing no runtime impact.
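The inference-time fusion mentioned above can be sketched minimally (hypothetical helper; only the linear branch can absorb $s_i$, since the gated branch is nonlinear):

```python
def fuse_scales(w1_rows, w_down_cols, scales):
    """Fold per-channel scales into adjacent weights at deployment:
    row i of the linear-branch up-projection absorbs s_i, and the
    matching input column of the downstream projection absorbs 1/s_i,
    so the end-to-end function is unchanged and no runtime scaling
    remains."""
    fused_up = [[w * s for w in row] for row, s in zip(w1_rows, scales)]
    fused_down = [[w / s for w in col] for col, s in zip(w_down_cols, scales)]
    return fused_up, fused_down
```

The product along any channel path is preserved exactly, which is why the fusion has no runtime impact.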

5. Practical Implementation

Smooth-SwiGLU is architected as a drop-in for the SwiGLU+linear block in FP8 pipelines. The principal implementation steps are:

  1. Per-channel scale computation: with the hidden dimension sliced into channels, compute each $s_i$ from the maximum absolute activation observed over a mini-batch (optionally using a running max for smoothing).
  2. Forward pass integration:
    • Compute $u_i = \hat w_{1,i}^\top Q(x)$ and $v_i = \hat w_{2,i}^\top Q(x)$.
    • Form $p_i = u_i \cdot \mathrm{Swish}(v_i)$.
    • Pre-quantization scaling: $p_i' = Q(s_i \cdot p_i)$.
    • Post-quantization rescale: $\mathrm{Smooth\text{-}SwiGLU}_i = p_i' / s_i$.
  3. Autograd considerations: Treat division by $s_i$ as a fixed scaling of gradients.
  4. Optimizer compatibility: Smooth-SwiGLU integrates directly with newly demonstrated FP8-quantized Adam (first moment in E4M3, second in E5M2).
  5. Edge handling: Clamp $s_i$ to a small $\epsilon$ if activations are zero. Apply momentum updates to $s_i$ to suppress jitter.
  6. Inference fusion: At deployment, $s_i$ and $s_i^{-1}$ are folded into the adjacent pre- and post-linear layer weights, removing any extra compute.
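The forward-pass steps above can be sketched for a single channel as follows (a toy saturating quantizer stands in for real FP8 rounding, and weight quantization is elided, so this is only a sketch):

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def q(v: float) -> float:
    """Toy FP8 stand-in: saturating clamp (real E4M3 also rounds)."""
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))

def swish(z: float) -> float:
    return z / (1.0 + math.exp(-z))

def smooth_swiglu_channel(x, w1_i, w2_i, s_i):
    """Step 2 of the recipe for one channel i, with illustrative names."""
    qx = [q(v) for v in x]                    # quantize the input
    u = sum(a * b for a, b in zip(qx, w1_i))  # linear branch u_i
    v = sum(a * b for a, b in zip(qx, w2_i))  # gated branch v_i
    p = u * swish(v)                          # SwiGLU product p_i
    return q(s_i * p) / s_i                   # scale, quantize, rescale
```

For in-range values and $s_i = 1$ this reduces to plain SwiGLU, consistent with the claim that the modification preserves function capacity.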

6. Significance in Large-Scale FP8 Training

Smooth-SwiGLU represents a minimal, analytically motivated stabilization of activation quantization critical for reliable large-scale LLM training with FP8 arithmetic. Its design both counteracts outlier amplification characteristic of SwiGLU in long training regimes and preserves the functional equivalence, throughput, and accuracy of FP8 LLM training. A reference implementation is maintained at https://github.com/Anonymous1252022/Megatron-DeepSpeed. The approach’s seamless integration with FP8 Adam optimizers and negligible runtime/inference overhead position it as an essential technique for efficient, stable billion-parameter model training at scale (Fishman et al., 2024).
