Modified SwiGLU: Efficiency & Stability
- Modified SwiGLU is a refinement of the original SwiGLU, introducing variants like SwiMGLU and Smooth-SwiGLU to enhance memory efficiency, computational throughput, and training stability.
- SwiMGLU consolidates gating and value projections via complementary binary masks, reducing memory traffic and cutting weight requirements by up to 30% while accelerating inference.
- Smooth-SwiGLU employs per-channel scaling to clamp activations in FP8 quantization, effectively preventing outlier amplification and ensuring stable training over extended token counts.
Modified SwiGLU encompasses a set of advancements to the original Swish-Gated Linear Unit (SwiGLU) formulation aimed at improving memory efficiency, computational throughput, and training stability in large-scale neural networks. Two major lines of modification—SwiMGLU and Smooth-SwiGLU—systematically address the observed hardware inefficiencies and quantization instability of standard SwiGLU, enabling its deployment in latency- and bandwidth-constrained scenarios as well as in full FP8-quantized training regimes (Tajima et al., 29 Jun 2025, Fishman et al., 2024).
1. Mathematical Formulations and Variants
Standard SwiGLU
Given an input $x \in \mathbb{R}^d$ and "up-projection" matrices $W_1, W_2 \in \mathbb{R}^{h \times d}$, with Swish activation $\mathrm{Swish}(z) = z\,\sigma(\beta z)$ where $\sigma$ is the logistic sigmoid, SwiGLU is defined as:

$\mathrm{SwiGLU}(x) = (W_1 x) \odot \mathrm{Swish}(W_2 x)$

The typical forward update with output projection $W_3 \in \mathbb{R}^{d \times h}$ and residual connection is:

$y = x + W_3\,\mathrm{SwiGLU}(x)$
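A minimal NumPy sketch of the forward pass above (weight names follow the text; the sizes and random initialization are illustrative only):

```python
import numpy as np

def swish(z, beta=1.0):
    """Swish activation: z * sigmoid(beta * z)."""
    return z / (1.0 + np.exp(-beta * z))

def swiglu(x, W1, W2):
    """SwiGLU(x) = (W1 @ x) * Swish(W2 @ x), an elementwise product."""
    return (W1 @ x) * swish(W2 @ x)

def ffn_forward(x, W1, W2, W3):
    """Typical FFN update with residual: y = x + W3 @ SwiGLU(x)."""
    return x + W3 @ swiglu(x, W1, W2)

rng = np.random.default_rng(0)
d, h = 8, 32                        # illustrative sizes, not Llama-scale
x = rng.standard_normal(d)
W1 = rng.standard_normal((h, d))
W2 = rng.standard_normal((h, d))
W3 = rng.standard_normal((d, h)) * 0.1
y = ffn_forward(x, W1, W2, W3)
```

Note that the gate and value streams each require a full $h \times d$ mat-vec, which is the memory-traffic cost SwiMGLU targets.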
SwiMGLU (Masked Gated Linear Unit, Swish-activated)
MGLU replaces $W_1, W_2$ with a single full-rank $W \in \mathbb{R}^{h \times d}$ and elementwise binary masks $M_k \in \{0,1\}^{h \times d}$, $k = 1, \dots, K$, each with complementary mask $\bar M_k = 1 - M_k$. For each "route" $k$:

$\mathrm{MGLU}_k(x) = ((M_k \odot W)x) \odot \mathrm{Swish}((\bar M_k \odot W)x)$

The Mixture-of-Element-wise-Gating (MoEG) combines these as:

$\mathrm{MoEG}(x) = \frac{1}{K} \sum_{k=1}^{K} \mathrm{MGLU}_k(x)$

with the final update:

$y = x + W_{\mathrm{out}}\,\mathrm{MoEG}(x)$
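A sketch of the masked routes in NumPy. The value/gate split via complementary masks follows the text; the exact combination rule over $K$ routes (a simple mean here) is an assumption for illustration:

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))

def mglu_route(x, W, M):
    """One MGLU route: mask M selects the value entries of the shared
    weight W; the complementary mask (1 - M) selects the gate entries."""
    value = (M * W) @ x
    gate = ((1.0 - M) * W) @ x
    return value * swish(gate)

def moeg(x, W, masks):
    """Mixture-of-Element-wise-Gating: combine the K route outputs.
    A plain mean is assumed here for illustration."""
    return sum(mglu_route(x, W, M) for M in masks) / len(masks)
```

Because both branches read the same matrix `W`, only one FP16 weight load per element is needed, plus one mask bit per route.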
Smooth-SwiGLU
Under FP8 quantization, SwiGLU intermediate activations become unstable due to outlier amplification via weight alignment. Smooth-SwiGLU applies a per-channel scalar $s_i$ to clamp and later rescale the problematic branch, yielding for neuron/channel $i$: $\mathrm{Smooth\mbox{-}SwiGLU}_{\hat w_{1,i},\hat w_{2,i}}(x) = s_i^{-1}\;Q\Bigl(s_i\,(\hat w_{1,i}^\top Q(x))\;\mathrm{Swish}(\hat w_{2,i}^\top Q(x))\Bigr)$ where $Q(\cdot)$ is the FP8 quantization operator. The real-valued function remains unaltered, but all intermediates are guaranteed to reside within the representable FP8 dynamic range (Fishman et al., 2024).
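The per-channel scaling can be sketched with a toy quantizer. The stand-in `fake_fp8_quant` below models only the limited dynamic range (clipping at the E4M3 maximum of 448), not the reduced mantissa precision, so it is an illustrative assumption rather than bit-exact FP8:

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))

def fake_fp8_quant(z, max_abs=448.0):
    """Toy stand-in for FP8 E4M3 quantization: models only the limited
    dynamic range (clipping), not the reduced mantissa precision."""
    return np.clip(z, -max_abs, max_abs)

def smooth_swiglu_channel(x, w1, w2, s):
    """Per-channel Smooth-SwiGLU: scale the branch by s before
    quantization, then divide the scale back out afterwards."""
    inner = s * (w1 @ fake_fp8_quant(x)) * swish(w2 @ fake_fp8_quant(x))
    return fake_fp8_quant(inner) / s
```

When no clipping is triggered, the scale cancels exactly and the function matches plain SwiGLU; when an outlier would overflow, the scale keeps `inner` inside the representable range.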
2. Architectural and Implementation Differences
SwiGLU employs two independent projection matrices for the gate and value streams, doubling up-projection memory traffic compared to standard feed-forward layers. At inference, SwiGLU thus demands $2hd$ FP16 weight reads per token for the up-projections alone.
SwiMGLU consolidates gating and value projections by leveraging a single FP16 matrix plus $K$ binary masks. During training, masks are stored as real-valued logits, updated via backpropagation and binarized with a straight-through estimator. In inference, binary masks are fused with the weights in a custom kernel (FlashMGLU) that coalesces memory reads and computations, minimizing data transfer and mat-vec operations (Tajima et al., 29 Jun 2025).
Smooth-SwiGLU focuses on numerical stability. It scales and then inversely rescales the pre-quantized activations, so it can be implemented with minimal computational cost: during inference, the scale factors can be absorbed into the first and third linear-layer weights, resulting in zero additional inference overhead (Fishman et al., 2024).
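The "zero additional inference overhead" claim follows from a simple identity: scaling row $i$ of the value projection by $s_i$ and column $i$ of the output projection by $s_i^{-1}$ leaves the real-valued map unchanged. A quick numerical check (sizes illustrative):

```python
import numpy as np

# Fold per-channel scales into the weights: scaling row i of W1 by s_i
# and column i of W3 by 1/s_i leaves the FFN output unchanged, so the
# scales cost nothing at inference time.
rng = np.random.default_rng(1)
d, h = 4, 6
x = rng.standard_normal(d)
W1 = rng.standard_normal((h, d))
W2 = rng.standard_normal((h, d))
W3 = rng.standard_normal((d, h))
s = rng.uniform(0.5, 2.0, size=h)

swish = lambda z: z / (1.0 + np.exp(-z))
ref = W3 @ ((W1 @ x) * swish(W2 @ x))
fused = (W3 / s) @ (((W1 * s[:, None]) @ x) * swish(W2 @ x))
assert np.allclose(ref, fused)
```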
3. Performance Characteristics
Inference-time Speed and Memory Compression
Under FP16 on an NVIDIA RTX 5090 (h=8192, d=2048):
- Naïve PyTorch MGLU: $0.521$ ms per mat-vec,
- FlashMGLU: $0.0265$ ms, a roughly $20\times$ speed-up over the naïve kernel,
- PyTorch GLU: $0.0306$ ms; FlashMGLU is thus slightly faster than even standard GLU.
SwiMGLU stores its up-projection in at most $(16 + K)\,hd$ bits versus $32hd$ bits for SwiGLU's two FP16 matrices, reducing up-projection memory transfer by nearly half for $K = 1$ and cutting per-layer storage from $96$ MB to $64$ MB (plus $2$ MB of overhead per mask) for a Llama-1B FFN layer ($d = 2048$, $h = 8192$) (Tajima et al., 29 Jun 2025).
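The storage figures can be sanity-checked directly from the stated dimensions (three FP16 matrices for a SwiGLU FFN versus two for SwiMGLU, plus one bit per weight for each mask):

```python
# Storage arithmetic for a Llama-1B-style FFN layer
# (d = 2048, h = 8192, FP16 weights, 1-bit masks).
d, h = 2048, 8192
fp16_bytes = 2
mib = 1024**2

swiglu_mib = 3 * h * d * fp16_bytes / mib    # W1, W2, W3
swimglu_mib = 2 * h * d * fp16_bytes / mib   # shared W plus output projection
mask_mib = h * d / 8 / mib                   # one binary mask, 1 bit/weight

assert swiglu_mib == 96.0
assert swimglu_mib == 64.0
assert mask_mib == 2.0
```

This reproduces the 96 MB → 64 MB reduction and the 2 MB-per-mask overhead quoted above.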
Downstream Task Accuracy
Across six benchmarks (zero-shot and two-shot):
| Model | Zero-shot (%) | Two-shot (%) | Weights |
|---|---|---|---|
| SwiGLU | 46.20 | 45.52 | 141M |
| SwiMGLU | 46.48 | 46.40 | 113M+mask |
| SwiGLU | 56.00 | 57.36 | 1.08B |
| SwiMGLU | 56.85 | 57.87 | 808M+mask |
This demonstrates that SwiMGLU matches or slightly exceeds standard SwiGLU accuracy using fewer weights (Tajima et al., 29 Jun 2025).
Quantized Training Stability and Throughput
With full FP8 training:
- Standard FP8+SwiGLU diverges after a few hundred billion tokens due to quadratic outlier amplification.
- FP8+Smooth-SwiGLU remains as stable as the BF16 baseline for at least $300$B tokens.
- Throughput: FP8+Smooth-SwiGLU achieves $16.89$ samples/sec (on 8 Gaudi2 cards, micro-batch 1), a substantial throughput gain over the BF16 baseline (Fishman et al., 2024).
4. Theoretical Analysis: Weight Alignment and Outlier Amplification
SwiGLU's two projection vectors $\hat w_{1,i}$ and $\hat w_{2,i}$ exhibit $L_2$-regularized alignment over prolonged training. At convergence, if the Swish activation saturates ($\mathrm{Swish}(z) \approx z$ for large $z$), KKT conditions enforce $\hat w_{1,i} \propto \hat w_{2,i}$. The network's output then approximates $(\hat w_{1,i}^\top x)^2$, causing moderate increases in the pre-activation to be squared, amplifying outliers. In low-precision regimes such as FP8, this produces spikes far outside the dynamic range, breaking delayed-scaling assumptions and leading to loss divergence after extended training (hundreds of billions of tokens) (Fishman et al., 2024). Smooth-SwiGLU, by range-limiting the branch prior to quantization, effectively blocks this failure mode.
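To make the squaring concrete, a quick numerical check (the values are illustrative; 448 is the FP8 E4M3 maximum):

```python
import numpy as np

# When the two projections align (w1 ≈ w2 = w), the branch output behaves
# like (w @ x)**2 once Swish saturates, since Swish(z) ≈ z for large z.
swish = lambda z: z / (1.0 + np.exp(-z))
w = np.array([1.0, 1.0])
x = np.array([50.0, 50.0])   # a moderate pre-activation outlier: w @ x = 100
z = w @ x
out = z * swish(z)           # aligned SwiGLU branch output
```

A pre-activation of 100, itself representable in FP8, is squared to roughly 10,000, which is far beyond the E4M3 range; this is the spike that breaks delayed scaling.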
5. Deployment Considerations and Practical Takeaways
SwiMGLU is most advantageous in settings with memory or latency constraints. Its memory savings and inference acceleration (roughly $20\times$ faster than the naïve MGLU kernel, and slightly faster than standard GLU) make it suitable for edge devices, HBM-limited servers, and mobile deployments, without sacrificing downstream accuracy or necessitating retraining (Tajima et al., 29 Jun 2025).
Smooth-SwiGLU is a direct drop-in replacement for SwiGLU in full FP8 pipelines, providing theoretical and empirical guarantees for training stability even across trillion-token, multi-hundred-billion-parameter scales. It achieves this without changing the functional map or introducing any measurable impact on final model quality (Fishman et al., 2024).
6. Implementation Notes and Hyperparameter Settings
For SwiMGLU:
- Training employs real-valued mask logits updated by backpropagation; masks are binarized using the straight-through estimator.
- FlashMGLU kernel coalesces memory reads and on-chip computation for maximal efficiency.
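The mask-training recipe above can be sketched in two functions. This is a minimal illustration of a straight-through estimator (function names are ours, not from the paper):

```python
import numpy as np

def binarize_ste(logits):
    """Forward pass: hard-threshold real-valued mask logits to {0, 1}."""
    return (logits > 0.0).astype(np.float64)

def binarize_ste_backward(upstream_grad):
    """Backward pass (straight-through estimator): the non-differentiable
    threshold is treated as the identity, so gradients reach the logits."""
    return upstream_grad
```

In an autodiff framework this would be a custom-gradient op; the key point is that the hard threshold contributes no gradient of its own.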
For Smooth-SwiGLU:
- FP8 formats "E4M3" and "E5M2" are used for activations/weights and gradients, respectively.
- Adam moments are quantized (first: E4M3, second: E5M2).
- Per-channel scales are computed via max-abs statistics over calibration minibatches; at inference, these are incorporated into the static weights (no runtime penalty).
- Other hyperparameters (learning rate, decay, layer norm, etc.) follow Llama 2 defaults (Fishman et al., 2024).
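The max-abs calibration step can be sketched as follows (the function name is illustrative; 448 is the E4M3 maximum magnitude):

```python
import numpy as np

def per_channel_scales(calib_acts, fp8_max=448.0):
    """Max-abs calibration: one scale per channel, chosen so the largest
    magnitude observed over the calibration batch maps to the top of the
    FP8 E4M3 range."""
    max_abs = np.abs(calib_acts).max(axis=0)      # shape: (channels,)
    return fp8_max / np.maximum(max_abs, 1e-12)   # guard against zeros
```

At inference these scales are folded into the adjacent linear-layer weights, as described in Section 2, so no per-token scaling work remains.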
7. Significance and Outlook
Modified SwiGLU architectures—including SwiMGLU and Smooth-SwiGLU—resolve the two principal limitations of the standard SwiGLU: inefficient memory bandwidth and quantization-induced instability. SwiMGLU enables large-scale LLMs to meet hardware and deployment constraints, while Smooth-SwiGLU extends SwiGLU's applicability to ultra-low-precision domains such as FP8 training. These advances preserve or improve downstream performance at reduced computational and storage cost, providing robust building blocks for the next generation of efficient LLMs (Tajima et al., 29 Jun 2025, Fishman et al., 2024).