Modified SwiGLU: Efficiency & Stability
- Modified SwiGLU is a refinement of the original SwiGLU, introducing variants like SwiMGLU and Smooth-SwiGLU to enhance memory efficiency, computational throughput, and training stability.
- SwiMGLU consolidates gating and value projections via complementary binary masks, reducing memory traffic and cutting weight requirements by up to 30% while accelerating inference.
- Smooth-SwiGLU employs per-channel scaling to clamp activations in FP8 quantization, effectively preventing outlier amplification and ensuring stable training over extended token counts.
Modified SwiGLU encompasses a set of advancements to the original Swish-Gated Linear Unit (SwiGLU) formulation aimed at improving memory efficiency, computational throughput, and training stability in large-scale neural networks. Two major lines of modification—SwiMGLU and Smooth-SwiGLU—systematically address the observed hardware inefficiencies and quantization instability of standard SwiGLU, enabling its deployment in latency- and bandwidth-constrained scenarios as well as in full FP8-quantized training regimes (Tajima et al., 29 Jun 2025, Fishman et al., 2024).
1. Mathematical Formulations and Variants
Standard SwiGLU
Given an input $x \in \mathbb{R}^d$ and "up-projection" matrices $W_1, W_2 \in \mathbb{R}^{h \times d}$, with Swish activation $\mathrm{Swish}(z) = z\,\sigma(\beta z)$ where $\sigma$ is the logistic sigmoid, SwiGLU is defined as:

$\mathrm{SwiGLU}(x) = (W_1 x) \odot \mathrm{Swish}(W_2 x)$

The typical forward update with output projection $W_3 \in \mathbb{R}^{d \times h}$ and residual connection is:

$y = x + W_3\,\mathrm{SwiGLU}(x)$
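A minimal NumPy sketch of the forward pass above (weight names follow the text; the sizes and random initialization are illustrative only):

```python
import numpy as np

def swish(z, beta=1.0):
    """Swish activation: z * sigmoid(beta * z)."""
    return z / (1.0 + np.exp(-beta * z))

def swiglu(x, W1, W2):
    """SwiGLU(x) = (W1 @ x) * Swish(W2 @ x), an elementwise product."""
    return (W1 @ x) * swish(W2 @ x)

def ffn_forward(x, W1, W2, W3):
    """Typical FFN update with residual: y = x + W3 @ SwiGLU(x)."""
    return x + W3 @ swiglu(x, W1, W2)

rng = np.random.default_rng(0)
d, h = 8, 32                        # illustrative sizes, not Llama-scale
x = rng.standard_normal(d)
W1 = rng.standard_normal((h, d))
W2 = rng.standard_normal((h, d))
W3 = rng.standard_normal((d, h)) * 0.1
y = ffn_forward(x, W1, W2, W3)
```

Note that the gate and value streams each require a full $h \times d$ mat-vec, which is the memory-traffic cost SwiMGLU targets.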
SwiMGLU (Masked Gated Linear Unit, Swish-activated)
MGLU replaces $W_1, W_2$ with a single full-rank $W \in \mathbb{R}^{h \times d}$ and elementwise binary masks $M_k \in \{0,1\}^{h \times d}$, $k = 1, \dots, K$, each with complementary mask $\bar M_k = 1 - M_k$. For each "route" $k$:

$\mathrm{MGLU}_k(x) = ((M_k \odot W)x) \odot \mathrm{Swish}((\bar M_k \odot W)x)$

The Mixture-of-Element-wise-Gating (MoEG) combines these as:

$\mathrm{MoEG}(x) = \frac{1}{K} \sum_{k=1}^{K} \mathrm{MGLU}_k(x)$

with the final update:

$y = x + W_{\mathrm{out}}\,\mathrm{MoEG}(x)$
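A sketch of the masked routes in NumPy. The value/gate split via complementary masks follows the text; the exact combination rule over $K$ routes (a simple mean here) is an assumption for illustration:

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))

def mglu_route(x, W, M):
    """One MGLU route: mask M selects the value entries of the shared
    weight W; the complementary mask (1 - M) selects the gate entries."""
    value = (M * W) @ x
    gate = ((1.0 - M) * W) @ x
    return value * swish(gate)

def moeg(x, W, masks):
    """Mixture-of-Element-wise-Gating: combine the K route outputs.
    A plain mean is assumed here for illustration."""
    return sum(mglu_route(x, W, M) for M in masks) / len(masks)
```

Because both branches read the same matrix `W`, only one FP16 weight load per element is needed, plus one mask bit per route.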
Smooth-SwiGLU
Under FP8 quantization, SwiGLU intermediate activations become unstable due to outlier amplification via weight alignment. Smooth-SwiGLU applies a per-channel scalar $s_i$ to clamp and later rescale the problematic branch, yielding for neuron/channel $i$: $\mathrm{Smooth\mbox{-}SwiGLU}_{\hat w_{1,i},\hat w_{2,i}}(x) = s_i^{-1}\;Q\Bigl(s_i\,(\hat w_{1,i}^\top Q(x))\;\mathrm{Swish}(\hat w_{2,i}^\top Q(x))\Bigr)$ where $Q(\cdot)$ is the FP8 quantization operator. The real-valued function remains unaltered, but all intermediates are guaranteed to reside within the representable FP8 dynamic range (Fishman et al., 2024).
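The per-channel scaling can be sketched with a toy quantizer. The stand-in `fake_fp8_quant` below models only the limited dynamic range (clipping at the E4M3 maximum of 448), not the reduced mantissa precision, so it is an illustrative assumption rather than bit-exact FP8:

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))

def fake_fp8_quant(z, max_abs=448.0):
    """Toy stand-in for FP8 E4M3 quantization: models only the limited
    dynamic range (clipping), not the reduced mantissa precision."""
    return np.clip(z, -max_abs, max_abs)

def smooth_swiglu_channel(x, w1, w2, s):
    """Per-channel Smooth-SwiGLU: scale the branch by s before
    quantization, then divide the scale back out afterwards."""
    inner = s * (w1 @ fake_fp8_quant(x)) * swish(w2 @ fake_fp8_quant(x))
    return fake_fp8_quant(inner) / s
```

When no clipping is triggered, the scale cancels exactly and the function matches plain SwiGLU; when an outlier would overflow, the scale keeps `inner` inside the representable range.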
2. Architectural and Implementation Differences
SwiGLU employs two independent projection matrices for the gate and value streams, doubling up-projection memory traffic compared to standard feed-forward layers. At inference, SwiGLU thus demands $2hd$ FP16 weight reads per token for the up-projections alone.
SwiMGLU consolidates gating and value projections by leveraging a single FP16 matrix plus $K$ binary masks. During training, masks are stored as real-valued logits, updated via backpropagation and binarized with a straight-through estimator. In inference, binary masks are fused with the weights in a custom kernel (FlashMGLU) that coalesces memory reads and computations, minimizing data transfer and mat-vec operations (Tajima et al., 29 Jun 2025).
Smooth-SwiGLU focuses on numerical stability. It scales and then inversely rescales the pre-quantized activations, so it can be implemented with minimal computational cost: during inference, the scale factors can be absorbed into the first and third linear-layer weights, resulting in zero additional inference overhead (Fishman et al., 2024).
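The "zero additional inference overhead" claim follows from a simple identity: scaling row $i$ of the value projection by $s_i$ and column $i$ of the output projection by $s_i^{-1}$ leaves the real-valued map unchanged. A quick numerical check (sizes illustrative):

```python
import numpy as np

# Fold per-channel scales into the weights: scaling row i of W1 by s_i
# and column i of W3 by 1/s_i leaves the FFN output unchanged, so the
# scales cost nothing at inference time.
rng = np.random.default_rng(1)
d, h = 4, 6
x = rng.standard_normal(d)
W1 = rng.standard_normal((h, d))
W2 = rng.standard_normal((h, d))
W3 = rng.standard_normal((d, h))
s = rng.uniform(0.5, 2.0, size=h)

swish = lambda z: z / (1.0 + np.exp(-z))
ref = W3 @ ((W1 @ x) * swish(W2 @ x))
fused = (W3 / s) @ (((W1 * s[:, None]) @ x) * swish(W2 @ x))
assert np.allclose(ref, fused)
```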
3. Performance Characteristics
Inference-time Speed and Memory Compression
Under FP16 on an NVIDIA RTX 5090 (h=8192, d=2048):
- Naïve PyTorch MGLU: $0.521$ ms per mat-vec,
- FlashMGLU: $0.0265$ ms, a roughly $20\times$ speed-up over the naïve kernel,
- PyTorch GLU: $0.0306$ ms; FlashMGLU is thus slightly faster than even standard GLU.
SwiMGLU stores its up-projection in at most $(16 + K)\,hd$ bits versus $32hd$ bits for SwiGLU's two FP16 matrices, reducing up-projection memory transfer by nearly half for $K = 1$ and cutting per-layer storage from $96$ MB to $64$ MB (plus $2$ MB of overhead per mask) for a Llama-1B FFN layer ($d = 2048$, $h = 8192$) (Tajima et al., 29 Jun 2025).
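The storage figures can be sanity-checked directly from the stated dimensions (three FP16 matrices for a SwiGLU FFN versus two for SwiMGLU, plus one bit per weight for each mask):

```python
# Storage arithmetic for a Llama-1B-style FFN layer
# (d = 2048, h = 8192, FP16 weights, 1-bit masks).
d, h = 2048, 8192
fp16_bytes = 2
mib = 1024**2

swiglu_mib = 3 * h * d * fp16_bytes / mib    # W1, W2, W3
swimglu_mib = 2 * h * d * fp16_bytes / mib   # shared W plus output projection
mask_mib = h * d / 8 / mib                   # one binary mask, 1 bit/weight

assert swiglu_mib == 96.0
assert swimglu_mib == 64.0
assert mask_mib == 2.0
```

This reproduces the 96 MB → 64 MB reduction and the 2 MB-per-mask overhead quoted above.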
Downstream Task Accuracy
Across six benchmarks (zero-shot and two-shot):
| Model | Zero-shot (%) | Two-shot (%) | Weights |
|---|---|---|---|
| SwiGLU | 46.20 | 45.52 | 141M |
| SwiMGLU | 46.48 | 46.40 | 113M+mask |
| SwiGLU | 56.00 | 57.36 | 1.08B |
| SwiMGLU | 56.85 | 57.87 | 808M+mask |
This demonstrates that SwiMGLU matches or slightly exceeds standard SwiGLU accuracy using fewer weights (Tajima et al., 29 Jun 2025).
Quantized Training Stability and Throughput
With full FP8 training:
- Standard FP8+SwiGLU diverges after a few hundred billion tokens due to quadratic outlier amplification.
- FP8+Smooth-SwiGLU remains as stable as the BF16 baseline for at least $300$B tokens.
- Throughput: FP8+Smooth-SwiGLU achieves $16.89$ samples/sec (on 8 Gaudi2 cards, micro-batch 1), a substantial throughput gain over the BF16 baseline (Fishman et al., 2024).
4. Theoretical Analysis: Weight Alignment and Outlier Amplification
SwiGLU's two projection vectors $\hat w_{1,i}$ and $\hat w_{2,i}$ exhibit $L_2$-regularized alignment over prolonged training. At convergence, if the Swish activation saturates ($\mathrm{Swish}(z) \approx z$ for large $z$), KKT conditions enforce $\hat w_{1,i} \propto \hat w_{2,i}$. The network's output then approximates $(\hat w_{1,i}^\top x)^2$, causing moderate increases in the pre-activation to be squared, amplifying outliers. In low-precision regimes such as FP8, this produces spikes far outside the dynamic range, breaking delayed-scaling assumptions and leading to loss divergence after extended training (hundreds of billions of tokens) (Fishman et al., 2024). Smooth-SwiGLU, by range-limiting the branch prior to quantization, effectively blocks this failure mode.
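To make the squaring concrete, a quick numerical check (the values are illustrative; 448 is the FP8 E4M3 maximum):

```python
import numpy as np

# When the two projections align (w1 ≈ w2 = w), the branch output behaves
# like (w @ x)**2 once Swish saturates, since Swish(z) ≈ z for large z.
swish = lambda z: z / (1.0 + np.exp(-z))
w = np.array([1.0, 1.0])
x = np.array([50.0, 50.0])   # a moderate pre-activation outlier: w @ x = 100
z = w @ x
out = z * swish(z)           # aligned SwiGLU branch output
```

A pre-activation of 100, itself representable in FP8, is squared to roughly 10,000, which is far beyond the E4M3 range; this is the spike that breaks delayed scaling.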
5. Deployment Considerations and Practical Takeaways
SwiMGLU is most advantageous in settings with memory or latency constraints. Its memory savings and inference acceleration (roughly $20\times$ faster than the naïve MGLU kernel, and slightly faster than standard GLU) make it suitable for edge devices, HBM-limited servers, and mobile deployments, without sacrificing downstream accuracy or necessitating retraining (Tajima et al., 29 Jun 2025).
Smooth-SwiGLU is a direct drop-in replacement for SwiGLU in full FP8 pipelines, providing theoretical and empirical guarantees for training stability even across trillion-token, multi-hundred-billion-parameter scales. It achieves this without changing the functional map or introducing any measurable impact on final model quality (Fishman et al., 2024).
6. Implementation Notes and Hyperparameter Settings
For SwiMGLU:
- Training employs real-valued mask logits updated by backpropagation; masks are binarized using the straight-through estimator.
- FlashMGLU kernel coalesces memory reads and on-chip computation for maximal efficiency.
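The mask-training recipe above can be sketched in two functions. This is a minimal illustration of a straight-through estimator (function names are ours, not from the paper):

```python
import numpy as np

def binarize_ste(logits):
    """Forward pass: hard-threshold real-valued mask logits to {0, 1}."""
    return (logits > 0.0).astype(np.float64)

def binarize_ste_backward(upstream_grad):
    """Backward pass (straight-through estimator): the non-differentiable
    threshold is treated as the identity, so gradients reach the logits."""
    return upstream_grad
```

In an autodiff framework this would be a custom-gradient op; the key point is that the hard threshold contributes no gradient of its own.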
For Smooth-SwiGLU:
- FP8 formats "E4M3" and "E5M2" are used for activations/weights and gradients, respectively.
- Adam moments are quantized (first: E4M3, second: E5M2).
- Per-channel scales are computed via max-abs statistics over calibration minibatches; at inference, these are incorporated into the static weights (no runtime penalty).
- Other hyperparameters (learning rate, decay, layer norm, etc.) follow Llama 2 defaults (Fishman et al., 2024).
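The max-abs calibration step can be sketched as follows (the function name is illustrative; 448 is the E4M3 maximum magnitude):

```python
import numpy as np

def per_channel_scales(calib_acts, fp8_max=448.0):
    """Max-abs calibration: one scale per channel, chosen so the largest
    magnitude observed over the calibration batch maps to the top of the
    FP8 E4M3 range."""
    max_abs = np.abs(calib_acts).max(axis=0)      # shape: (channels,)
    return fp8_max / np.maximum(max_abs, 1e-12)   # guard against zeros
```

At inference these scales are folded into the adjacent linear-layer weights, as described in Section 2, so no per-token scaling work remains.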
7. Significance and Outlook
Modified SwiGLU architectures—including SwiMGLU and Smooth-SwiGLU—resolve the two principal limitations of the standard SwiGLU: inefficient memory bandwidth and quantization-induced instability. SwiMGLU enables large-scale LLMs to meet hardware and deployment constraints, while Smooth-SwiGLU extends SwiGLU's applicability to ultra-low-precision domains such as FP8 training. These advances preserve or improve downstream performance at reduced computational and storage cost, providing robust building blocks for the next generation of efficient LLMs (Tajima et al., 29 Jun 2025, Fishman et al., 2024).