MGF-softmax: HE Softmax Reformulation
- MGF-softmax is a reformulation of the softmax function using the moment generating function to significantly reduce circuit depth and computational overhead in homomorphic encryption.
- It preserves key properties like shift-invariance and asymptotically converges to standard softmax as input dimensions grow, ensuring representational accuracy.
- Empirical benchmarks on transformers and language models show near-plaintext accuracy with drastically lower runtime and hardware cost.
MGF-softmax is a reformulation of the softmax function leveraging the moment generating function (MGF), specifically designed to address the computational constraints inherent to privacy-preserving machine learning with homomorphic encryption (HE). Traditional softmax incurs substantial multiplicative depth and circuit complexity under HE, posing critical challenges for efficient encrypted inference, particularly in transformer-based architectures. MGF-softmax replaces the softmax denominator with an MGF-based normalization, drastically reducing multiplicative depth and computational overhead, while asymptotically preserving the representational properties of the standard softmax as the input dimension grows (Park et al., 2 Feb 2026).
1. Mathematical Formulation and Principle
MGF-softmax reinterprets the softmax transformation through the probabilistic lens of the moment generating function. For an input vector $x = (x_1, \dots, x_n)$, the standard softmax is

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}.$$

The denominator, $\sum_{j=1}^{n} e^{x_j} = n \cdot \tfrac{1}{n}\sum_{j=1}^{n} e^{x_j}$, is recast as an empirical mean approximating the MGF at $t = 1$ for a random variable $X$ whose samples are $x_1, \dots, x_n$. The true (ensemble) mean is $M_X(1) = \mathbb{E}[e^{X}]$. MGF-softmax utilizes this as:

$$\mathrm{MGF\text{-}softmax}(x)_i = \frac{e^{x_i}}{n\, M_X(1)}.$$

Equivalently, the cumulant generating function $K_X(t) = \log M_X(t)$ provides a normalization shift, yielding:

$$\mathrm{MGF\text{-}softmax}(x)_i = e^{x_i - K_X(1) - \log n},$$

where, under approximate normality, $K_X(1)$ is estimated from the first two empirical moments as $\hat{\mu} + \hat{\sigma}^2/2$. This replacement transforms the normalization step from a sum of exponentials to a moment-based scalar, inherently smoother and more tractable for polynomial approximations compatible with HE schemes (Park et al., 2 Feb 2026).
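A minimal plaintext sketch of this formulation in NumPy (not an HE implementation; the function names are illustrative, and the moment-based estimate of $K_X(1)$ relies on the approximate-normality assumption discussed in Section 5):

```python
import numpy as np

def softmax(x):
    """Standard softmax with max-subtraction for numerical stability."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mgf_softmax(x):
    """MGF-softmax: subtract the cumulant shift mu + sigma^2/2 + log n,
    estimated from the first two empirical moments (normality assumption),
    instead of dividing by a sum of exponentials."""
    n = x.size
    shift = x.mean() + 0.5 * x.var() + np.log(n)  # ~ K_X(1) + log n
    return np.exp(x - shift)                      # no division, no max/comparison

x = np.random.default_rng(0).normal(size=64)
print(np.abs(mgf_softmax(x) - softmax(x)).max())  # small for Gaussian-like inputs
```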
2. Theoretical Properties
Shift-Invariance
MGF-softmax exhibits the same shift-invariance as classical softmax. For any constant $c \in \mathbb{R}$, shifting all entries by $c$ leaves outputs unchanged: since $K_{X+c}(1) = K_X(1) + c$,

$$e^{(x_i + c) - K_{X+c}(1) - \log n} = e^{x_i - K_X(1) - \log n}.$$

This property, critical for the numerical stability of attention mechanisms in deep learning, is preserved under the MGF-based normalization (Park et al., 2 Feb 2026).
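A quick numerical check of this cancellation, using the same moment-based shift as above (the helper name mgf_shift is illustrative):

```python
import numpy as np

def mgf_shift(x):
    # cumulant shift mu + sigma^2/2 + log n from the formulation in Section 1
    return x.mean() + 0.5 * x.var() + np.log(x.size)

rng = np.random.default_rng(0)
x, c = rng.normal(size=64), 5.0
# hat-mu moves by exactly c while hat-sigma^2 is unchanged, so outputs coincide
print(np.allclose(np.exp((x + c) - mgf_shift(x + c)),
                  np.exp(x - mgf_shift(x))))  # True
```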
Asymptotic Convergence
MGF-softmax approximates standard softmax more accurately as the input dimension $n$ increases. The central limit theorem ensures that, with $x_1, \dots, x_n$ i.i.d., the deviation

$$\left|\frac{1}{n}\sum_{j=1}^{n} e^{x_j} - M_X(1)\right|$$

vanishes as $n \to \infty$, with Gaussian fluctuations of order $O(1/\sqrt{n})$ whose tail probabilities are governed by $\Phi$, the standard normal CDF. Therefore, MGF-softmax is asymptotically equivalent to standard softmax under mild distributional conditions (Park et al., 2 Feb 2026).
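A self-contained Monte Carlo sketch of this convergence for standard normal inputs, where $M_X(1) = e^{1/2}$ in closed form (dimensions and trial counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
M_true = np.exp(0.5)  # M_X(1) = E[e^X] = e^{1/2} for X ~ N(0, 1)
for n in (16, 64, 256, 1024, 4096):
    x = rng.normal(size=(500, n))                       # 500 trials per dimension
    emp = np.exp(x).mean(axis=1)                        # empirical mean of e^{x_j}
    mgf = np.exp(x.mean(axis=1) + 0.5 * x.var(axis=1))  # moment-based estimate
    print(n, np.abs(emp - M_true).mean(), np.abs(mgf - M_true).mean())
# both estimators concentrate around M_X(1) at the O(1/sqrt(n)) rate the CLT predicts
```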
Multiplicative Depth Reduction
By eliminating division and max-subtraction operations, MGF-softmax requires only basic polynomial operations (mean, variance, exponentiation) implementable via addition and multiplication. The circuit depth is reduced substantially relative to the Chebyshev-based baseline, a marked improvement for encrypted inference where multiplicative depth is a primary bottleneck (Park et al., 2 Feb 2026).
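To make the depth argument concrete, the sketch below uses the classical limit approximation $e^x \approx (1 + x/2^k)^{2^k}$, one of the limit-based options mentioned in Section 3, computed by $k$ repeated squarings so that only additions and multiplications appear; the choice $k = 8$ and the input domain are illustrative, not the paper's parameters:

```python
import numpy as np

def poly_exp(x, k=8):
    """Limit-based exponential e^x ~ (1 + x/2^k)^(2^k): one scaled addition
    followed by k squarings, i.e. multiplicative depth k with no division
    or comparison -- directly expressible in an HE arithmetic circuit."""
    y = 1.0 + x / 2.0**k  # scaling by a public constant is cheap under CKKS
    for _ in range(k):
        y = y * y         # each squaring consumes one multiplicative level
    return y

x = np.linspace(-4.0, 1.0, 6)
print(np.abs(poly_exp(x) - np.exp(x)).max())  # small error on a bounded domain
```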
3. Algorithmic Implementation in Homomorphic Encryption
The implementation of MGF-softmax in the CKKS HE scheme relies on ciphertext packing and efficient rotation patterns to compute row-wise means and variances. Given a packed matrix of attention scores, row-wise statistics are aggregated using slot rotations and additions. The core computational steps are as follows:
- Compute the sample mean $\hat{\mu}$ per row.
- Center inputs and accumulate the variance $\hat{\sigma}^2$.
- Construct the cumulant shift $\hat{\mu} + \hat{\sigma}^2/2 + \log n$ and subtract it from each entry.
- Apply polynomial exponential via Chebyshev or limit-based approximations, optionally with domain scaling.
This workflow enables a high-throughput, depth-efficient softmax suitable for large-scale inference over encrypted data without requiring costly bootstrapping per arithmetic layer (Park et al., 2 Feb 2026).
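A plaintext sketch of these four steps, with np.roll standing in for CKKS slot rotation and a single packed row for clarity (a real ciphertext packs many rows and uses segment-local rotations); the rotate-and-add pattern and the cumulant shift follow the formulation in Section 1:

```python
import numpy as np

def rotate_sum(v):
    """Sum all n slots in log2(n) rotate-and-add steps (n a power of two),
    mimicking the CKKS rotation pattern; afterwards every slot holds the total."""
    acc, step = v.copy(), 1
    while step < v.size:
        acc = acc + np.roll(acc, -step)
        step *= 2
    return acc

n = 64
x = np.random.default_rng(2).normal(size=n)
mu = rotate_sum(x)[0] / n                  # step 1: sample mean
var = rotate_sum((x - mu) ** 2)[0] / n     # step 2: centered second moment
shift = mu + 0.5 * var + np.log(n)         # step 3: cumulant shift
y = np.exp(x - shift)                      # step 4: (polynomial) exponential
print(np.abs(y - np.exp(x) / np.exp(x).sum()).max())  # gap vs. exact softmax
```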
Circuit Complexity Comparison
(Comparison table: multiplicative depth, ciphertext multiplications (# CMult), rotations (# Rot), and bootstrap counts for the Cho et al. baseline versus MGF-softmax.)
A plausible implication is that homomorphic inference for transformers and ViTs with MGF-softmax can be realized at vastly reduced runtime and hardware cost compared to prior approaches.
4. Empirical Performance and Applications
MGF-softmax has been experimentally validated on ImageNet-1k with Vision Transformers (ViT/DeiT Tiny and Base) and on LLMs such as LLaMA-3.2-1B across diverse NLP benchmarks (Clinc150, Banking77, SST-2). Results indicate that:
- For inference depths in the range 7–10, MGF-softmax achieves accuracy within 1% of plaintext models.
- Low-degree variants outperform alternative polynomial baselines (e.g., Powerformer/BPMax) by 6–10% on large-class tasks.
- Total runtime for a softmax drops from 105.74s (baseline) to 3.06s with MGF-softmax, with timing breakdowns attributable primarily to the elimination of bootstrapping and heavy multiplication (Park et al., 2 Feb 2026).
These benchmarks confirm the practical viability of MGF-softmax for encrypted inference in real-world, privacy-sensitive domains such as healthcare and finance.
5. Distributional Assumptions, Limitations, and Extensions
The approximation accuracy of MGF-softmax depends on the distribution of the input entries. The principal assumption is i.i.d. entries with approximate normality. For small $n$, empirical moment estimates may bias the normalization, increasing error. Prospective extensions include:
- Incorporating higher-order cumulants to accommodate heavy-tailed distributions (a minimal sketch follows this list).
- Adaptive selection of moments per row to match input statistics.
- Extending the MGF-reformulation paradigm to other normalization layers (e.g., layer-norm).
- Integration with hybrid HE + Multi-Party Computation (MPC) protocols to address low- or worst-case error instances (Park et al., 2 Feb 2026).
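As a purely illustrative instance of the first extension above, a hypothetical variant could append the third-cumulant term of the expansion $K_X(1) \approx \mu + \sigma^2/2 + \kappa_3/6$ to the normalization shift; the function name and the specific correction are assumptions, not from the source:

```python
import numpy as np

def mgf_softmax_k3(x):
    """Hypothetical third-cumulant variant: extends the normality-based shift
    mu + sigma^2/2 + log n with the skewness term kappa_3 / 6 (illustrative)."""
    n, mu = x.size, x.mean()
    c = x - mu
    var, k3 = (c ** 2).mean(), (c ** 3).mean()  # third central moment = kappa_3
    return np.exp(x - (mu + var / 2.0 + k3 / 6.0 + np.log(n)))

rng = np.random.default_rng(3)
x = np.log(rng.gamma(2.0, size=256))            # skewed, non-Gaussian inputs
ref = np.exp(x) / np.exp(x).sum()               # exact softmax reference
print(np.abs(mgf_softmax_k3(x) - ref).max())
```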
These directions suggest MGF-softmax serves as a template for further innovations in encrypted deep learning and normalization method reformulation for constrained arithmetic domains.
6. Context within Broader Softmax Gating Paradigms
While MGF-softmax directly addresses HE efficiency issues, other softmax-gated models—such as softmax gating in mixture-of-experts (MoE) or adaptive fusion in multimodal architectures—exploit the gating function's nonlinearity for interpretability and dynamic adaptation (Yap et al., 23 Nov 2025, Nguyen et al., 2023). MGF-softmax differentiates itself by treating normalization as a functional transformation amenable to moment-based statistical approximation, as opposed to parameterized gating for mixture allocation or adaptive fusion. Notably, shift-invariance and normalization remain consistent across these paradigms, but MGF-softmax is distinctive in adapting the normalization to environments lacking native division or comparison circuits.
7. Research Impact and Ongoing Developments
MGF-softmax marks a significant advance in enabling practical, accurate, and low-latency encrypted inference for deep learning architectures that rely on softmax computations. Its theoretical grounding in classical probability and cumulant expansions, combined with empirical validation on state-of-the-art models, underscores its relevance for privacy-preserving AI. Ongoing research aims to generalize these ideas to broader classes of nonlinearities and normalization schemes, and to refine approximation strategies by leveraging input distribution properties and statistical learning theories (Park et al., 2 Feb 2026).