MGF-softmax: HE Softmax Reformulation
- MGF-softmax is a reformulation of the softmax function using the moment generating function to significantly reduce circuit depth and computational overhead in homomorphic encryption.
- It preserves key properties like shift-invariance and asymptotically converges to standard softmax as input dimensions grow, ensuring representational accuracy.
- Empirical benchmarks on transformers and language models show near-plaintext accuracy with drastically lower runtime and hardware cost.
MGF-softmax is a reformulation of the softmax function leveraging the moment generating function (MGF), specifically designed to address the computational constraints inherent to privacy-preserving machine learning with homomorphic encryption (HE). Traditional softmax incurs substantial multiplicative depth and circuit complexity under HE, posing critical challenges for efficient encrypted inference, particularly in transformer-based architectures. MGF-softmax replaces the softmax denominator with an MGF-based normalization, drastically reducing multiplicative depth and computational overhead, while asymptotically preserving the representational properties of the standard softmax as the input dimension grows (Park et al., 2 Feb 2026).
1. Mathematical Formulation and Principle
MGF-softmax reinterprets the softmax transformation through the probabilistic lens of the moment generating function. For an input vector $x = (x_1, \dots, x_n)$, the standard softmax is

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}.$$

The denominator, $\sum_{j=1}^{n} e^{x_j} = n \cdot \tfrac{1}{n}\sum_{j=1}^{n} e^{x_j}$, is recast as an empirical mean approximating the MGF at $t = 1$ for a random variable $X$ whose samples are $x_1, \dots, x_n$. The true (ensemble) mean is $M_X(1) = \mathbb{E}[e^{X}]$. MGF-softmax utilizes this as:

$$\mathrm{MGF\text{-}softmax}(x)_i = \frac{e^{x_i}}{n\, M_X(1)}.$$

Equivalently, the cumulant generating function $K_X(t) = \log M_X(t)$ provides a normalization shift, yielding:

$$\mathrm{MGF\text{-}softmax}(x)_i = e^{x_i - K_X(1) - \log n},$$

where, under approximate normality, $K_X(1)$ is estimated from the first two empirical moments as $\hat{\mu} + \hat{\sigma}^2/2$. This replacement transforms the normalization step from a sum of exponentials to a moment-based scalar, inherently smoother and more tractable for polynomial approximations compatible with HE schemes (Park et al., 2 Feb 2026).
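A minimal plaintext sketch of this formulation in NumPy (not an HE implementation; the function names are illustrative, and the moment-based estimate of $K_X(1)$ relies on the approximate-normality assumption discussed in Section 5):

```python
import numpy as np

def softmax(x):
    """Standard softmax with max-subtraction for numerical stability."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mgf_softmax(x):
    """MGF-softmax: subtract the cumulant shift mu + sigma^2/2 + log n,
    estimated from the first two empirical moments (normality assumption),
    instead of dividing by a sum of exponentials."""
    n = x.size
    shift = x.mean() + 0.5 * x.var() + np.log(n)  # ~ K_X(1) + log n
    return np.exp(x - shift)                      # no division, no max/comparison

x = np.random.default_rng(0).normal(size=64)
print(np.abs(mgf_softmax(x) - softmax(x)).max())  # small for Gaussian-like inputs
```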
2. Theoretical Properties
Shift-Invariance
MGF-softmax exhibits the same shift-invariance as classical softmax. For any constant $c \in \mathbb{R}$, shifting all entries by $c$ leaves outputs unchanged: since $K_{X+c}(1) = K_X(1) + c$,

$$e^{(x_i + c) - K_{X+c}(1) - \log n} = e^{x_i - K_X(1) - \log n}.$$

This property, critical for the numerical stability of attention mechanisms in deep learning, is preserved under the MGF-based normalization (Park et al., 2 Feb 2026).
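A quick numerical check of this cancellation, using the same moment-based shift as above (the helper name mgf_shift is illustrative):

```python
import numpy as np

def mgf_shift(x):
    # cumulant shift mu + sigma^2/2 + log n from the formulation in Section 1
    return x.mean() + 0.5 * x.var() + np.log(x.size)

rng = np.random.default_rng(0)
x, c = rng.normal(size=64), 5.0
# hat-mu moves by exactly c while hat-sigma^2 is unchanged, so outputs coincide
print(np.allclose(np.exp((x + c) - mgf_shift(x + c)),
                  np.exp(x - mgf_shift(x))))  # True
```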
Asymptotic Convergence
MGF-softmax approximates standard softmax more accurately as the input dimension $n$ increases. The central limit theorem ensures that, with $x_1, \dots, x_n$ i.i.d., the deviation

$$\left|\frac{1}{n}\sum_{j=1}^{n} e^{x_j} - M_X(1)\right|$$

vanishes as $n \to \infty$, with Gaussian fluctuations of order $O(1/\sqrt{n})$ whose tail probabilities are governed by $\Phi$, the standard normal CDF. Therefore, MGF-softmax is asymptotically equivalent to standard softmax under mild distributional conditions (Park et al., 2 Feb 2026).
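A self-contained Monte Carlo sketch of this convergence for standard normal inputs, where $M_X(1) = e^{1/2}$ in closed form (dimensions and trial counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
M_true = np.exp(0.5)  # M_X(1) = E[e^X] = e^{1/2} for X ~ N(0, 1)
for n in (16, 64, 256, 1024, 4096):
    x = rng.normal(size=(500, n))                       # 500 trials per dimension
    emp = np.exp(x).mean(axis=1)                        # empirical mean of e^{x_j}
    mgf = np.exp(x.mean(axis=1) + 0.5 * x.var(axis=1))  # moment-based estimate
    print(n, np.abs(emp - M_true).mean(), np.abs(mgf - M_true).mean())
# both estimators concentrate around M_X(1) at the O(1/sqrt(n)) rate the CLT predicts
```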
Multiplicative Depth Reduction
By eliminating division and max-subtraction operations, MGF-softmax requires only basic polynomial operations (mean, variance, exponentiation) implementable via addition and multiplication. The circuit depth is reduced substantially relative to the Chebyshev-based baseline, a marked improvement for encrypted inference where multiplicative depth is a primary bottleneck (Park et al., 2 Feb 2026).
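To make the depth argument concrete, the sketch below uses the classical limit approximation $e^x \approx (1 + x/2^k)^{2^k}$, one of the limit-based options mentioned in Section 3, computed by $k$ repeated squarings so that only additions and multiplications appear; the choice $k = 8$ and the input domain are illustrative, not the paper's parameters:

```python
import numpy as np

def poly_exp(x, k=8):
    """Limit-based exponential e^x ~ (1 + x/2^k)^(2^k): one scaled addition
    followed by k squarings, i.e. multiplicative depth k with no division
    or comparison -- directly expressible in an HE arithmetic circuit."""
    y = 1.0 + x / 2.0**k  # scaling by a public constant is cheap under CKKS
    for _ in range(k):
        y = y * y         # each squaring consumes one multiplicative level
    return y

x = np.linspace(-4.0, 1.0, 6)
print(np.abs(poly_exp(x) - np.exp(x)).max())  # small error on a bounded domain
```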
3. Algorithmic Implementation in Homomorphic Encryption
The implementation of MGF-softmax in the CKKS HE scheme relies on ciphertext packing and efficient rotation patterns to compute row-wise means and variances. Given a packed matrix of attention scores, row-wise statistics are aggregated using slot rotations and additions. The core computational steps are as follows:
- Compute the sample mean $\hat{\mu}$ per row.
- Center inputs and accumulate the variance $\hat{\sigma}^2$.
- Construct the cumulant shift $\hat{\mu} + \hat{\sigma}^2/2 + \log n$ and subtract it from each entry.
- Apply polynomial exponential via Chebyshev or limit-based approximations, optionally with domain scaling.
This workflow enables a high-throughput, depth-efficient softmax suitable for large-scale inference over encrypted data without requiring costly bootstrapping per arithmetic layer (Park et al., 2 Feb 2026).
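A plaintext sketch of these four steps, with np.roll standing in for CKKS slot rotation and a single packed row for clarity (a real ciphertext packs many rows and uses segment-local rotations); the rotate-and-add pattern and the cumulant shift follow the formulation in Section 1:

```python
import numpy as np

def rotate_sum(v):
    """Sum all n slots in log2(n) rotate-and-add steps (n a power of two),
    mimicking the CKKS rotation pattern; afterwards every slot holds the total."""
    acc, step = v.copy(), 1
    while step < v.size:
        acc = acc + np.roll(acc, -step)
        step *= 2
    return acc

n = 64
x = np.random.default_rng(2).normal(size=n)
mu = rotate_sum(x)[0] / n                  # step 1: sample mean
var = rotate_sum((x - mu) ** 2)[0] / n     # step 2: centered second moment
shift = mu + 0.5 * var + np.log(n)         # step 3: cumulant shift
y = np.exp(x - shift)                      # step 4: (polynomial) exponential
print(np.abs(y - np.exp(x) / np.exp(x).sum()).max())  # gap vs. exact softmax
```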
Circuit Complexity Comparison
(Comparison table: multiplicative depth, ciphertext multiplications (# CMult), rotations (# Rot), and bootstrap counts for the Cho et al. baseline versus MGF-softmax.)
A plausible implication is that homomorphic inference for transformers and ViTs with MGF-softmax can be realized at vastly reduced runtime and hardware cost compared to prior approaches.
4. Empirical Performance and Applications
MGF-softmax has been experimentally validated on ImageNet-1k with Vision Transformers (ViT/DeiT Tiny and Base) and on LLMs such as LLaMA-3.2-1B across diverse NLP benchmarks (Clinc150, Banking77, SST-2). Results indicate that:
- For inference depths in the range 7–10, MGF-softmax achieves accuracy within 1% of plaintext models.
- Low-degree variants outperform alternative polynomial baselines (e.g., Powerformer/BPMax) by 6–10% on large-class tasks.
- Total runtime for a softmax drops from 105.74s (baseline) to 3.06s with MGF-softmax, with timing breakdowns attributable primarily to the elimination of bootstrapping and heavy multiplication (Park et al., 2 Feb 2026).
These benchmarks confirm the practical viability of MGF-softmax for encrypted inference in real-world, privacy-sensitive domains such as healthcare and finance.
5. Distributional Assumptions, Limitations, and Extensions
The approximation accuracy of MGF-softmax depends on the distribution of the input entries. The principal assumption is i.i.d. entries with approximate normality. For small $n$, empirical moment estimates may bias the normalization, increasing error. Prospective extensions include:
- Incorporating higher-order cumulants to accommodate heavy-tailed distributions (a minimal sketch follows this list).
- Adaptive selection of moments per row to match input statistics.
- Extending the MGF-reformulation paradigm to other normalization layers (e.g., layer-norm).
- Integration with hybrid HE + Multi-Party Computation (MPC) protocols to address low- or worst-case error instances (Park et al., 2 Feb 2026).
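As a purely illustrative instance of the first extension above, a hypothetical variant could append the third-cumulant term of the expansion $K_X(1) \approx \mu + \sigma^2/2 + \kappa_3/6$ to the normalization shift; the function name and the specific correction are assumptions, not from the source:

```python
import numpy as np

def mgf_softmax_k3(x):
    """Hypothetical third-cumulant variant: extends the normality-based shift
    mu + sigma^2/2 + log n with the skewness term kappa_3 / 6 (illustrative)."""
    n, mu = x.size, x.mean()
    c = x - mu
    var, k3 = (c ** 2).mean(), (c ** 3).mean()  # third central moment = kappa_3
    return np.exp(x - (mu + var / 2.0 + k3 / 6.0 + np.log(n)))

rng = np.random.default_rng(3)
x = np.log(rng.gamma(2.0, size=256))            # skewed, non-Gaussian inputs
ref = np.exp(x) / np.exp(x).sum()               # exact softmax reference
print(np.abs(mgf_softmax_k3(x) - ref).max())
```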
These directions suggest MGF-softmax serves as a template for further innovations in encrypted deep learning and normalization method reformulation for constrained arithmetic domains.
6. Context within Broader Softmax Gating Paradigms
While MGF-softmax directly addresses HE efficiency issues, other softmax-gated models—such as softmax gating in mixture-of-experts (MoE) or adaptive fusion in multimodal architectures—exploit the gating function's nonlinearity for interpretability and dynamic adaptation (Yap et al., 23 Nov 2025, Nguyen et al., 2023). MGF-softmax differentiates itself by treating normalization as a functional transformation amenable to moment-based statistical approximation, as opposed to parameterized gating for mixture allocation or adaptive fusion. Notably, shift-invariance and normalization remain consistent across these paradigms, but MGF-softmax is distinctive in adapting the normalization to environments lacking native division or comparison circuits.
7. Research Impact and Ongoing Developments
MGF-softmax marks a significant advance in enabling practical, accurate, and low-latency encrypted inference for deep learning architectures that rely on softmax computations. Its theoretical grounding in classical probability and cumulant expansions, combined with empirical validation on state-of-the-art models, underscores its relevance for privacy-preserving AI. Ongoing research aims to generalize these ideas to broader classes of nonlinearities and normalization schemes, and to refine approximation strategies by leveraging input distribution properties and statistical learning theories (Park et al., 2 Feb 2026).