
MGF-softmax: HE Softmax Reformulation

Updated 9 February 2026
  • MGF-softmax is a reformulation of the softmax function using the moment generating function to significantly reduce circuit depth and computational overhead in homomorphic encryption.
  • It preserves key properties like shift-invariance and asymptotically converges to standard softmax as input dimensions grow, ensuring representational accuracy.
  • Empirical benchmarks on transformers and language models show near-plaintext accuracy with drastically lower runtime and hardware cost.

MGF-softmax is a reformulation of the softmax function leveraging the moment generating function (MGF), specifically designed to address the computational constraints inherent to privacy-preserving machine learning with homomorphic encryption (HE). Traditional softmax incurs substantial multiplicative depth and circuit complexity under HE, posing critical challenges for efficient encrypted inference, particularly in transformer-based architectures. MGF-softmax replaces the softmax denominator with an MGF-based normalization, drastically reducing multiplicative depth and computational overhead, while asymptotically preserving the representational properties of the standard softmax as the input dimension grows (Park et al., 2 Feb 2026).

1. Mathematical Formulation and Principle

MGF-softmax reinterprets the softmax transformation through the probabilistic lens of the moment generating function. For an input vector $x = (x_1, \ldots, x_n)$, the standard softmax is

$$\sigma(x)_i = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}, \quad i = 1, \dots, n.$$

The denominator, $\sum_{j=1}^n e^{x_j}$, is recast as $n$ times an empirical mean approximating the MGF at $t = 1$ of a random variable $X$ whose samples are $\{x_i\}$. The true (ensemble) mean is $M_X(1) = \mathbb{E}[e^X]$. MGF-softmax uses this as

$$\mathrm{softmax}_{\mathrm{MGF}}(x)_i = \frac{e^{x_i}}{n\, M_X(1)}.$$

Equivalently, the cumulant generating function $K_X(1) = \ln M_X(1)$ provides a normalization shift, yielding

$$\mathrm{softmax}_{\mathrm{MGF}}(x)_i = \exp\left(x_i - K_X(1) - \ln n\right).$$

This replacement transforms the normalization step from a sum of exponentials into a moment-based scalar, which is inherently smoother and more tractable for the polynomial approximations compatible with HE schemes (Park et al., 2 Feb 2026).
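The two normalizations can be compared directly in plaintext. A minimal sketch, assuming roughly Gaussian inputs so that $K_X(1)$ can be estimated as $\mu + \tfrac{1}{2}\sigma^2$ (function names are illustrative, not from the paper's implementation):

```python
import math
import random

def softmax(xs):
    # Standard softmax with max-subtraction for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mgf_softmax(xs):
    # Replace the denominator sum_j e^{x_j} with n * M_X(1), where
    # M_X(1) is approximated via the Gaussian cumulant
    # K_X(1) = mu + sigma^2 / 2 (a normality assumption on the inputs).
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    k1 = mu + 0.5 * var  # cumulant generating function at t = 1
    return [math.exp(x - k1 - math.log(n)) for x in xs]

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(512)]
print(max(abs(a - b) for a, b in zip(softmax(xs), mgf_softmax(xs))))
```

For Gaussian-like inputs at this dimension the two outputs agree closely, while `mgf_softmax` avoids both the division by a data-dependent sum and the max-subtraction step.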

2. Theoretical Properties

Shift-Invariance

MGF-softmax exhibits the same shift-invariance as classical softmax: for any constant $c$, shifting all entries by $c$ leaves the outputs unchanged,

$$\mathrm{softmax}_{\mathrm{MGF}}(x - c) = \mathrm{softmax}_{\mathrm{MGF}}(x).$$

This property, critical for the numerical stability of attention mechanisms in deep learning, is preserved under the MGF-based normalization (Park et al., 2 Feb 2026).
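The invariance can be checked numerically. A small sketch under the Gaussian-cumulant instantiation of the normalization (an assumption; all names are illustrative):

```python
import math

def mgf_softmax(xs):
    # MGF-based normalization exp(x_i - K_X(1) - ln n), with the
    # Gaussian cumulant K_X(1) = mu + sigma^2 / 2 (an assumption).
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return [math.exp(x - mu - 0.5 * var - math.log(n)) for x in xs]

xs = [0.3, -1.2, 2.5, 0.0]
c = 7.0
shifted = mgf_softmax([x - c for x in xs])
# Shifting every entry by c shifts mu by -c and leaves the variance
# unchanged, so x_i - mu cancels the shift and the outputs coincide.
print(all(abs(a - b) < 1e-9 for a, b in zip(mgf_softmax(xs), shifted)))
```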

Asymptotic Convergence

MGF-softmax approximates standard softmax more accurately as $n$ increases. With $Y = e^X$, the central limit theorem ensures that the deviation probability

$$P\left(\left|1 - \frac{\tfrac{1}{n}\sum_i e^{x_i}}{\mathbb{E}[e^X]}\right| \geq \delta\right) \approx 2\left[1 - \Phi\left(\frac{\delta\, \mu_Y \sqrt{n}}{\sigma_Y}\right)\right]$$

vanishes as $n \to \infty$, where $\Phi$ denotes the standard normal CDF and $\mu_Y$, $\sigma_Y$ are the mean and standard deviation of $Y$. Therefore, MGF-softmax is asymptotically equivalent to standard softmax under mild distributional conditions (Park et al., 2 Feb 2026).
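This convergence can be illustrated by comparing the empirical mean of $e^{x_i}$ against $\mathbb{E}[e^X] = e^{1/2}$ for standard-normal inputs at increasing $n$; a plaintext sanity check, not part of the HE pipeline (trial count and dimensions are arbitrary choices):

```python
import math
import random

random.seed(1)

def relative_mgf_error(n, trials=200):
    # Average |1 - (1/n) sum_i e^{x_i} / E[e^X]| over repeated draws,
    # with E[e^X] = exp(1/2) for X ~ N(0, 1).
    true_mgf = math.exp(0.5)
    total = 0.0
    for _ in range(trials):
        xs = [random.gauss(0.0, 1.0) for _ in range(n)]
        emp = sum(math.exp(x) for x in xs) / n
        total += abs(1.0 - emp / true_mgf)
    return total / trials

errors = [relative_mgf_error(n) for n in (16, 256, 4096)]
print(errors)  # expected to shrink roughly like 1/sqrt(n)
```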

Multiplicative Depth Reduction

By eliminating division and max-subtraction, MGF-softmax requires only basic polynomial operations (mean, variance, exponentiation) implementable via addition and multiplication. With $k$ parameterizing the exponential approximation, the circuit depth is reduced from $\geq 8k + 9$ (as in the Chebyshev-based baseline) to $k + 6$, a substantial improvement for encrypted inference, where multiplicative depth is a primary bottleneck (Park et al., 2 Feb 2026).

3. Algorithmic Implementation in Homomorphic Encryption

The implementation of MGF-softmax in the CKKS HE scheme relies on ciphertext packing and efficient rotation patterns to compute row-wise means and variances. Given a matrix $A \in \mathbb{R}^{N_1 \times N_2}$, row-wise statistics are aggregated using slot rotations and additions. The core computational steps are as follows:

  1. Compute the sample mean $\mu$ per row.
  2. Center the inputs and accumulate the variance $\sigma^2$.
  3. Construct the cumulant shift $K_X(1) = \mu + \frac{1}{2}\sigma^2$ and subtract $\ln n$.
  4. Apply a polynomial exponential $\mathsf{AExp}(z) \approx \exp(z)$ via Chebyshev or limit-based approximations, optionally with domain scaling.

This workflow enables a high-throughput, depth-efficient softmax suitable for large-scale inference over encrypted data without requiring costly bootstrapping per arithmetic layer (Park et al., 2 Feb 2026).
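The four steps above can be mirrored in plaintext, with a low-degree polynomial standing in for $\mathsf{AExp}$, since HE circuits permit only addition and multiplication. The Taylor degree and scale-and-square scheme below are illustrative choices, not the paper's exact approximation:

```python
import math

def poly_exp(z, degree=8, scale=4):
    # Approximate exp(z) using only additions and multiplications:
    # evaluate a truncated Taylor series on z / 2^scale, then square
    # the result `scale` times, since exp(z) = exp(z / 2^s) ** (2^s).
    u = z / (2 ** scale)
    acc, term = 1.0, 1.0
    for k in range(1, degree + 1):
        term *= u / k
        acc += term
    for _ in range(scale):
        acc *= acc
    return acc

def mgf_softmax_row(row):
    n = len(row)
    mu = sum(row) / n                           # step 1: sample mean
    var = sum((x - mu) ** 2 for x in row) / n   # step 2: variance
    shift = mu + 0.5 * var + math.log(n)        # step 3: K_X(1) + ln n
    return [poly_exp(x - shift) for x in row]   # step 4: polynomial exp

row = [0.5, -0.3, 1.1, 0.0, -0.8, 0.2, 0.9, -1.5]
out = mgf_softmax_row(row)
print(sum(out))  # close to 1 when the row is roughly Gaussian
```

In the encrypted setting, the mean and variance in steps 1–2 would be computed with slot rotations and additions over packed ciphertexts rather than Python loops; the arithmetic structure is otherwise the same.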

Circuit Complexity Comparison

| Method | Depth | # CMult | # Rot | Bootstraps |
| --- | --- | --- | --- | --- |
| Cho et al. | $\geq 8k + 9$ | $\geq 12k + 58$ | $2k \log_2 n$ | $\geq \lfloor (8k+9)/L \rfloor$ |
| MGF-softmax | $k + 6$ | $k + 10$ | $2 \log_2 n$ | $\lfloor (k+6)/L \rfloor$ |

A plausible implication is that homomorphic inference for transformers and ViTs with MGF-softmax can be realized at vastly reduced runtime and hardware cost compared to prior approaches.

4. Empirical Performance and Applications

MGF-softmax has been experimentally validated on ImageNet-1k with Vision Transformers (ViT/DeiT Tiny and Base) and on LLMs such as LLaMA-3.2-1B across diverse NLP benchmarks (Clinc150, Banking77, SST-2). Results indicate that:

  • For inference depths in the range 7–10, MGF-softmax achieves accuracy within 1% of plaintext models.
  • Low-degree variants outperform alternative polynomial baselines (e.g., Powerformer/BPMax) by 6–10% on large-class tasks.
  • Total runtime for a $256 \times 256$ softmax drops from 105.74 s (baseline) to 3.06 s with MGF-softmax, with the timing breakdown attributable primarily to the elimination of bootstrapping and heavy multiplication (Park et al., 2 Feb 2026).

These benchmarks confirm the practical viability of MGF-softmax for encrypted inference in real-world, privacy-sensitive domains such as healthcare and finance.

5. Distributional Assumptions, Limitations, and Extensions

The approximation accuracy of MGF-softmax depends on the distribution of the input entries. The principal assumption is i.i.d. entries with approximate normality. For small $n$, empirical moment estimates may bias the normalization, increasing error. Prospective extensions include:

  • Incorporating higher-order cumulants to accommodate heavy-tailed distributions.
  • Adaptive selection of moments per row to match input statistics.
  • Extending the MGF-reformulation paradigm to other normalization layers (e.g., layer-norm).
  • Integration with hybrid HE + multi-party computation (MPC) protocols to address low-$n$ or worst-case error instances (Park et al., 2 Feb 2026).

These directions suggest MGF-softmax serves as a template for further innovations in encrypted deep learning and normalization method reformulation for constrained arithmetic domains.

6. Context within Broader Softmax Gating Paradigms

While MGF-softmax directly addresses HE efficiency issues, other softmax-gated models—such as softmax gating in mixture-of-experts (MoE) or adaptive fusion in multimodal architectures—exploit the gating function's nonlinearity for interpretability and dynamic adaptation (Yap et al., 23 Nov 2025, Nguyen et al., 2023). MGF-softmax differentiates itself by treating normalization as a functional transformation amenable to moment-based statistical approximation, as opposed to parameterized gating for mixture allocation or adaptive fusion. Shift-invariance and normalization remain consistent across these paradigms, but MGF-softmax is distinctive in adapting the normalization itself to environments that lack native division or comparison circuits.

7. Research Impact and Ongoing Developments

MGF-softmax marks a significant advance in enabling practical, accurate, and low-latency encrypted inference for deep learning architectures that rely on softmax computations. Its theoretical grounding in classical probability and cumulant expansions, combined with empirical validation on state-of-the-art models, underscores its relevance for privacy-preserving AI. Ongoing research aims to generalize these ideas to broader classes of nonlinearities and normalization schemes, and to refine approximation strategies by leveraging input distribution properties and statistical learning theories (Park et al., 2 Feb 2026).
