
LatentMoE Architecture

Updated 2 February 2026
  • LatentMoE is a Mixture-of-Experts design that uses a low-dimensional latent bottleneck to minimize memory and communication overhead in large-scale language models.
  • The architecture employs a shared gating network and latent projections to route tokens efficiently, achieving significant gains in accuracy per FLOP and per parameter.
  • It enables scalable expert routing for both latency-critical and high-throughput modes, reinvesting saved resources to expand expert diversity without sacrificing performance.

LatentMoE is a Mixture-of-Experts (MoE) variant developed to address both hardware and model efficiency bottlenecks, particularly in large-scale LLMs such as NVIDIA’s Nemotron 3 family. The LatentMoE architecture employs a low-dimensional latent bottleneck within each sparse expert layer, enabling scaling of expert count and fan-out by leveraging hardware-friendly reductions in routed tensor size. This systematic compression and reinvestment strategy establishes new Pareto frontiers in accuracy per floating-point operation (FLOP) and accuracy per parameter, with empirical validation at scales up to the trillion-parameter regime (NVIDIA et al., 24 Dec 2025, Elango et al., 26 Jan 2026).

1. Architectural Design and Integration

LatentMoE is implemented as a direct replacement for traditional MoE-FFN sublayers, notably in hybrid Mamba–Transformer stacks as used in Nemotron 3 Super and Ultra. Each LatentMoE layer consists of the following sequential stages:

  • Input activations X \in \mathbb{R}^{B \times d} are received from a preceding Mamba-2 or attention sublayer.
  • A shared gating network computes per-token router logits in the full space d, yielding a softmax distribution G \in \mathbb{R}^{B \times N} over N experts:

G = \mathrm{softmax}(XW_g + b_g)

  • For each token b, the Top-K routing mask is derived by zeroing all but the largest K entries in row G_{b,:}, producing a masked gating matrix \widetilde{G}.
  • The d-dimensional inputs are projected into a latent space of dimension \ell < d:

H = XW_{\mathrm{lat}}

with W_{\mathrm{lat}} \in \mathbb{R}^{d \times \ell}.

  • Each latent vector H_{b,:} is dispatched to its Top-K experts, which compute per-expert FFN transforms entirely in the latent space:

E_i(H_{b,:}) = \phi\!\left(H_{b,:}W_{i}^{(1)} + b_{i}^{(1)}\right) W_{i}^{(2)} + b_{i}^{(2)}

with W_{i}^{(1)} \in \mathbb{R}^{\ell \times m}, W_{i}^{(2)} \in \mathbb{R}^{m \times \ell}.

  • Expert outputs are aggregated via the masked gating weights:

Y_{\mathrm{lat}, b, :} = \sum_{i=1}^N \widetilde{G}_{b, i}\, E_i(H_{b,:})

  • The aggregated latent output Y_{\mathrm{lat}} is projected back to the full hidden dimension and residual-connected:

Y = Y_{\mathrm{lat}} W_{\mathrm{out}} + b_{\mathrm{out}}

where W_{\mathrm{out}} \in \mathbb{R}^{\ell \times d}, and X + Y is passed to the next layer (NVIDIA et al., 24 Dec 2025, Elango et al., 26 Jan 2026).

This design confines all expert computation and communication to a low-rank latent subspace, while non-expert operations (gating, normalization, residual paths) remain in the full d-dimensional space.
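The sequential stages above can be sketched end to end in NumPy. This is an illustrative toy implementation under assumed (tiny) dimensions, with a dense loop over experts for clarity — production systems dispatch only the routed tokens — and a ReLU standing in for the unspecified activation \phi:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed; far smaller than production values)
B, d, ell, m, N, K = 4, 32, 8, 16, 8, 2

# Parameters
Wg, bg = rng.normal(size=(d, N)) * 0.1, np.zeros(N)          # shared gate (full dim d)
W_lat = rng.normal(size=(d, ell)) * 0.1                      # down-projection d -> ell
W1 = rng.normal(size=(N, ell, m)) * 0.1                      # expert FFN, layer 1
b1 = np.zeros((N, m))
W2 = rng.normal(size=(N, m, ell)) * 0.1                      # expert FFN, layer 2
b2 = np.zeros((N, ell))
W_out, b_out = rng.normal(size=(ell, d)) * 0.1, np.zeros(d)  # up-projection ell -> d

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def latent_moe(X):
    # 1. Gating in the full d-dimensional space
    G = softmax(X @ Wg + bg)                                 # (B, N)
    # 2. Top-K mask: zero all but the K largest gates per token
    topk = np.argpartition(G, -K, axis=1)[:, -K:]
    G_mask = np.zeros_like(G)
    np.put_along_axis(G_mask, topk, np.take_along_axis(G, topk, axis=1), axis=1)
    # 3. Project tokens into the latent space (this is what gets routed)
    H = X @ W_lat                                            # (B, ell)
    # 4. Expert FFNs run entirely in the latent space
    Y_lat = np.zeros((X.shape[0], ell))
    for i in range(N):
        E_i = np.maximum(H @ W1[i] + b1[i], 0.0) @ W2[i] + b2[i]
        Y_lat += G_mask[:, i:i + 1] * E_i
    # 5. Project back up to d dimensions and add the residual
    return X + (Y_lat @ W_out + b_out)

X = rng.normal(size=(B, d))
Y = latent_moe(X)
print(Y.shape)  # (4, 32)
```

Note that only the (B, ell) tensor H would cross device boundaries in a distributed setting; the gate, projections, and residual stay local in the full dimension.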

2. Mathematical Formulation of Gating, Routing, and Losses

The LatentMoE routing mechanism adheres to canonical MoE principles but adapts them for the latent setup:

  • Gating: The softmax router operates in d dimensions, outputting

g(x) = \mathrm{softmax}(W_g x + b_g) \in \mathbb{R}^N

  • Top-K Routing: The routing mask \widetilde{g}(x) is defined as

\widetilde g_i(x) = \begin{cases} g_i(x), & i \in \operatorname{TopK}(g(x)) \\ 0, & \text{otherwise} \end{cases}

  • Tokens are mapped into the latent space before being sent to experts, decoupling the communication cost from d.
  • Load balancing: To ensure equitable expert utilization, the average load and importance of each expert are monitored over a batch. The standard MoE load-balancing loss remains:

\mathcal{L}_{\mathrm{load}} = \lambda_{\mathrm{load}}\, N \sum_{j=1}^N P_j^2 \approx \lambda_{\mathrm{load}} \sum_{j=1}^N (P_j - 1/N)^2

where P_j = \frac{1}{B} \sum_{b=1}^B \widetilde{g}_j(x_b) is the average expert load (NVIDIA et al., 24 Dec 2025).

Total loss is the sum of cross-entropy task loss and auxiliary load-balancing loss (Elango et al., 26 Jan 2026).
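The load-balancing term above is a few lines of code. A minimal sketch, assuming the masked gating matrix from the forward pass; the coefficient value is a placeholder, not one reported in the papers:

```python
import numpy as np

def load_balance_loss(G_masked, lam):
    """Auxiliary load-balancing loss from a masked gating matrix.

    G_masked: (B, N) array of gate weights, zero outside each token's Top-K.
    lam: load-balancing coefficient lambda_load (hyperparameter; value assumed).
    """
    P = G_masked.mean(axis=0)            # average load per expert, shape (N,)
    N = P.shape[0]
    return lam * N * np.sum(P ** 2)      # minimized when every P_j = 1/N

# Perfectly balanced gates (each of N=4 experts carrying load 1/4) attain the
# minimum value lam * N * N * (1/N)^2 = lam:
balanced = np.full((8, 4), 0.25)
print(load_balance_loss(balanced, lam=0.01))  # 0.01
```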

3. Hardware-Efficient Scaling and Memory–Bandwidth Analysis

LatentMoE was motivated by analysis revealing two dominant bottlenecks for large-scale MoE models:

  • Memory-bandwidth-bound kernel execution: In low-batch regimes, the limiting factor is the volume of weight reads (size d \times m per expert in standard MoE). LatentMoE projects routed tensors to \ell \ll d, reducing per-expert memory and bandwidth demands by a factor of d/\ell.
  • All-to-all communication bottleneck: In high-throughput (large-batch) settings, distributed MoE layers require synchronization of expert inputs. Reducing the routed tensor size from d to \ell linearly decreases the communication payload per token.

By projecting tokens to \ell dimensions before dispatch and projecting outputs back after expert aggregation, the system achieves substantial reductions in bandwidth and communication volume. The freed memory and compute resources are reinvested to scale the total expert pool (N' = N \cdot d/\ell) and optionally the active fan-out (K' = K \cdot d/\ell), while holding total FLOP and communication cost constant (NVIDIA et al., 24 Dec 2025, Elango et al., 26 Jan 2026).
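The reinvestment arithmetic can be checked directly with the configuration values reported in Section 5:

```python
# Reinvestment rule N' = N * d/ell, K' = K * d/ell, using the Section 5 shapes.
d, ell = 4096, 1024
N, K = 128, 6

alpha = d // ell                 # compression factor: 4
N_prime = N * alpha              # expanded expert pool
K_prime = K * alpha              # expanded active fan-out

# Per-token routed payload (elements crossing the all-to-all):
baseline = K * d                 # standard MoE: K vectors of size d
latent = K_prime * ell           # LatentMoE: K' vectors of size ell
assert baseline == latent        # communication cost held constant

print(N_prime, K_prime)          # 512 24
```

With the payload per token unchanged, the model gains a 4x larger expert pool and 4x the active experts per token, matching the LatentMoE row of the table in Section 5.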

4. Theoretical Properties: Nonlinear Capacity and Expressivity

LatentMoE’s scaling strategy is supported by function approximation theory:

  • Nonlinear capacity: The effective nonlinear capacity per token remains \sim K \cdot m, where K is the fan-out and m is the expert FFN width. Keeping m and K constant during dimension reduction preserves approximation power; scaling them further increases expressivity (Elango et al., 26 Jan 2026).
  • Combinatorial sparsity: The number of possible expert combinations per token grows as \binom{N}{K}; scaling both N and K by \alpha = d/\ell gives \binom{\alpha N}{\alpha K} \ge \binom{N}{K}^{\alpha}, an exponential increase in specialization and local diversity.

The up- and down-projections to and from the latent space cost 2d\ell FLOPs per token. For typical choices (e.g., \ell = d/4), this constitutes <9\% overhead relative to the baseline computation.
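The combinatorial inequality can be spot-checked numerically for small values (a sanity check, not a proof):

```python
import math

# Check binom(alpha*N, alpha*K) >= binom(N, K)**alpha for illustrative values.
N, K, alpha = 8, 2, 4
lhs = math.comb(alpha * N, alpha * K)   # binom(32, 8)
rhs = math.comb(N, K) ** alpha          # binom(8, 2)**4
print(lhs, rhs, lhs >= rhs)             # 10518300 614656 True
```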

5. Empirical Performance and Hyperparameter Choices

LatentMoE has been validated at scales from \sim 16B to 95B total parameters (8B active):

| Model Type      | d    | m     | N   | K  | \ell | Active Params | MMLU-Pro | MMLU  | Code  | Math  | Commonsense |
|-----------------|------|-------|-----|----|------|---------------|----------|-------|-------|-------|-------------|
| Standard MoE    | 4096 | 16384 | 128 | 6  | —    | 8.09B         | 48.30    | 70.10 | 51.95 | 78.32 | 81.73       |
| LatentMoE (acc) | 4096 | 16384 | 512 | 24 | 1024 | 8.02B         | 52.87    | 72.11 | 55.14 | 80.19 | 82.10       |

Nemotron 3 Super and Ultra use \ell = 1024 (i.e., \ell = 0.25d) with expert count and fan-out expanded by d/\ell = 4, yielding per-token routing to up to 24 experts. All accuracy improvements occur with no measured degradation in inference throughput or per-token latency. Across five downstream evaluation categories, LatentMoE achieves a 2–4 point absolute gain in accuracy relative to standard MoE (NVIDIA et al., 24 Dec 2025, Elango et al., 26 Jan 2026).

6. Comparison with Factorized MoE Variants

LatentMoE is distinct from earlier factorized MoE variants such as Mixture of Latent Experts (MoLAE, also “MoLE” (Liu et al., 29 Mar 2025)), though both exploit bottlenecked expert computation. MoLAE performs factorization of expert parameters via a latent intermediate and allows conversion of pretrained MoE weights through SVD-based schemes, targeting parameter efficiency. LatentMoE targets bandwidth and communication cost, using the saved resources not for parameter minimization but for scaling up the total expert count and fan-out, thereby expanding network expressivity at fixed hardware budget.

While other approaches (e.g., MoLAE) may reduce parameter count by O(m/n + 1/k) via factorization, LatentMoE adopts a hardware-aware focus: dimension reduction serves as an enabling lever for scaling expert diversity and accuracy per byte, rather than for model shrinkage (Liu et al., 29 Mar 2025, Elango et al., 26 Jan 2026).

7. Deployment Regimes, Limitations, and Future Directions

LatentMoE supports two principal operational regimes:

  • Online, latency-critical inference (small batch): memory bandwidth dominates; reducing the routed tensor dimension from d to \ell directly improves step time.
  • Offline, high-throughput serving (large batch): all-to-all communication volume dominates; sending \ell-dimensional embeddings minimizes the network bottleneck.
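For the latency-critical regime, a rough weight-read estimate illustrates why the latent bottleneck helps. This sketch assumes bf16 storage (2 bytes/element) and the Section 5 shapes; it counts only the two FFN matrices per expert:

```python
# Per-expert weight-read volume in the memory-bandwidth-bound regime,
# assuming bf16 (2 bytes/element) and the Section 5 shapes. Illustrative only.
BYTES = 2
d, ell, m = 4096, 1024, 16384

std_expert = 2 * d * m * BYTES        # standard MoE: two d x m matrices
latent_expert = 2 * ell * m * BYTES   # LatentMoE: two ell x m matrices

print(std_expert // latent_expert)    # 4, i.e. the d/ell reduction of Section 3
```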

Latency, memory, and compute costs are minimized without sacrificing task accuracy. The architecture integrates transparently into hybrid blocks with Mamba-2 and attention, requiring no departure from standard MoE training recipes (e.g., NoisyTopK routing, load balancing, familiar optimizer and numerics).

Potential limitations include dependence on the selection of \ell (too small an \ell may bottleneck expressivity given intrinsic feature rank), and on hardware-specific communication patterns. Open research directions involve dynamic or input-adaptive latent bottleneck selection, further extension of latent routing to non-FFN subnetworks (attention, normalization), and investigation of non-linear bottleneck projections (Elango et al., 26 Jan 2026, Liu et al., 29 Mar 2025).
