
Zero-Compute Experts in MoE++

Updated 23 January 2026
  • Zero-compute experts are innovative components in MoE++ that dynamically skip computation for redundant tokens, realizing both data and weight sparsity.
  • They incorporate multiple designs—null, copy, constant, static branches, and neuron-level gating—to minimize FLOPs while maintaining accuracy.
  • Empirical results demonstrate significant speedups and memory savings, with enhanced load balancing and theoretical scaling benefits for scalable deployment.

Zero-compute experts—commonly referenced under the MoE++ designation—are a class of mechanisms for achieving ultra-efficient mixture-of-experts (MoE) architectures by introducing “experts” whose output incurs zero or negligible computation cost for certain tokens. This design realizes data sparsity atop conventional MoE weight sparsity, enabling models to dynamically skip or zero-out computation for uninformative or redundant tokens, or to substitute computational shortcuts with no appreciable accuracy loss. The paradigm is instantiated in several mechanisms: (1) explicit inclusion of trainable null or zero-compute experts in the routing pool (Kilian et al., 21 Jan 2026, Jin et al., 2024), (2) deterministic decomposition of dense MLPs into parallel branches (“static zero-compute experts”) (Novikov, 26 Nov 2025), (3) aggressive expert- or neuron-level pruning and adaptive routing (Muzio et al., 2024, Cheng et al., 7 Oct 2025), and (4) optimized execution kernels that avoid wasted computation for unused experts (Luo et al., 2024). This article provides a comprehensive overview of zero-compute experts in MoE++, with detailed coverage of their architecture, routing and gating, efficiency properties, theoretical analysis, and empirical impact.

1. Architectural Foundations and Variants

Classic MoE architectures consist of $N$ real experts—each a parameterized feed-forward network (FFN)—and a trainable router that assigns each input token to a subset of $K$ experts. MoE++ generalizes this by appending $M$ zero-compute experts to the pool. The principal forms of zero-compute experts are:

  • Null (Zero) Expert: Always outputs zero, serving as a trainable “drop” operation; incurs no FLOPs when selected (Kilian et al., 21 Jan 2026, Jin et al., 2024).
  • Copy (Skip) Expert: Implements $E_\text{copy}(x) = x$; acts as a residual connection, allowing the token to bypass the MoE++ layer (Jin et al., 2024).
  • Constant (Replace) Expert: Outputs a trainable vector or learned mixture of input and a bias; parameter and compute negligible compared to FFNs (Jin et al., 2024).
  • Deterministic Static Branches: Decompose a dense MLP into $E$ parallel branches via simple weight slicing; each branch (“expert”) operates independently, and conditional activation enables sparsity without retraining (Novikov, 26 Nov 2025).
  • Neuron-level Gating: Within an expert, only the top-$K_N$ neurons (by activation magnitude) are applied, skipping the rest with near-zero accuracy loss (Cheng et al., 7 Oct 2025).

The zero-compute experts are distinguished by incurring no forward pass compute or communication overhead, and by their negligible parameter footprint. In the case of deterministic static zero-compute experts, branch outputs are computed as part of a regular dense operation, but selective evaluation and pruning can yield compute reduction (Novikov, 26 Nov 2025).
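The expert variants above can be made concrete in a few lines. The following is an illustrative NumPy mock-up, not the MoE++ implementation: the constant expert's initialization and the FFN dimensions are assumptions made for the sketch.

```python
import numpy as np

def null_expert(x):
    # "Drop" operation: always outputs zero, incurs no FLOPs when selected.
    return np.zeros_like(x)

def copy_expert(x):
    # Identity/skip: the token effectively bypasses the MoE++ layer.
    return x

def make_constant_expert(d_model, rng):
    # Outputs a trainable vector; negligible parameters compared to an FFN.
    v = rng.standard_normal(d_model)  # hypothetical initialization
    def expert(x):
        return np.broadcast_to(v, x.shape)
    return expert

def make_ffn_expert(d_model, d_hidden, rng):
    # A regular parameterized FFN expert, shown for contrast.
    w1 = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
    w2 = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)
    def expert(x):
        return np.maximum(x @ w1, 0.0) @ w2
    return expert
```

The null and copy experts are parameter-free and compute-free; only the FFN expert performs matrix multiplications.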

2. Routing, Gating, and Causality

Zero-compute experts are integrated into MoE routing via softmax-based or pathway-aware gating functions:

  • Token-choice (top-K) routing: Each token $x_t$ independently receives routing logits for $N+M$ experts: $G(x_t) \in \mathbb{R}^N$ for real experts, and $g_0(x_t)$ duplicated $M$ times for null experts. After softmax, the top-$K$ slots are selected; indices $\ge N$ are interpreted as null experts and incur zero compute (Kilian et al., 21 Jan 2026).
  • Output normalization: Real-expert probability weights are renormalized over the subset $\mathcal{S}_t \cap \{1,\ldots,N\}$ to ensure correctness when zero-compute slots are present.
  • Pathway-aware residual gating: In MoE++, expert selection logits at each layer incorporate a residual from the previous layer’s router outputs, promoting more stable and context-aware gating (Jin et al., 2024).
  • Causality: Zero-compute experts preserve autoregressive causality: routing for token $t$ uses only $x_t$; no future tokens or batch-level information are needed. This contrasts with expert-choice routing, which breaks causality in generative models (Kilian et al., 21 Jan 2026).

The high-level effect is that, for each token, the router can flexibly select any number $r \le K$ of real experts (with the remaining slots routed to null/zero-compute experts), yielding data sparsity without train/inference mismatch or causality violations.
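The routing scheme above can be sketched for a single token as follows. This is an illustrative NumPy version under assumed conventions (function name, tie handling, and the empty-selection fallback are this sketch's choices, not the paper's code).

```python
import numpy as np

def route_with_null_experts(logits_real, logit_null, M, K):
    """Token-choice top-K routing over N real experts plus M duplicated
    null-expert slots, renormalizing weights over the real subset.

    logits_real: (N,) router logits for the real experts, one token.
    logit_null:  scalar logit g0(x_t), duplicated M times for null slots.
    Returns (real_indices, weights); only real experts incur compute.
    """
    N = logits_real.shape[0]
    logits = np.concatenate([logits_real, np.full(M, logit_null)])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    topk = np.argsort(-probs)[:K]          # top-K slots among N + M
    real = topk[topk < N]                  # indices >= N are null experts
    if real.size == 0:
        return real, np.zeros(0)           # token skips the layer entirely
    w = probs[real] / probs[real].sum()    # renormalize over real subset
    return real, w
```

When the null logit dominates, the token is routed entirely to zero-compute slots and the layer contributes nothing beyond the residual path.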

3. Load Balancing and Regularization

To prevent degenerate routing (collapse onto a few experts or onto nulls alone), MoE++ employs expanded load balancing and regularization objectives:

  • Load-balance loss: For slot $i \in \{1,\ldots,N+M\}$, let $f_i$ be the fraction of tokens routed to slot $i$ and $P_i$ the mean router assignment; the auxiliary loss is $(N+M)\sum_{i=1}^{N+M} f_i\,P_i$, encouraging uniform use across real and zero-compute experts (Kilian et al., 21 Jan 2026).
  • Z-loss for router stability: Regularizes the softmax normalizer via $\frac{1}{T}\sum_{t=1}^{T}\log^2\!\left(\sum_{i=1}^{N+M}\exp(\widetilde{G}(x_t)_i)\right)$ (Kilian et al., 21 Jan 2026).
  • Expert- and type-specific weighting: For each expert type (FFN or zero-compute), balance terms are weighted by a hyperparameter $\eta_i$ (Jin et al., 2024).
  • Entropy-based fine-tuning: Post-hoc pruning frameworks use entropy regularization to make routing distributions peaky, further concentrating activation onto a small subset of experts and pushing the rest to zero-compute (Muzio et al., 2024).

Capacity constraints (max tokens per expert) are also type- and device-aware, supporting deployment in distributed or heterogeneous environments (Jin et al., 2024).
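The load-balance and z-loss terms can be written out directly from the formulas above. This NumPy sketch assumes simple top-K token-choice routing and is illustrative only (in particular, counting each of a token's K selections toward $f_i$ is one common convention, not necessarily the paper's).

```python
import numpy as np

def moe_aux_losses(logits, K):
    """Load-balance and z-losses over S = N + M slots (illustrative sketch).

    logits: (T, S) router logits for T tokens; top-K token choice assumed.
    """
    T, S = logits.shape
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    topk = np.argsort(-probs, axis=1)[:, :K]
    # f_i: fraction of tokens whose top-K set contains slot i
    f = np.bincount(topk.ravel(), minlength=S) / T
    # P_i: mean router probability mass assigned to slot i
    P = probs.mean(axis=0)
    load_balance = S * float(np.sum(f * P))
    # z-loss: mean squared log of the softmax normalizer (raw logits;
    # assumes logits are small enough that exp does not overflow)
    z_loss = float(np.mean(np.log(np.exp(logits).sum(axis=1)) ** 2))
    return load_balance, z_loss
```

With perfectly uniform routing, the load-balance term evaluates to $K$, its minimum under the convention used here.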

4. Computational Efficiency and Theoretical Properties

Zero-compute experts fundamentally expand the MoE design space, enabling both weight (per-token) and data (per-expert) sparsity. Expected compute per token is

$\mathbb{E}[\mathrm{FLOPs}] = F_\mathrm{shared} + (K_{\max}\,\rho)\,F_\mathrm{expert}$

where $\rho$ is the expected fraction of top-$K$ slots assigned to real experts, and $F_\mathrm{shared}$ is the dense FFN component (Kilian et al., 21 Jan 2026). Varying $(K_{\max}, \rho)$ under fixed $\mathbb{E}[R_t]=K_{\max}\,\rho$ strictly expands the efficiency–loss Pareto frontier relative to weight-only sparse models.
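A small worked example of this identity, with made-up per-component costs (the FLOP figures below are hypothetical, chosen only to show the arithmetic):

```python
# Hypothetical per-token costs for the worked example.
F_shared = 2.0e9   # FLOPs of the shared/dense component (assumed)
F_expert = 0.5e9   # FLOPs of one real FFN expert (assumed)

def expected_flops(K_max, rho):
    # rho: expected fraction of the K_max top slots routed to real experts
    return F_shared + K_max * rho * F_expert

dense_like = expected_flops(K_max=8, rho=1.0)   # every slot hits a real expert
sparse = expected_flops(K_max=8, rho=0.5)       # half the slots are zero-compute
```

Halving $\rho$ at fixed $K_{\max}$ halves the expert-side compute while leaving the shared component untouched.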

High-rate quantization theory formalizes the approximation and estimation error tradeoff for pure zero-compute MoEs (where each expert is a constant piecewise regressor assigned to a region $R_i$):

  • Approximation error decays as $C_d\,\exp(2h(X)/d)\,N^{-2/d}$, where $h(X)$ is differential entropy and $C_d$ depends on smoothness (Dar, 3 Oct 2025).
  • The optimal number of experts scales as $N^* \sim n^{d/(d+2)}$ for nonparametric regression, balancing approximation error against sample scarcity.
  • Learned experts can be pruned or adaptively allocated based on heavy-hitter statistics without adverse impact on global test error up to moderate sparsity levels (Dar, 3 Oct 2025, Muzio et al., 2024).
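The $N^* \sim n^{d/(d+2)}$ scaling can be illustrated numerically; the prefactor is problem-dependent and set to 1 here purely for illustration.

```python
def optimal_num_experts(n, d, c=1.0):
    """Scaling of the optimal expert count with sample size n and input
    dimension d; c stands in for the unspecified problem-dependent constant."""
    return c * n ** (d / (d + 2))

# d = 2: N* grows like n^(1/2); d = 8: like n^(4/5) --
# higher-dimensional inputs justify many more constant experts.
```
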

5. Empirical Impact and Practical Deployment

Empirical studies across multiple workstreams demonstrate consistent improvements in both efficiency and downstream performance:

  • Vision-language modeling: MoE++ with $K_{\max}=8$, $\rho=0.5$ outperforms a 2B-dense backbone and competes with much larger, more complex baselines on OCR and counting tasks (Kilian et al., 21 Jan 2026).
  • Throughput and accuracy: MoE++ yields 1.1–2.1$\times$ speedups over conventional MoE, with matched or improved accuracy on language understanding benchmarks (Jin et al., 2024).
  • Modality-aware compute allocation: The model autonomously routes “simple” or low-information tokens (e.g., vision tokens or punctuation) to zero experts, concentrating compute on denser, more semantically loaded tokens, without explicit modality labels (Kilian et al., 21 Jan 2026).
  • Neuron-level sparsity: Pruning up to 60% of neurons via activation magnitude yields negligible degradation and brings further compute savings (Cheng et al., 7 Oct 2025).
  • Static conversion and pruning: Deterministic slicing of MLPs into static zero-compute experts maintains <0.1% proxy perplexity change; up to 20% parameter sparsity is possible with minimal loss (Novikov, 26 Nov 2025).
  • Efficient execution kernels: In-place, dispatch-free operators for expert computation achieve up to 48% memory savings and a 4.3$\times$ speedup by ensuring no redundant calculation for zero-compute experts (Luo et al., 2024).
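The neuron-level sparsity result above corresponds to a simple masking scheme: within one FFN expert, keep only the top-$K_N$ hidden activations by magnitude. A hedged NumPy sketch follows (ReLU is assumed as the activation; the cited work's actual activation and masking details may differ, and a real kernel would skip the masked columns rather than zero them).

```python
import numpy as np

def neuron_gated_ffn(x, w1, w2, k_n):
    """FFN forward pass keeping only the k_n largest-magnitude hidden
    activations per token; the rest are zeroed (i.e., skippable)."""
    h = np.maximum(x @ w1, 0.0)                      # hidden activations (ReLU)
    if k_n < h.shape[1]:
        # indices of the activations OUTSIDE the top-k_n, per token
        drop = np.argsort(-np.abs(h), axis=1)[:, k_n:]
        np.put_along_axis(h, drop, 0.0, axis=1)      # zero them out
    return h @ w2                                    # only survivors contribute
```

With `k_n` equal to the hidden width this reduces to the dense FFN, so the sparsification is a strict relaxation of the original computation.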

6. Limitations, Failure Modes, and Extensions

Despite strong performance, several limitations are recognized:

  • Resolution collapse: At very low $\rho$, the router’s softmax may saturate over the large null-expert block, degrading expert discrimination (“softmax resolution collapse”) (Kilian et al., 21 Jan 2026).
  • Expert redundancy and underutilization: In pathological cases, load balancing may not fully prevent collapse onto a small subset of experts or nulls, especially under dataset shift (Muzio et al., 2024).
  • Deployment challenges: Gating networks still incur parameter and compute overhead unless fully pruned; expert sharding and load balancing across heterogeneous hardware remain complex (Jin et al., 2024, Luo et al., 2024).
  • Generalization to attention or depth: While current work focuses on FFN MoE, extending zero-compute sparsity to attention heads (“Mixture-of-Heads”) or transformer layers (“Mixture-of-Depths”) is an open direction (Kilian et al., 21 Jan 2026).

Proposed solutions include two-stage gating, alternative regularizers (e.g., Dirichlet priors), dynamic expert configuration per layer/task, and application to larger model and data regimes (Kilian et al., 21 Jan 2026, Jin et al., 2024).

7. Research Directions and Theoretical Integration

The integration of zero-compute experts into conditional-compute architectures prompts several research threads:

  • Unified theoretical modeling: MoE++ architectures serve as a bridge between classical quantization, nonparametric estimation, and neural gating, providing quantitative scaling laws for error vs. sparsity (Dar, 3 Oct 2025).
  • Training-free architectural metamorphosis: Post hoc conversion of dense models into static zero-compute branchings (MLPMoE) enables immediate deployment of sparse, modular inference pipelines without retraining or calibration (Novikov, 26 Nov 2025).
  • System-level optimization: Kernel and deployment-level innovations, including expert-specific operators and heterogeneity-aware workload allocation, remove traditional MoE inefficiencies, supporting scalable and flexible training/inference (Luo et al., 2024).
  • Neuron-granular gating: The Mixture-of-Neuron-Experts formulation reinterprets deep MoE as emergent from fine-grained activation sparsity, providing an avenue for native zero-compute execution at the sub-expert level (Cheng et al., 7 Oct 2025).
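The training-free MLPMoE conversion mentioned above relies on the fact that an elementwise activation lets the hidden dimension of a dense MLP be sliced exactly into independent branches. A minimal sketch under that assumption (function names are hypothetical; the cited work's slicing strategy may differ):

```python
import numpy as np

def slice_mlp_into_branches(w1, w2, E):
    """Split a dense MLP (w1: d x h, w2: h x d_out) into E static branches
    by slicing the hidden dimension; branch outputs sum to the dense output
    because the activation acts elementwise on hidden units."""
    h = w1.shape[1]
    assert h % E == 0, "sketch assumes the hidden width divides evenly"
    s = h // E
    return [(w1[:, i * s:(i + 1) * s], w2[i * s:(i + 1) * s, :])
            for i in range(E)]

def branch_forward(x, branches, active=None):
    """Evaluate only the branches in `active` (all branches if None)."""
    idx = range(len(branches)) if active is None else active
    out = np.zeros((x.shape[0], branches[0][1].shape[1]))
    for i in idx:
        a, b = branches[i]
        out += np.maximum(x @ a, 0.0) @ b   # per-branch ReLU MLP
    return out
```

Evaluating all branches reproduces the dense MLP exactly (no retraining or calibration), while deactivating branches yields the conditional-compute savings described above.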

As zero-compute experts are increasingly adopted, their role as a scalable, causal, and efficient mechanism for dynamic resource allocation in deep models is solidified. Continued development integrates advances in theoretical understanding, optimization, and hardware-software co-design.
