
Compressed Experts in Neural Networks

Updated 18 February 2026
  • Compressed experts are techniques that replace full-capacity MoE experts with compact, learnable surrogates, effectively reducing memory and compute costs.
  • They use a main/auxiliary design where auxiliary experts are aggregated via learned embeddings, achieving lower latency and significant parameter savings.
  • Applications include language models, ASR, and neural data compression, with reported reductions in active parameters up to 33.8% and improvements in inference speed.

Compressed experts represent a class of techniques and architectural modules for reducing the memory and compute overhead of Mixture-of-Experts (MoE) neural networks, particularly when scaling large models or deploying on resource-constrained hardware. The essential idea is to replace full-capacity experts—usually parameter-rich feed-forward networks—with compact, learnable surrogates that capture most of the functional capacity of the original expert or an aggregate of multiple auxiliary experts. Recent research has demonstrated that compressed experts enable significant reductions in active parameter count and inference cost, often recovering the vast majority of the performance of the original MoE model. This approach applies to LLMs, speech recognition, and implicit neural representation schemes, among other domains (He et al., 1 Mar 2025, Zhao et al., 2024, Zhao et al., 2023).

1. Background: The MoE Paradigm and Motivation for Compression

MoE models augment standard neural architectures (e.g., Transformers) by replacing certain layers—typically the feed-forward subcomponents—with parallel "experts." For each input token or coordinate, a router (gating network) selects a small subset (top‑k) of experts to activate, resulting in sublinear scaling of compute relative to parameter count. In practice, only a fraction of selected experts drive performance on a given sample, especially after downstream fine-tuning. Empirical studies reveal that many experts contribute negligibly or redundantly, prompting the introduction of compressed experts to maintain accuracy while reducing the number of active parameters and forward passes (He et al., 1 Mar 2025).
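The top‑k routing step described above can be sketched in NumPy. This is a minimal illustration, not any paper's reference implementation; the router here is a single linear projection followed by softmax, and all names are illustrative:

```python
import numpy as np

def topk_route(h, W_router, k=2):
    """Select the top-k experts for one token and return their
    (renormalized) routing weights. Names are illustrative."""
    logits = h @ W_router                       # (num_experts,)
    exp = np.exp(logits - logits.max())         # numerically stable softmax
    probs = exp / exp.sum()
    topk = np.argsort(probs)[::-1][:k]          # indices of the k largest weights
    weights = probs[topk] / probs[topk].sum()   # renormalize over the selection
    return topk, weights

rng = np.random.default_rng(0)
h = rng.standard_normal(8)                      # toy hidden state, d = 8
W = rng.standard_normal((8, 16))                # toy router over 16 experts
idx, w = topk_route(h, W, k=2)                  # two active experts per token
```

Because only the `k` selected experts run their forward pass, compute scales with `k` rather than with the total expert count, which is the sublinear-scaling property the text refers to.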

2. Construction and Integration of Compressed Experts in MoE Architectures

The most widely studied compressed-expert mechanism operates by splitting the top‑k activated experts for a given token into "main" experts and "auxiliary" experts. The outputs of main experts are retained in full, while the influence of auxiliary experts is represented via a single, learned embedding vector per expert. This procedure unfolds as follows in each MoE layer (He et al., 1 Mar 2025):

  • Compute routing weights α via softmax over the router output for input h, then select top‑k experts indexed {i₁,...,i_k}.
  • Designate the first k_m indices as main experts and the remaining k_a = k−k_m as auxiliary experts.
  • For auxiliary experts, renormalize the routing weights to α′_{i_a} and form an aggregate compressed embedding θ = Σ_{a=k_m+1}^{k} α′_{i_a} θ_{i_a}.
  • Modulate the hidden state elementwise: h̃ = h ⊙ θ.
  • Perform the MoE aggregation only over the main experts: y = Σ_{m=1}^{k_m} α_{i_m} E_{i_m}(h̃).

All embedding vectors θ_i are trained alongside the main FFN experts via the standard supervised loss. This architectural modification yields a layer structure with a parallel bank of compressed-expert embeddings but identical router and selection logic (He et al., 1 Mar 2025).
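The procedure above can be sketched in NumPy. The expert FFNs are stand-in callables and all shapes and names are illustrative, assuming the main/auxiliary split described in the text (not a reference implementation from the cited paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer_compressed(h, router_W, experts, theta, k=4, k_m=2):
    """Main/auxiliary compressed-expert forward pass (sketch).
    experts: list of callables (stand-ins for FFN experts)
    theta:   (num_experts, d) learned compressed-expert embeddings"""
    alpha = softmax(h @ router_W)               # routing weights over all experts
    top = np.argsort(alpha)[::-1][:k]           # top-k expert indices
    main, aux = top[:k_m], top[k_m:]            # split into main / auxiliary
    a_aux = alpha[aux] / alpha[aux].sum()       # renormalized auxiliary weights
    theta_agg = (a_aux[:, None] * theta[aux]).sum(axis=0)  # aggregate embedding
    h_tilde = h * theta_agg                     # elementwise modulation of h
    # only the k_m main experts run a full forward pass
    return sum(alpha[i] * experts[i](h_tilde) for i in main)

d, E = 4, 16
rng = np.random.default_rng(0)
h = rng.standard_normal(d)
router_W = rng.standard_normal((d, E))
experts = [lambda x, s=s: np.tanh(s * x) for s in range(1, E + 1)]  # toy FFNs
theta = np.ones((E, d))                         # all-ones init, as in the text
y = moe_layer_compressed(h, router_W, experts, theta)
```

Note that the auxiliary experts never execute: their entire contribution is the d-dimensional vector `theta_agg`, which is why active parameters and latency drop.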

In LoRA-based models, compressed experts can be realized via low-rank adapters, which enable similar reductions in parameter count with only a minor computational overhead (Zhao et al., 2024).

3. Quantitative Impact on Model Size, Inference Cost, and Task Performance

The introduction of compressed experts leads to substantial reductions in active parameter count and computation, at a modest and well-characterized cost in accuracy:

  • In Phi-MoE (2 out of 16 experts per layer active), using Top‑1 routing augmented with compressed experts for auxiliaries recovers 94.5% of the full Top‑2 performance, with a 22% reduction in wall-clock latency and a 33.8% reduction in active parameters per forward pass (He et al., 1 Mar 2025).
  • In OLMoE (8 out of 64 experts), a similar approach achieves 96.3% performance retention and an 18.4% latency reduction, with a 31.4% reduction in active parameters.
  • For end-to-end ASR, integrating LoRA-based compressed experts in SAML yields a 7× reduction in model size (Whisper base.en from 278 MB FP32 to 38.3 MB quantized) and an additional 25–30% relative word error rate reduction over quantization alone. SAML models achieve near-baseline accuracy with extremely compact parameterization: the extra routing and LoRA parameters add only 1–2 MB (Zhao et al., 2024).

This demonstrates that compressed experts constitute a practical mechanism for scaling MoE models to large expert banks or deploying models on devices with limited memory and compute budgets.

4. Methodologies for Constructing and Training Compressed Experts

For parameterized compressed experts in LLMs (He et al., 1 Mar 2025):

  • Each auxiliary expert is associated with a d-dimensional embedding θ_i, typically initialized to a vector of ones and trained via backpropagation. No separate compression-specific loss is used; the embeddings are updated jointly with the main expert parameters.
  • The split between main and auxiliary experts (k_m vs. k_a) is fixed in current instantiations (typically half–half), although this is not guaranteed to be optimal for all settings.
  • Compressed experts can also be realized via low-rank adaptation modules (e.g., LoRA), where the low-rank update is constructed as ΔW = (α/r)·BA for rank r ≪ d, achieving up to 40× parameter savings for typical values (Zhao et al., 2024).
  • Post-training quantization and, optionally, entropy coding further compress the parameters, with little or no degradation in accuracy or fidelity (Zhao et al., 2024, Zhao et al., 2023).
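The LoRA-style construction in the list above can be made concrete with a short sketch. The sizes below are illustrative (d = 512, r = 8 gives a 32× saving, in the same regime as the "up to 40×" figure), and the zero initialization of B is the standard LoRA convention, not a claim about the cited papers:

```python
import numpy as np

d, r, alpha = 512, 8, 16                 # illustrative sizes; rank r << d
rng = np.random.default_rng(0)
A = 0.01 * rng.standard_normal((r, d))   # trainable down-projection (r x d)
B = np.zeros((d, r))                     # trainable up-projection, zero init
delta_W = (alpha / r) * (B @ A)          # low-rank update: dW = (alpha/r) * B A

full_params = d * d                      # one dense d x d expert weight
lora_params = d * r + r * d              # A and B together
savings = full_params / lora_params      # d / (2r) = 32x at these sizes
```

Because B starts at zero, ΔW is exactly zero at initialization, so adding the adapter leaves the frozen base model's behavior unchanged until training begins.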

For implicit neural representation compression (MoEC) (Zhao et al., 2023):

  • The model replaces hand-crafted partitioning with a learnable router that selects from a bank of compact "local" neural networks (experts), each responsible for a region of the input space.
  • Joint training aligns the router and expert set end-to-end, with load-balancing regularization.
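One common form of load-balancing regularization is the Switch-Transformer-style auxiliary loss, which penalizes mismatch between how often each expert is chosen and how much router probability it receives. This is a sketch of that standard formulation; MoEC's exact regularizer may differ:

```python
import numpy as np

def load_balance_loss(router_probs, expert_assignment, num_experts):
    """Switch-Transformer-style auxiliary loss (one common choice).
    router_probs:      (tokens, num_experts) softmax router outputs
    expert_assignment: (tokens,) index of the expert each token routed to"""
    f = np.bincount(expert_assignment, minlength=num_experts) / len(expert_assignment)
    P = router_probs.mean(axis=0)            # mean router probability per expert
    return num_experts * float((f * P).sum())

# perfectly balanced routing over 4 experts yields the minimum value 1.0
probs = np.full((8, 4), 0.25)
assign = np.array([0, 1, 2, 3, 0, 1, 2, 3])
loss = load_balance_loss(probs, assign, 4)
```

The loss attains its minimum of 1.0 under a perfectly uniform assignment and grows as routing collapses onto a few experts, which keeps the compact expert bank fully utilized.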

5. Expert-Level Compression Strategies: Pruning, Analysis, and Limitations

Expert-level compression encompasses not only embedding-based surrogates but also pruning of less essential experts. Recent work has demonstrated a pronounced heterogeneity in expert importance in LLM MoEs, culminating in the identification of "Super Experts," a tiny subset responsible for outlier activations and critical attention sink formation (Su et al., 31 Jul 2025). Pruning these crucial experts results in catastrophic collapse of model performance (e.g., a 6.9× increase in perplexity and as much as a 93–97% drop in reasoning and coding accuracy in Qwen3-30B-A3B), while random pruning has no such effect. Best practices now dictate profiling for Super Experts via down_proj activation outliers and sparing them from pruning or overly aggressive quantization.
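A profiling pass of the kind described above might look like the following hypothetical sketch: record down_proj activation magnitudes per expert on a calibration set, then flag extreme outliers. The z-score threshold and all names are assumptions for illustration, not the procedure from Su et al.:

```python
import numpy as np

def flag_super_experts(down_proj_acts, z_thresh=6.0):
    """Hypothetical profiling pass: flag experts whose peak down_proj
    activation magnitude is an extreme outlier within the expert bank.
    down_proj_acts: dict expert_id -> array of observed activations"""
    peaks = {e: np.abs(a).max() for e, a in down_proj_acts.items()}
    vals = np.array(list(peaks.values()))
    mu, sigma = vals.mean(), vals.std()
    return [e for e, p in peaks.items() if p > mu + z_thresh * sigma]

# toy calibration data: one expert produces activations ~1000x the rest
acts = {e: np.array([1.0]) for e in range(63)}
acts[63] = np.array([1000.0])
flagged = flag_super_experts(acts)       # experts to spare from pruning
```

Experts flagged this way would then be excluded from pruning and from aggressive quantization, in line with the best practice the text describes.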

While compressed experts are effective for the broad population of nonessential experts, extreme care must be taken with expert-level compression schemes to avoid disrupting the rare but mechanistically critical subpopulation of experts that drive model competence.

6. Applications and Extensions in Neural Data Compression and Speaker Adaptation

Compressed experts are not limited to LLMs. In the context of neural data compression, such as MoEC for 3D biomedical volumes, a compact expert bank with adaptive routing learns an optimal, data-driven partitioning of the input space. The resulting system attains state-of-the-art rate-distortion trade-offs, maintaining high reconstruction fidelity (e.g., PSNR ≥48 dB at 6000× compression) with no manual effort in partition design (Zhao et al., 2023).

For end-to-end ASR, SAML applies compressed LoRA experts to quantized backbone models, allowing for rapid, few-shot test-time adaptation to speaker-specific conditions. Only the small router and expert parameters are updated during adaptation, while the base model remains frozen in low-precision, yielding robust accuracy gains at minimal cost (Zhao et al., 2024).
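The freeze-and-adapt split described above amounts to selecting only router and adapter tensors for gradient updates. The parameter names and key substrings below are illustrative, not SAML's actual module layout:

```python
def trainable_params(named_params, adapter_keys=("router", "lora_A", "lora_B")):
    """Return only the parameters updated during speaker adaptation;
    everything else (the quantized backbone) stays frozen."""
    return {name: p for name, p in named_params.items()
            if any(k in name for k in adapter_keys)}

# toy parameter registry (names and dtypes are illustrative)
params = {
    "encoder.layer0.ffn.weight":    "frozen_int8",  # quantized backbone
    "encoder.layer0.router.weight": "fp32",         # adapted
    "encoder.layer0.lora_A":        "fp32",         # adapted
    "encoder.layer0.lora_B":        "fp32",         # adapted
}
adapt = trainable_params(params)
```

Since only the small router and LoRA tensors are touched, adaptation state per speaker stays in the 1–2 MB range quoted in Section 3, and the low-precision backbone never needs to be rewritten.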

7. Open Problems and Future Directions

Current compressed-expert frameworks fix the main/auxiliary split per layer and use simple additive embedding aggregation. Future work aims to develop adaptive, task-specific policies for allocating main versus compressed expert capacity, richer regularization schemes (including distillation matching full-expert Jacobians), and potential unification with expert pruning and low-rank expert factorization (He et al., 1 Mar 2025, Su et al., 31 Jul 2025).

Mixed-precision schemes that allocate quantization budgets adaptively across experts—especially protecting Super Experts—are underexplored. Further, the possibility of distilling Super Expert behaviors or merging critical features into low-rank surrogates offers a pathway toward expert-level compression with negligible loss of model mechanistic function (Su et al., 31 Jul 2025).

In summary, compressed experts constitute a vital technology for advancing the efficient scaling and deployment of MoE neural architectures, combining aggressive parameter reduction with mechanisms that retain essential model capability across diverse modalities.
