Cause of MoE Underperformance in Interference Robustness

Determine whether the observed underperformance of Mixture-of-Experts (MoE) large language models, relative to dense transformer models with comparable total parameter counts, on proactive interference–based retrieval tasks is caused by the small number of parameters activated during inference compared to the model’s nominal total. Specifically, ascertain whether the effective activated parameter count is the primary factor explaining the lower anti-interference capacity of MoE architectures.

Background

The paper introduces the Interference Endurance Score (IES) to quantify LLMs’ robustness to proactive interference in key–value retrieval tasks. Regression analyses indicate that model parameter size is a significant predictor of IES, while context length is not.

When comparing architectures, the authors report that Mixture-of-Experts (MoE) models consistently match or underperform dense models with similar nominal parameter counts, and often perform comparably to much smaller dense models. They conjecture that this gap may be due to MoE models activating far fewer parameters during inference than their nominal totals, potentially limiting interference resistance.
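The gap between nominal and activated parameters that this conjecture invokes can be made concrete with a quick parameter count. The sketch below uses a hypothetical top-k routed MoE FFN configuration (the dimensions and expert counts are illustrative assumptions, not taken from the paper or any specific model):

```python
# Illustrative arithmetic for a hypothetical top-k MoE FFN layer:
# compare its nominal (total) vs. activated (per-token) parameter counts.

def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Return (nominal, activated) parameter counts for one MoE FFN layer.

    Each expert is a two-matrix FFN (d_model x d_ff up-projection,
    d_ff x d_model down-projection); a linear router scores all experts
    but only the top_k experts run for each token.
    """
    per_expert = 2 * d_model * d_ff          # up- and down-projection weights
    router = d_model * n_experts             # linear routing layer
    nominal = n_experts * per_expert + router
    activated = top_k * per_expert + router  # only top_k experts execute
    return nominal, activated

# Hypothetical configuration: 64 experts, 2 routed per token.
nominal, activated = moe_ffn_params(d_model=4096, d_ff=14336,
                                    n_experts=64, top_k=2)
print(f"nominal: {nominal/1e9:.2f}B, activated: {activated/1e9:.2f}B "
      f"({activated/nominal:.1%} of nominal)")
```

Under these assumed dimensions, only a few percent of the layer's nominal parameters participate in any single forward pass, which is the disparity the authors suggest may limit interference resistance.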

This conjecture highlights an unresolved causal mechanism: whether the effective activated parameter count, rather than nominal total parameters, is the principal determinant of MoE models’ anti-interference performance.

References

MoE architectures underperform dense models with comparable total parameters (we conjecture that this is because the number of activated parameters in an MoE model is much smaller than its nominal total).

Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length (2506.08184 - Wang et al., 9 Jun 2025), Section 2, Subsubsection "Size Over Input Context Length"