Cause of MoE Underperformance in Interference Robustness
Determine whether the observed underperformance of Mixture-of-Experts (MoE) large language models, relative to dense transformer models with comparable total parameter counts, on proactive interference–based retrieval tasks is caused by the much smaller number of parameters an MoE model activates at inference than its nominal total. Specifically, ascertain whether the effective activated parameter count is the primary factor behind the lower anti-interference capacity of MoE architectures.
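To make the gap in question concrete, the following is a minimal sketch, under assumed illustrative settings, of how nominal total parameters compare with parameters activated per token in a top-k routed MoE transformer. The configuration values (d_model, d_ff, n_experts, top_k, layer count) and helper functions are hypothetical and are not drawn from the cited paper.

```python
# Minimal sketch: nominal vs. activated parameter counts for a hypothetical
# top-k routed MoE transformer. All sizes below are illustrative assumptions,
# not figures from the cited paper.

def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters in one two-matrix feed-forward expert (biases omitted)."""
    return 2 * d_model * d_ff

def attn_params(d_model: int) -> int:
    """Parameters in one attention block (Q, K, V, O projections, biases omitted)."""
    return 4 * d_model * d_model

def moe_layer_counts(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Return (total, activated) parameter counts for one MoE transformer layer."""
    router = d_model * n_experts  # routing/gating matrix
    total = attn_params(d_model) + router + n_experts * ffn_params(d_model, d_ff)
    activated = attn_params(d_model) + router + top_k * ffn_params(d_model, d_ff)
    return total, activated

if __name__ == "__main__":
    # Hypothetical configuration: 32 layers, 64 experts, top-2 routing.
    d_model, d_ff, n_layers, n_experts, top_k = 4096, 14336, 32, 64, 2
    per_layer = [moe_layer_counts(d_model, d_ff, n_experts, top_k) for _ in range(n_layers)]
    total = sum(t for t, _ in per_layer)
    activated = sum(a for _, a in per_layer)
    print(f"nominal total params : {total / 1e9:.1f}B")
    print(f"activated per token  : {activated / 1e9:.1f}B "
          f"({100 * activated / total:.1f}% of nominal)")
```

Under these assumed settings only a few percent of the nominal parameters participate in any single forward pass, which is the disparity the conjecture attributes the weaker anti-interference behavior to; the open question is whether this activated count, rather than some other property of sparse routing, is the primary cause.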
References
MoE architectures underperform dense models with comparable total parameters (we conjecture that this is because the number of activated parameters in an MoE model is much smaller than its nominal total).
— Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length
(arXiv:2506.08184, Wang et al., 9 Jun 2025), Section 2, subsubsection "Size Over Input Context Length"