Are MoE models inherently easier to interpret than dense FFNs?

Determine whether the architectural sparsity of Mixture-of-Experts transformer language models makes them inherently easier to interpret than dense feed-forward networks, specifically with respect to understanding their internal computations and representations.

Background

Mixture-of-Experts architectures activate only a small subset of experts per token, creating structural sparsity that may reduce representational superposition and polysemanticity. This has led to the hypothesis that MoE models could be easier to interpret than dense feed-forward networks, but whether this is intrinsically true has not been established.

The paper positions this uncertainty as a motivating question for its empirical investigation, comparing MoE experts and dense FFNs via k-sparse probing and related interpretability analyses to assess relative monosemanticity and the clarity of internal computations.
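To make the comparison concrete, the following is a minimal sketch of k-sparse probing in Python: a linear probe is restricted to the k most class-separating units, and probe accuracy at small k serves as a rough proxy for how monosemantically a feature is represented. The function name, the unit-ranking heuristic, and the usage variables (moe_expert_acts, dense_ffn_acts, feature_labels) are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal k-sparse probing sketch, assuming per-token hidden activations
# (e.g., MoE expert outputs or dense FFN neurons) and binary feature labels.
import numpy as np
from sklearn.linear_model import LogisticRegression


def k_sparse_probe_accuracy(activations: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fit a linear probe that may only read the k most class-separating units.

    activations: (n_tokens, n_units) hidden activations from one layer.
    labels:      (n_tokens,) binary labels for the feature of interest.
    k:           number of units the probe is allowed to use.
    """
    # Rank units by absolute difference in mean activation between classes
    # (a simple selection heuristic; mutual information or L1 paths also work).
    mean_pos = activations[labels == 1].mean(axis=0)
    mean_neg = activations[labels == 0].mean(axis=0)
    top_k = np.argsort(-np.abs(mean_pos - mean_neg))[:k]

    # Train/test split so the reported accuracy reflects generalization.
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(labels))
    split = int(0.8 * len(labels))
    train, test = idx[:split], idx[split:]

    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations[train][:, top_k], labels[train])
    return probe.score(activations[test][:, top_k], labels[test])


# Hypothetical usage: if MoE expert activations reach high probe accuracy at
# much smaller k than dense FFN neurons do, that is evidence of lower
# superposition (higher monosemanticity) in the MoE units.
# acc_moe = k_sparse_probe_accuracy(moe_expert_acts, feature_labels, k=1)
# acc_dense = k_sparse_probe_accuracy(dense_ffn_acts, feature_labels, k=1)
```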

References

While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs).

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level (2604.02178, Herbst et al., 2 Apr 2026), Abstract