- The paper proposes a comprehensive scaling law for Mixture-of-Experts models by decomposing key factors such as data size, model size, and expert parameters.
- The study identifies that the optimal number of active experts is consistently around 7, providing practical guidance for efficient model deployment.
- The analysis reveals that optimal configurations for sparsity and expert ratios are largely independent of model architecture and data size, offering actionable guidelines for scaling.
"Towards a Comprehensive Scaling Law of Mixture-of-Experts" (2509.23678)
Introduction
Mixture-of-Experts (MoE) models have become a popular route to scaling LLMs, offering parameter-efficient growth and cost-effective deployment. However, standard scaling laws developed for dense models do not transfer to MoE models, primarily due to three key challenges: multiple influencing factors, intricate coupling relationships between these factors, and non-monotonic impacts on performance. This paper aims to establish a comprehensive scaling law tailored to MoE models by accounting for these unique factors.
Methodology
The study systematically analyzes five critical factors affecting MoE models: data size (D), total model size (N), activated model size (N_a), number of active experts (G), and ratio of shared experts (S). Through 446 controlled experiments, the authors decompose these factors to characterize their marginal effects on model performance. The experiments identify the complex relationships among these factors and their impact on loss, forming the basis for a joint MoE scaling law expressed as:
L(N, D, N_a, G, S) = (eG + f/G + mS² + nS) · (1/N^α + k/N_a^α + h/(N·N_a)) + a/N^α + b/D^β + c/N_a^α + ε.
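As a minimal sketch, the law above can be evaluated directly once its coefficients are fitted. The coefficient values below are hypothetical placeholders for illustration, not the paper's fitted values (the paper fits e, f, m, n, k, h, a, b, c, α, β, ε from its 446 controlled runs):

```python
def moe_scaling_loss(N, D, Na, G, S, coef):
    """Evaluate the joint MoE scaling law:
    L = (eG + f/G + mS^2 + nS) * (1/N^alpha + k/Na^alpha + h/(N*Na))
        + a/N^alpha + b/D^beta + c/Na^alpha + eps
    """
    e, f, m, n = coef["e"], coef["f"], coef["m"], coef["n"]
    k, h, a, b, c = coef["k"], coef["h"], coef["a"], coef["b"], coef["c"]
    alpha, beta, eps = coef["alpha"], coef["beta"], coef["eps"]

    expert_term = e * G + f / G + m * S**2 + n * S        # expert-count / sharing factor
    size_term = 1 / N**alpha + k / Na**alpha + h / (N * Na)  # coupled model-size factor
    return expert_term * size_term + a / N**alpha + b / D**beta + c / Na**alpha + eps

# Hypothetical coefficients, for illustration only.
coef = {"e": 0.02, "f": 1.0, "m": 0.1, "n": -0.05, "k": 5.0, "h": 1e3,
        "a": 20.0, "b": 500.0, "c": 10.0, "alpha": 0.3, "beta": 0.3, "eps": 1.5}

# Predicted loss for a 1B-total / 200M-activated model on 100B tokens.
loss = moe_scaling_loss(N=1e9, D=1e11, Na=2e8, G=8, S=0.1, coef=coef)
```

With real fitted coefficients, the same function supports the paper's use case: sweeping configurations (G, S, N_a/N) to predict loss before committing compute.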
Results and Analysis
The proposed scaling law offers precise predictions for MoE model losses and suggests optimal configurations for implementing MoE models. Key findings include:
- Optimal Number of Active Experts (G): The optimal G was found to be approximately 7, independent of both model architecture and data size, aligning with the settings of contemporary MoE models.
- Sparsity in Activation (N_a/N): As N scales up, the optimal activation ratio N_a/N decreases, i.e., activation becomes sparser, supporting the efficient scaling of MoE structures.
- Independence of G and S: Optimal configurations for G and S are independent of model architecture and data size, offering a uniform guideline for MoE model design.
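The independence of the optimal G can be read off the law's structure: the size-dependent bracket is positive and does not involve G, so minimizing over G reduces to minimizing eG + f/G, whose minimizer G* = √(f/e) depends only on the fitted expert coefficients, not on N or D. A sketch with hypothetical coefficients (e and f here are illustrative values chosen so G* lands near 7, not the paper's fits):

```python
import math

# Hypothetical expert-term coefficients (illustrative only).
e, f = 0.02, 1.0

# Closed form: d/dG (e*G + f/G) = e - f/G**2 = 0  =>  G* = sqrt(f/e).
g_star = math.sqrt(f / e)  # independent of N, D, Na

# Numeric cross-check: scan integer expert counts and take the argmin.
g_best = min(range(1, 33), key=lambda G: e * G + f / G)

print(g_star, g_best)  # → 7.0710678118654755 7
```

Because N and D cancel out of the minimization, the same G* holds across model and data scales, which is the uniform-guideline claim in the bullet above.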
Implications and Future Work
The scaling law derived in this paper provides crucial insights for future MoE model design by offering a framework that accurately captures the interactions between model parameters and performance. It is expected to guide the efficient scaling of MoE models in real-world applications and accelerate the development of industry-level LLMs.
Further exploration could validate these scaling laws under larger and diverse MoE architectures and new training objectives, extending beyond current settings. Future work may incorporate factors related to other LLM components, such as attention mechanisms, to develop even more comprehensive guidelines for model scaling.
Conclusion
This study presents a comprehensive MoE scaling law that successfully integrates multiple influencing factors and provides practical insights into configuring MoE models. The proposed scaling law not only advances theoretical understanding but also offers actionable guidance for deploying large-scale, efficient MoE models in practice. As such, it stands to significantly impact the future trajectory of LLM development, particularly in applications demanding both large scale and high efficiency.