- The paper proposes a comprehensive scaling law for Mixture-of-Experts models by decomposing key factors such as data size, model size, and expert parameters.
- The study identifies that the optimal number of active experts is consistently around 7, providing practical guidance for efficient model deployment.
- The analysis reveals that optimal configurations for sparsity and expert ratios are largely independent of model architecture and data size, offering actionable guidelines for scaling.
"Towards a Comprehensive Scaling Law of Mixture-of-Experts" (2509.23678)
Introduction
Mixture-of-Experts (MoE) models have become a popular route to scaling LLMs, offering parameter-efficient growth and cost-effective deployment. However, standard scaling laws developed for dense models do not transfer to MoE models, primarily due to three key challenges: multiple influencing factors, intricate coupling relationships between these factors, and non-monotonic impacts on performance. This paper aims to establish a comprehensive scaling law tailored to MoE models by accounting for these unique factors.
Methodology
The study systematically analyzes five critical factors affecting MoE models: data size (D), total model size (N), activated model size (N_a), number of active experts (G), and ratio of shared experts (S). Through 446 controlled experiments, the authors decompose these factors to characterize their marginal effects on model performance. The experiments identify the complex relationships among these factors and their impact on loss, forming the basis for a joint MoE scaling law expressed as:
L(N, D, N_a, G, S) = (eG + f/G + mS² + nS) · (1/N^α + k/N_a^α + h/(N·N_a)) + a/N^α + b/D^β + c/N_a^α + ε.
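As a minimal sketch, the law above can be evaluated directly once its coefficients are fitted. The coefficient values below are hypothetical placeholders for illustration, not the paper's fitted values (the paper fits e, f, m, n, k, h, a, b, c, α, β, ε from its 446 controlled runs):

```python
def moe_scaling_loss(N, D, Na, G, S, coef):
    """Evaluate the joint MoE scaling law:
    L = (eG + f/G + mS^2 + nS) * (1/N^alpha + k/Na^alpha + h/(N*Na))
        + a/N^alpha + b/D^beta + c/Na^alpha + eps
    """
    e, f, m, n = coef["e"], coef["f"], coef["m"], coef["n"]
    k, h, a, b, c = coef["k"], coef["h"], coef["a"], coef["b"], coef["c"]
    alpha, beta, eps = coef["alpha"], coef["beta"], coef["eps"]

    expert_term = e * G + f / G + m * S**2 + n * S        # expert-count / sharing factor
    size_term = 1 / N**alpha + k / Na**alpha + h / (N * Na)  # coupled model-size factor
    return expert_term * size_term + a / N**alpha + b / D**beta + c / Na**alpha + eps

# Hypothetical coefficients, for illustration only.
coef = {"e": 0.02, "f": 1.0, "m": 0.1, "n": -0.05, "k": 5.0, "h": 1e3,
        "a": 20.0, "b": 500.0, "c": 10.0, "alpha": 0.3, "beta": 0.3, "eps": 1.5}

# Predicted loss for a 1B-total / 200M-activated model on 100B tokens.
loss = moe_scaling_loss(N=1e9, D=1e11, Na=2e8, G=8, S=0.1, coef=coef)
```

With real fitted coefficients, the same function supports the paper's use case: sweeping configurations (G, S, N_a/N) to predict loss before committing compute.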
Results and Analysis
The proposed scaling law offers precise predictions for MoE model losses and suggests optimal configurations for implementing MoE models. Key findings include:
- Optimal Number of Active Experts (G): The optimal G was found to be approximately 7, independent of both model architecture and data size, aligning with the settings of contemporary MoE models.
- Sparsity in Activation (N_a/N): As N scales up, the optimal activation ratio N_a/N decreases, i.e., activation becomes sparser, supporting the efficient scaling of MoE structures.
- Independence of G and S: Optimal configurations for G and S are independent of model architecture and data size, offering a uniform guideline for MoE model design.
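The independence of the optimal G can be read off the law's structure: the size-dependent bracket is positive and does not involve G, so minimizing over G reduces to minimizing eG + f/G, whose minimizer G* = √(f/e) depends only on the fitted expert coefficients, not on N or D. A sketch with hypothetical coefficients (e and f here are illustrative values chosen so G* lands near 7, not the paper's fits):

```python
import math

# Hypothetical expert-term coefficients (illustrative only).
e, f = 0.02, 1.0

# Closed form: d/dG (e*G + f/G) = e - f/G**2 = 0  =>  G* = sqrt(f/e).
g_star = math.sqrt(f / e)  # independent of N, D, Na

# Numeric cross-check: scan integer expert counts and take the argmin.
g_best = min(range(1, 33), key=lambda G: e * G + f / G)

print(g_star, g_best)  # → 7.0710678118654755 7
```

Because N and D cancel out of the minimization, the same G* holds across model and data scales, which is the uniform-guideline claim in the bullet above.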
Implications and Future Work
The scaling law derived in this paper provides crucial insights for future MoE model design by offering a framework that accurately captures the interactions between model parameters and performance. It is expected to guide the efficient scaling of MoE models in real-world applications and accelerate the development of industry-level LLMs.
Further exploration could validate these scaling laws under larger and diverse MoE architectures and new training objectives, extending beyond current settings. Future work may incorporate factors related to other LLM components, such as attention mechanisms, to develop even more comprehensive guidelines for model scaling.
Conclusion
This study presents a comprehensive MoE scaling law that successfully integrates multiple influencing factors and provides practical insights into configuring MoE models. The proposed scaling law not only advances theoretical understanding but also offers actionable guidance for deploying large-scale, efficient MoE models in practice. As such, it stands to significantly impact the future trajectory of LLM development, particularly in applications demanding both large scale and high efficiency.