- The paper presents a novel two-level routing mechanism that groups heterogeneous experts to allocate computational resources based on token complexity.
- It employs hardware-aware allocation and auxiliary losses to balance GPU workloads and reduce activated parameters by up to 25%.
- Empirical results demonstrate improved accuracy and inference efficiency across multiple benchmarks and model scales.
Mixture of Heterogeneous Grouped Experts for Language Modeling (2604.23108)
Motivation and Architecture
The Mixture-of-Experts (MoE) framework is widely adopted for scaling LLMs because of its sparsity: only a subset of parameters is activated for each token. Standard MoEs enforce homogeneity (all experts have equal size), which restricts their ability to adapt computational resources to token- or task-level complexity. Recent attempts at heterogeneous expert sizes, such as MoDSE and HMoE, suffer from substantial GPU utilization imbalance and inefficient parameter activation, which limits their scalability in practical deployment.
This paper introduces the Mixture of Heterogeneous Grouped Experts (MoHGE), which organizes experts into groups: experts within a group share the same parameter size, while expert size differs across groups. MoHGE employs a two-level routing mechanism: first, tokens are routed to appropriate groups reflecting task difficulty; then, an expert gating model selects specific experts within those groups. This hierarchical routing, together with auxiliary losses and a hardware-aware allocation strategy, allows MoHGE to match parameter utilization to computational needs while ensuring balanced GPU workloads.
Technical Contributions
Grouped Heterogeneous Experts and Two-Level Routing
MoHGE structures the expert set $\{E_{g,i}\}$ into $N_g$ groups, each with $N$ experts. The hidden dimension $W_i$ of group $G_i$ increases monotonically with the group index, so that larger groups can handle more complex tokens. The group gating model selects the top-$K_g$ relevant groups based on input centroids, and within those groups the expert gating model picks the top-$K_e$ experts, using scores normalized relative to both group and local context. This dual-level gating yields fine-grained, efficient, and flexible expert selection, outperforming traditional flat routing in expressivity and computational alignment.
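The two-level routing described above can be illustrated with a minimal sketch. It is not the paper's implementation: the dot-product scoring against centroids, the product of group and expert probabilities as the final weight, and all names (`two_level_route`, `group_centroids`, `expert_centroids`) are assumptions chosen for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def two_level_route(token, group_centroids, expert_centroids, k_g, k_e):
    """Return a list of (group, expert, weight) for one token.

    group_centroids[g]: hypothetical centroid vector of group g.
    expert_centroids[g][i]: hypothetical centroid vector of expert i in group g.
    """
    # Level 1: score every group, keep the top k_g groups.
    group_probs = softmax([dot(token, c) for c in group_centroids])
    chosen = []
    for g in top_k(group_probs, k_g):
        # Level 2: score only the experts inside a selected group.
        expert_probs = softmax([dot(token, c) for c in expert_centroids[g]])
        for i in top_k(expert_probs, k_e):
            # Assumed combination rule: group prob times within-group prob.
            chosen.append((g, i, group_probs[g] * expert_probs[i]))
    # Renormalize so the weights of the selected experts sum to 1.
    total = sum(w for _, _, w in chosen)
    return [(g, i, w / total) for g, i, w in chosen]

# Toy example: 3 groups with 2 experts each, top-2 groups, top-1 expert per group.
token = [1.0, 0.0]
group_centroids = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
expert_centroids = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [3.0, 0.0]],
]
routes = two_level_route(token, group_centroids, expert_centroids, k_g=2, k_e=1)
for g, i, w in routes:
    print(f"group {g}, expert {i}, weight {w:.3f}")
```

In this sketch the expert gate is evaluated only for the groups that survive the first level, which is the source of the routing cost savings over a flat gate that scores every expert.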
Efficiency and GPU Load Balancing
Larger experts risk dominating routing because of their stronger representational power. To counteract this, MoHGE introduces a Group-Wise Auxiliary Loss that penalizes overuse of large groups, which encourages the use of smaller, parameter-efficient experts for simpler tokens. The All-size Group-decoupling Allocation strategy places the i-th expert of each group on the i-th GPU, so that every GPU holds one expert of each size; this guarantees a uniform parameter count per GPU and prevents bottlenecks. An Intra-Group Experts Auxiliary Loss, adapted from DeepSeekV2, regularizes routing to ensure balanced activation within groups.
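The allocation rule can be checked with a small sketch. The group sizes, the `allocate` helper, and the use of `i % num_gpus` when a group has more experts than there are GPUs are illustrative assumptions, not details taken from the paper.

```python
def allocate(expert_params_per_group, experts_per_group, num_gpus):
    """Place expert i of every group on GPU (i % num_gpus).

    expert_params_per_group[g]: parameter count of one expert in group g
    (hypothetical sizes). Returns {gpu: [(group, expert, params), ...]}.
    """
    placement = {gpu: [] for gpu in range(num_gpus)}
    for g, params in enumerate(expert_params_per_group):
        for i in range(experts_per_group):
            placement[i % num_gpus].append((g, i, params))
    return placement

def params_per_gpu(placement):
    """Total expert parameters held by each GPU."""
    return {gpu: sum(p for _, _, p in experts)
            for gpu, experts in placement.items()}

# Three groups with increasing expert size, four experts each, four GPUs.
sizes = [1_000_000, 2_000_000, 4_000_000]
placement = allocate(sizes, experts_per_group=4, num_gpus=4)
totals = params_per_gpu(placement)
print(totals)
```

Because each GPU receives exactly one expert from every group, the per-GPU totals are identical regardless of how different the group sizes are. In this sketch the property holds whenever the number of experts per group is a multiple of the number of GPUs.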
These strategies collectively mitigate system-level hardware imbalance endemic to recent heterogeneous MoE architectures, enabling robust industrial deployment without sacrificing throughput or efficiency.
Empirical Results
Experiments span the 1B, 3B, and 14B parameter scales, evaluating MoHGE, standard MoE baselines, and dense variants using zero/few-shot protocols on benchmarks including MMLU, SIQA, GSM8K, LAMBADA, MATH, PIQA, and TriviaQA.
Key numerical findings:
- Parameter Efficiency: At 3B and 14B scales, MoHGE reduces total parameters by ~20% and activated expert parameters by ~25% relative to MoE, consistently matching or surpassing MoE performance.
- Accuracy Gains: MoHGE achieves the highest or near-highest scores across multiple datasets. For instance, at 14B, SIQA rises from 42.29 (Dense) and 44.28 (MoE) to 45.62 (MoHGE); GSM8K rises from 4.62 (Dense) and 4.92 (MoE) to 5.76 (MoHGE).
- Inference Performance: MoHGE maintains faster or comparable inference durations versus MoE and achieves favorable parameter-performance trade-offs.
- GPU Utilization: Empirical token routing shows ~12–13% per-group routing frequency per GPU (at 14B scale), confirming effective load balancing. Contrary to prior claims that heterogeneous expert sizes lead to hardware imbalance, MoHGE demonstrates uniform GPU utilization even at large scales.
- Token Routing Analysis: When tokens are ranked by difficulty or perplexity, easier tokens are preferentially routed to smaller groups and harder tokens to larger groups, evidencing MoHGE's dynamic adaptation to token complexity.
- Auxiliary Loss Impact: Ablations indicate group-wise and intra-group losses drive routing diversification, reducing reliance on large experts and minimizing activated parameters without hurting task performance.
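To make the role of the two auxiliary losses concrete, the following sketch uses the expert-level balance form from DeepSeek-V2 (the sum over units of f_j * P_j, where f_j is the normalized selection frequency and P_j the mean gate probability). The size-proportional `weights` used for the group-wise variant are an assumption for illustration; the summary does not give the paper's exact formula.

```python
def balance_loss(gate_probs, selections, k, weights=None, alpha=0.01):
    """Balance loss over n routing units (experts or groups).

    gate_probs[t][j]: gate probability of unit j for token t.
    selections[t]: set of units selected for token t (k per token).
    weights[j]: optional per-unit penalty; the size-based weighting used
    below is a hypothetical choice, not the paper's definition.
    """
    T = len(gate_probs)
    n = len(gate_probs[0])
    weights = weights or [1.0] * n
    loss = 0.0
    for j in range(n):
        # f_j: normalized fraction of tokens routed to unit j.
        f_j = n / (k * T) * sum(1 for sel in selections if j in sel)
        # P_j: mean gate probability assigned to unit j.
        p_j = sum(row[j] for row in gate_probs) / T
        loss += weights[j] * f_j * p_j
    return alpha * loss

# Two groups; group 1 has experts four times larger than group 0.
sizes = [1.0, 4.0]
mean_size = sum(sizes) / len(sizes)
size_weights = [s / mean_size for s in sizes]  # assumed weighting

to_small = ([[0.9, 0.1]] * 4, [{0}] * 4)
to_large = ([[0.1, 0.9]] * 4, [{1}] * 4)
balanced = ([[0.9, 0.1]] * 2 + [[0.1, 0.9]] * 2, [{0}] * 2 + [{1}] * 2)

loss_small = balance_loss(*to_small, k=1, weights=size_weights)
loss_large = balance_loss(*to_large, k=1, weights=size_weights)
loss_balanced = balance_loss(*balanced, k=1, weights=size_weights)
intra_collapsed = balance_loss(*to_large, k=1)   # unweighted, intra-group style
intra_balanced = balance_loss(*balanced, k=1)
print(loss_small, loss_balanced, loss_large)
```

Under these assumptions the size-weighted loss is highest when every token goes to the large group, which pushes the router toward smaller groups, while the unweighted form simply penalizes collapse onto any single unit.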
In direct comparison to MoDSE and HMoE at 3B, MoHGE achieves balanced GPU utilization and consistently superior downstream accuracy, whereas HMoE, despite comparable accuracy, suffers from GPU imbalance.
Theoretical and Practical Implications
MoHGE offers a scalable approach for designing MoE architectures that align compute and memory resources with task-specific modeling demands. The hierarchical grouping and routing enable flexible expansion without incurring the deployment bottlenecks typical of naive heterogeneous architectures. By penalizing large-group overuse and enforcing intra-group balance, MoHGE maximizes parameter efficiency and maintains uniform hardware load, unlocking practical scaling for industrial-grade LLMs.
These techniques also provide a framework for future research into dynamic expert allocation, adaptive routing policies, and hardware synchronization. MoHGE's group allocation strategy could be further refined for distributed training at even greater scale or for integration with specialized accelerators. The fact that MoHGE achieves comparable performance with reduced parameters suggests implications for energy efficiency and inference cost in large-scale real-world deployments.
Additionally, MoHGE's routing behavior aligns with linguistic token complexity, potentially offering new avenues for fine-tuned, context-sensitive LLMs that allocate modeling effort based on token-level difficulty. Research into even more granular expert capacity specification, layer-wise heterogeneity, or adaptive grouping algorithms could propel further advances in parameter-efficient, hardware-scalable sparse architectures.
Conclusion
MoHGE advances the state of resource-aware language modeling by introducing group-wise heterogeneous experts, hierarchical routing, and auxiliary losses for parameter and hardware efficiency. Empirical results demonstrate ~20% parameter reduction with maintained or improved task accuracy and rigorously balanced GPU utilization across scales. MoHGE establishes a practical paradigm for industrial deployment of sparse LLMs, and its architectural innovations provide fertile ground for continued exploration of efficient, scalable expert allocation in future AI models.