Mixture of Heterogeneous Grouped Experts for Language Modeling

Published 25 Apr 2026 in cs.CL, cs.AI, and cs.LG | (2604.23108v2)

Abstract: LLMs based on Mixture-of-Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes, creating a rigidity that fails to align computational costs with varying token-level complexity. While heterogeneous expert architectures attempt to address this by diversifying expert sizes, they often suffer from significant system-level challenges, specifically unbalanced GPU utilization and inefficient parameter utilization, which hinder practical deployment. To bridge the gap between theoretical heterogeneity and robust industrial application, we propose Mixture of Heterogeneous Grouped Experts (MoHGE) which introduces a two-level routing mechanism to enable flexible, resource-aware expert combinations. To optimize inference efficiency, we propose a Group-Wise Auxiliary Loss, which dynamically steers tokens to the most parameter-efficient expert groups based on task difficulty. To address the critical deployment challenge of GPU load balancing, we introduce an All-size Group-decoupling Allocation strategy coupled with an Intra-Group Experts Auxiliary Loss. These mechanisms collectively ensure uniform computation distribution across GPUs. Extensive evaluations demonstrate that MoHGE matches the performance of MoE architectures while reducing the total parameters by approximately 20% and maintaining balanced GPU utilization. Our work establishes a scalable paradigm for resource-efficient MoE design, offering a practical solution for optimizing inference costs in real-world scenarios. The code is publicly available at https://github.com/UnicomAI/MoHGE.

Summary

  • The paper presents a novel two-level routing mechanism that groups heterogeneous experts to allocate computational resources based on token complexity.
  • It employs hardware-aware allocation and auxiliary losses to balance GPU workloads and reduce activated parameters by up to 25%.
  • Empirical results demonstrate improved accuracy and inference efficiency across multiple benchmarks and model scales.

Motivation and Architecture

The Mixture-of-Experts (MoE) framework for scaling LLMs is widely adopted because of its sparsity: only a subset of parameters is activated for each token. Standard MoEs enforce homogeneity (all experts have equal size), which restricts their ability to adapt computational resources to token- or task-level complexity. Recent attempts at heterogeneous expert sizes, such as MoDSE and HMoE, suffer from substantial GPU utilization imbalance and inefficient parameter activation, which limits practical deployment at scale.

This paper introduces the Mixture of Heterogeneous Grouped Experts (MoHGE), which organizes experts into groups: experts within a group share the same parameter size, while sizes differ across groups. MoHGE employs a two-level routing mechanism: first, tokens are routed to the groups that match their difficulty; then, an expert gating model selects specific experts within those groups. This hierarchical routing, together with auxiliary losses and a hardware-aware allocation strategy, allows MoHGE to match parameter utilization to computational needs while keeping GPU workloads balanced.

Technical Contributions

Grouped Heterogeneous Experts and Two-Level Routing

MoHGE structures the expert set $\{E_{g,i}\}$ into $N_g$ groups, each with $N$ experts. The hidden dimension $W_i$ of group $G_i$ increases monotonically with the group index, so later groups hold larger experts suited to more complex tokens. The group gating model selects the top-$K_g$ relevant groups based on input centroids, and within those groups the expert gating model picks the top-$K_e$ experts, using scores normalized relative to both group and local context. This dual-level gating yields fine-grained, efficient, and flexible expert selection, outperforming traditional flat routing in expressivity and computational alignment.
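The two-level selection described above can be sketched in plain Python. This is only an illustrative sketch, not the paper's implementation: the dot-product scoring against centroid vectors, the softmax normalization, and the final weight (group probability times in-group expert probability, renormalized over the selected experts) are all assumptions, and every function and variable name is hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_level_route(token, group_centroids, expert_centroids, k_g, k_e):
    """Route one token: pick top-k_g groups, then top-k_e experts in each.

    group_centroids[g] is a vector for group g; expert_centroids[g][i] is a
    vector for expert i of group g. Returns a list of (group, expert, weight).
    """
    # Level 1: score every group and keep the k_g best.
    group_probs = softmax([dot(token, c) for c in group_centroids])
    chosen = []
    for g in top_k(group_probs, k_g):
        # Level 2: score experts inside the chosen group only.
        expert_probs = softmax([dot(token, c) for c in expert_centroids[g]])
        for i in top_k(expert_probs, k_e):
            # Assumed combination rule: group prob * in-group expert prob.
            chosen.append((g, i, group_probs[g] * expert_probs[i]))
    # Renormalize so the selected experts' weights sum to 1.
    total = sum(w for _, _, w in chosen)
    return [(g, i, w / total) for g, i, w in chosen]

# Toy example: 3 groups of 3 experts, 2-dimensional token.
token = [1.0, 0.0]
group_centroids = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
expert_centroids = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    [[0.0, 1.0], [3.0, 0.0], [0.0, -1.0]],
    [[1.0, 1.0], [0.5, 0.0], [2.0, 0.0]],
]
routes = two_level_route(token, group_centroids, expert_centroids, k_g=2, k_e=1)
```

With these toy values the token is sent to one expert in each of the two best-scoring groups, so the number of activated experts is always `k_g * k_e` regardless of how many groups exist.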

Efficiency and GPU Load Balancing

Larger experts risk dominating routing because of their stronger representational power. To counteract this, MoHGE introduces a Group-Wise Auxiliary Loss that penalizes overuse of large groups, which encourages the router to use smaller, parameter-efficient experts for simpler tokens. The All-size Group-decoupling Allocation strategy places the $i$-th expert of every group on the $i$-th GPU, so each GPU holds one expert of every size; this guarantees a uniform parameter count per GPU and prevents bottlenecks. An Intra-Group Experts Auxiliary Loss, adapted from DeepSeek-V2, regularizes routing so that activation is balanced within each group.
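The effect of the allocation rule on per-GPU parameter counts can be checked with a small calculation. The sketch below assumes one GPU per in-group expert index (so the number of GPUs equals $N$), assumes each expert is a two-matrix feed-forward block without biases, and uses made-up hidden dimensions; the function names are hypothetical and the "group per GPU" baseline is included only for contrast.

```python
def expert_params(d_model, hidden):
    """Assumed expert shape: up- and down-projection, no biases."""
    return 2 * d_model * hidden

def decoupled_allocation(group_hidden_dims, experts_per_group, d_model):
    """Place expert i of every group on GPU i; return per-GPU parameter totals."""
    per_gpu = [0] * experts_per_group
    for hidden in group_hidden_dims:
        for i in range(experts_per_group):
            per_gpu[i] += expert_params(d_model, hidden)
    return per_gpu

def group_per_gpu_allocation(group_hidden_dims, experts_per_group, d_model):
    """Contrast case: GPU g holds every expert of group g."""
    return [experts_per_group * expert_params(d_model, hidden)
            for hidden in group_hidden_dims]

# Illustrative sizes only: 4 groups, 4 experts per group, 4 GPUs.
dims = [256, 512, 1024, 2048]
balanced = decoupled_allocation(dims, experts_per_group=4, d_model=512)
skewed = group_per_gpu_allocation(dims, experts_per_group=4, d_model=512)
```

Under these assumptions every entry of `balanced` is identical, because each GPU receives exactly one expert of each size, while `skewed` grows with the group's hidden dimension.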

These strategies collectively mitigate system-level hardware imbalance endemic to recent heterogeneous MoE architectures, enabling robust industrial deployment without sacrificing throughput or efficiency.
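The two auxiliary losses can be illustrated as follows. The summary does not give their exact formulas, so both functions are assumptions: the intra-group term follows the general shape of the DeepSeek-V2 expert-level balance loss (selection frequency times mean routing probability, summed over experts), and the group-wise term is a hypothetical size-weighted penalty chosen only to show the intended direction of the pressure. The coefficients `alpha` and `beta` are placeholders.

```python
def intra_group_balance_loss(probs, k, alpha=0.01):
    """Balance loss within one group.

    probs[t][i] is the routing probability of token t for expert i.
    f_i is the normalized fraction of tokens whose top-k includes expert i,
    p_i is the mean routing probability of expert i.
    """
    num_tokens, n = len(probs), len(probs[0])
    counts = [0] * n
    for row in probs:
        for i in sorted(range(n), key=lambda j: row[j], reverse=True)[:k]:
            counts[i] += 1
    f = [n * c / (k * num_tokens) for c in counts]
    p = [sum(row[i] for row in probs) / num_tokens for i in range(n)]
    return alpha * sum(fi * pi for fi, pi in zip(f, p))

def group_size_penalty(group_probs, group_hidden_dims, beta=0.01):
    """Hypothetical group-wise loss: mean group probability weighted by relative size."""
    num_tokens = len(group_probs)
    total = sum(group_hidden_dims)
    mean_p = [sum(row[g] for row in group_probs) / num_tokens
              for g in range(len(group_hidden_dims))]
    return beta * sum(w / total * p for w, p in zip(group_hidden_dims, mean_p))

# Two tokens, two experts: evenly spread routing vs. collapsed routing.
spread_loss = intra_group_balance_loss([[0.7, 0.3], [0.3, 0.7]], k=1)
collapsed_loss = intra_group_balance_loss([[0.7, 0.3], [0.7, 0.3]], k=1)

# One token, two groups with hidden sizes 256 and 1024.
small_heavy = group_size_penalty([[0.9, 0.1]], [256, 1024])
large_heavy = group_size_penalty([[0.1, 0.9]], [256, 1024])
```

In this sketch the intra-group term is smaller when tokens spread evenly across experts than when they collapse onto one, and the group-wise term is smaller when probability mass sits on the small group, which matches the behavior the text attributes to the two losses.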

Empirical Results

Experiments span the 1B, 3B, and 14B parameter scales, evaluating MoHGE, standard MoE baselines, and dense variants using zero/few-shot protocols on benchmarks including MMLU, SIQA, GSM8K, LAMBADA, MATH, PIQA, and TriviaQA.

Key numerical findings:

  • Parameter Efficiency: At 3B and 14B scales, MoHGE reduces total parameters by ~20% and activated expert parameters by ~25% relative to MoE, consistently matching or surpassing MoE performance.
  • Accuracy Gains: MoHGE achieves the highest or near-highest scores across multiple datasets. For instance, at 14B, SIQA rises from 42.29 (Dense) and 44.28 (MoE) to 45.62 (MoHGE), and GSM8K rises from 4.62 (Dense) and 4.92 (MoE) to 5.76 (MoHGE).
  • Inference Performance: MoHGE maintains faster or comparable inference durations versus MoE and achieves favorable parameter-performance trade-offs.
  • GPU Utilization: Empirical token routing shows ~12–13% per-group routing frequency per GPU (at 14B scale), confirming effective load balancing. Contrary to the expectation that heterogeneous expert sizes lead to hardware imbalance, MoHGE demonstrates uniform GPU utilization even at large scales.
  • Token Routing Analysis: When tokens are ranked by difficulty or perplexity, easier tokens are preferentially routed to smaller groups and harder tokens to larger groups, evidencing MoHGE's dynamic adaptation to token complexity.
  • Auxiliary Loss Impact: Ablations indicate group-wise and intra-group losses drive routing diversification, reducing reliance on large experts and minimizing activated parameters without hurting task performance.
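To put the 14B accuracy figures quoted above on a common footing, the short calculation below derives absolute and relative gains of MoHGE over the MoE baseline. It uses only the numbers reported in this summary; the dictionary layout and variable names are illustrative.

```python
# Scores at the 14B scale as quoted in the findings above.
scores = {
    "SIQA": {"Dense": 42.29, "MoE": 44.28, "MoHGE": 45.62},
    "GSM8K": {"Dense": 4.62, "MoE": 4.92, "MoHGE": 5.76},
}

gains = {}
for name, s in scores.items():
    absolute = s["MoHGE"] - s["MoE"]           # difference in points
    relative = 100.0 * absolute / s["MoE"]     # percent of the MoE score
    gains[name] = (round(absolute, 2), round(relative, 1))

for name, (absolute, relative) in gains.items():
    print(f"{name}: +{absolute} points over MoE ({relative}% relative)")
```

The relative figure for GSM8K is large mainly because the baseline score is small, so the absolute differences are the safer basis for comparison.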

In direct comparison to MoDSE and HMoE at 3B, MoHGE achieves balanced GPU utilization and consistently superior downstream accuracy, whereas HMoE, despite comparable accuracy, suffers from GPU imbalance.

Theoretical and Practical Implications

MoHGE offers a scalable approach for designing MoE architectures that align compute and memory resources with task-specific modeling demands. The hierarchical grouping and routing enable flexible expansion without incurring the deployment bottlenecks typical of naive heterogeneous architectures. By penalizing large-group overuse and enforcing intra-group balance, MoHGE maximizes parameter efficiency and maintains uniform hardware load, unlocking practical scaling for industrial-grade LLMs.

These techniques also provide a framework for future research into dynamic expert allocation, adaptive routing policies, and hardware synchronization. MoHGE's group allocation strategy could be further refined for distributed training at even greater scale or for integration with specialized accelerators. The fact that MoHGE achieves comparable performance with reduced parameters suggests implications for energy efficiency and inference cost in large-scale real-world deployments.

Additionally, MoHGE's routing behavior aligns with linguistic token complexity, potentially offering new avenues for fine-tuned, context-sensitive LLMs that allocate modeling effort based on token-level difficulty. Research into even more granular expert capacity specification, layer-wise heterogeneity, or adaptive grouping algorithms could propel further advances in parameter-efficient, hardware-scalable sparse architectures.

Conclusion

MoHGE advances the state of resource-aware language modeling by introducing group-wise heterogeneous experts, hierarchical routing, and auxiliary losses for parameter and hardware efficiency. Empirical results demonstrate ~20% parameter reduction with maintained or improved task accuracy and rigorously balanced GPU utilization across scales. MoHGE establishes a practical paradigm for industrial deployment of sparse LLMs, and its architectural innovations provide fertile ground for continued exploration of efficient, scalable expert allocation in future AI models.
