Taming Sparsely Activated Transformer with Stochastic Experts

Published 8 Oct 2021 in cs.CL and cs.LG | (2110.04260v3)

Abstract: Sparsely activated models (SAMs), such as Mixture-of-Experts (MoE), can easily scale to have outrageously large amounts of parameters without significant increase in computational cost. However, SAMs are reported to be parameter inefficient such that larger models do not always lead to better performance. While most on-going research focuses on improving SAMs models by exploring methods of routing inputs to experts, our analysis reveals that such research might not lead to the solution we expect, i.e., the commonly-used routing methods based on gating mechanisms do not work better than randomly routing inputs to experts. In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts). Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference. THOR models are trained using a consistency regularized loss, where experts learn not only from training data but also from other experts as teachers, such that all the experts make consistent predictions. We validate the effectiveness of THOR on machine translation tasks. Results show that THOR models are more parameter efficient in that they significantly outperform the Transformer and MoE models across various settings. For example, in multilingual translation, THOR outperforms the Switch Transformer by 2 BLEU scores, and obtains the same BLEU score as that of a state-of-the-art MoE model that is 18 times larger. Our code is publicly available at: https://github.com/microsoft/Stochastic-Mixture-of-Experts.

Citations (94)

Summary

  • The paper introduces THOR, which replaces complex gating networks with randomized expert activation to improve parameter utilization in neural machine translation.
  • It employs a consistency regularized loss to align expert predictions, enhancing model generalization across low-resource, rich-resource, and multilingual settings.
  • Empirical results demonstrate THOR’s scalability, outperforming the Switch Transformer by 2 BLEU points while maintaining a smaller, more efficient architecture.


The paper "Taming Sparsely Activated Transformer with Stochastic Experts" introduces an approach to improving the parameter efficiency of Sparsely Activated Models (SAMs), notably Mixture-of-Experts (MoE) models, in neural machine translation. SAMs can scale to a massive number of parameters without a proportional increase in computational cost, yet they are reported to use those parameters inefficiently: larger models do not always perform better. The paper's analysis of the gating mechanisms commonly used to route inputs to experts finds that they perform no better than routing inputs at random, calling their efficacy into question.

The proposed model, THOR (Transformer witH StOchastic ExpeRts), simplifies the architecture of expert-based models by activating experts stochastically: the routing of inputs to experts is randomized during both training and inference. This sidesteps the load-imbalance issues endemic to gating-based routing, removes the need for a learned gating network (reducing model complexity), and ensures every expert receives adequate training signal.
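To make the routing idea concrete, here is a minimal NumPy sketch of a THOR-style feed-forward layer. The expert shapes, initialization, and layer structure are illustrative assumptions (the paper uses Transformer feed-forward sublayers); the point is that the layer samples one expert uniformly at random, with no gating network, for training and inference alike.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_expert(d_model, d_ff, rng):
    """One feed-forward expert: two linear maps with a ReLU in between.
    Shapes and init scale are illustrative, not the paper's exact config."""
    return {
        "W1": rng.normal(0.0, 0.02, (d_model, d_ff)),
        "W2": rng.normal(0.0, 0.02, (d_ff, d_model)),
    }

def thor_layer(x, experts, rng):
    """THOR-style routing: pick one expert uniformly at random.
    No gating network; the same random routing is used at inference."""
    e = experts[rng.integers(len(experts))]
    h = np.maximum(x @ e["W1"], 0.0)  # ReLU
    return h @ e["W2"]

d_model, d_ff, n_experts = 16, 64, 4
experts = [make_expert(d_model, d_ff, rng) for _ in range(n_experts)]
x = rng.normal(size=(2, d_model))  # a batch of two token vectors
y = thor_layer(x, experts, rng)
print(y.shape)  # (2, 16)
```

Because the selection is uniform, every expert is trained on (in expectation) the same share of inputs, which is exactly the balance property that learned gates struggle to achieve.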

A key contribution of the work is the consistency regularized loss. Each expert learns not only from the training data but also from the predictions of other experts, which act as teachers, so that all experts converge toward consistent outputs. This inter-expert agreement serves as a form of regularization that stabilizes training and improves the model's generalization.
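A plausible formulation of this objective, sketched below under stated assumptions: two randomly activated experts each incur a cross-entropy loss against the labels, plus a symmetric-KL term pulling their output distributions together. The symmetric-KL form and the `alpha` weighting hyperparameter are assumptions for illustration; the paper's exact loss may differ in detail.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-9):
    """KL divergence between rows of two probability matrices."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def consistency_loss(logits_a, logits_b, labels, alpha=1.0):
    """Cross-entropy for each of two sampled experts, plus a symmetric-KL
    consistency term encouraging the experts to agree.
    alpha is a hypothetical weighting hyperparameter."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    n = len(labels)
    ce_a = -np.log(pa[np.arange(n), labels] + 1e-9).mean()
    ce_b = -np.log(pb[np.arange(n), labels] + 1e-9).mean()
    consistency = 0.5 * (kl(pa, pb) + kl(pb, pa)).mean()
    return ce_a + ce_b + alpha * consistency

# Toy usage: two experts' logits over 5 classes for 4 examples.
rng = np.random.default_rng(1)
la, lb = rng.normal(size=(4, 5)), rng.normal(size=(4, 5))
labels = np.array([0, 1, 2, 3])
loss = consistency_loss(la, lb, labels)
```

When the two experts already agree (identical logits), the consistency term vanishes and the loss reduces to the sum of the two cross-entropy terms, so the regularizer only penalizes disagreement.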

The effectiveness of THOR is empirically validated across machine translation benchmarks in low-resource, rich-resource, and multilingual settings. The results show that THOR models are more parameter efficient than state-of-the-art MoE and standard Transformer baselines, delivering better performance at smaller model sizes. Notably, THOR outperforms the Switch Transformer by 2 BLEU points in multilingual translation while matching the BLEU score of a state-of-the-art MoE model 18 times its size, exemplifying its scalability advantages.

From a practical standpoint, the THOR framework simplifies deployment, reducing model size without sacrificing performance, thereby potentially lowering operational costs. Theoretical implications extend to the enhancement of model generalization through a dynamically balanced stochastic routing framework, suggesting directions for future research focused on applying similar methodologies to other domains besides translation, including natural language understanding and generation tasks.

In conclusion, this research marks a substantial step toward optimizing sparsely activated models by rethinking the routing problem. The approach challenges conventional routing paradigms and extends the operational scope of SAMs, suggesting that further exploration of stochastic routing and regularization techniques could meaningfully improve existing expert-based systems. Future work may experiment with varying expert sizes or alternative stochastic selection schemes, opening new avenues for the efficient deployment of large-scale neural networks.
