GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Published 13 Dec 2021 in cs.CL | (2112.06905v2)

Abstract: Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.

Summary

  • The paper demonstrates that GLaM achieves superior NLP performance using a sparsely activated MoE architecture that processes only a fraction of its 1.2 trillion parameters per token.
  • It details how GLaM reduces energy usage and computational overhead by using one-third of GPT-3’s training energy and half the FLOPs in inference.
  • GLaM exhibits improvements of 10.2% in zero-shot, 6.3% in one-shot, and 4.4% in few-shot settings, highlighting its efficiency and scalability in diverse NLP tasks.

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

The paper "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts" develops large language models using a sparsely activated mixture-of-experts (MoE) approach to enhance scalability while reducing computational demands. The proposed Generalist Language Model (GLaM) leverages this architecture to achieve competitive performance with fewer computing resources than traditional dense models.

Key Contributions

GLaM is notable for its scale and efficiency. The largest version of GLaM contains 1.2 trillion parameters, approximately seven times larger than GPT-3, yet it uses only one-third of the energy required to train GPT-3 and half the FLOPs at inference. This represents a significant reduction in computational overhead while maintaining superior performance across various NLP benchmarks.
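The headline ratios can be checked with back-of-the-envelope arithmetic. The sketch below takes GPT-3's widely cited 175 billion parameter count (not stated in this summary) as an assumption:

```python
# Sanity-check the "approximately 7x larger than GPT-3" claim.
glam_params = 1.2e12   # largest GLaM, from the abstract
gpt3_params = 175e9    # GPT-3's widely cited parameter count (assumed here)

ratio = glam_params / gpt3_params
print(round(ratio, 1))  # 6.9 -- i.e. roughly 7x, matching the paper's claim
```

The "7x" figure is thus a rounding of roughly 6.9; the energy and inference-FLOP savings come from the sparse activation described under Methodology.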

Numerical Results

The paper compares GLaM against GPT-3 on zero-shot, one-shot, and few-shot performance across 29 NLP tasks. GLaM consistently surpasses GPT-3, with improvements of 10.2% in zero-shot, 6.3% in one-shot, and 4.4% in few-shot settings, illustrating its enhanced learning efficiency. These results emphasize GLaM's potential for energy-efficient learning and robust task performance.

Methodology

GLaM's architecture combines dense and conditional computation, utilizing sparsely activated MoE layers in which each token activates only a small subset of the model's parameters. This approach allows GLaM to process data efficiently, activating only 96.6 billion of the model's 1.2 trillion parameters (roughly 8%) per input token. Additionally, a robust data quality strategy underpins GLaM's high performance, demonstrating that data quality is pivotal even at substantial model sizes.
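The routing idea can be illustrated with a toy sketch. In GLaM each MoE layer routes every token to its top-2 experts out of 64; the NumPy version below is a minimal, simplified illustration of that top-2 gating (tiny dimensions, single-matrix "experts", no load-balancing loss), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def top2_moe_layer(x, w_gate, experts):
    """Sparsely activated MoE feed-forward: each token is routed to its
    top-2 experts, weighted by softmax gate scores. Only 2/n_experts of
    the expert parameters are touched per token."""
    logits = x @ w_gate                         # (n_tokens, n_experts)
    top2 = np.argsort(logits, axis=-1)[:, -2:]  # indices of the 2 best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top2[t]]
        gates = np.exp(sel - sel.max())
        gates /= gates.sum()                    # softmax over the 2 selected experts
        for g, e in zip(gates, top2[t]):
            out[t] += g * experts[e](x[t])      # combine the 2 expert outputs
    return out

d_model, n_experts, n_tokens = 8, 4, 3
w_gate = rng.normal(size=(d_model, n_experts))
# each "expert" here is a toy one-layer feed-forward network
expert_ws = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
experts = [lambda v, w=w: np.tanh(v @ w) for w in expert_ws]

x = rng.normal(size=(n_tokens, d_model))
y = top2_moe_layer(x, w_gate, experts)
print(y.shape)  # (3, 8)
```

Because the gate selects only 2 of the experts per token, total parameter count can grow with the number of experts while per-token compute stays roughly constant, which is the source of GLaM's training and inference savings.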

Implications and Future Directions

The introduction of MoE-based architectures, such as GLaM, signals a promising direction towards achieving high-quality NLP models that are both scalable and energy-efficient. Given GLaM’s strong performance and reduced resource demands, future exploration should focus on refining these sparse architectures and improving model parallelism algorithms.

Further investigation into the optimal balance of data quality and quantity is warranted. Since GLaM shows that quality-filtered datasets yield better outcomes, this insight could guide how datasets are curated and utilized in future large-scale models. Moreover, the potential for application-specific adaptations of GLaM in contexts such as open-domain question answering or language understanding tasks remains fertile ground for exploration.

Conclusion

The paper articulates the advantages of employing MoE architectures in LLMs, as seen with GLaM, which achieves significant advancements in scaling efficiency and performance. By reducing computational costs while enhancing efficacy across a suite of NLP tasks, GLaM represents a viable pathway for developing the next generation of LLMs with practical implications in both energy savings and model scalability.