Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Published 11 Jan 2021 in cs.LG and cs.AI | (2101.03961v3)

Abstract: In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of LLMs by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.

Abstract PDF Upgrade to Chat

Citations (1,718)

View on Semantic Scholar

Summary

The paper introduces a simplified routing mechanism that directs each token to a single expert, significantly reducing computational and communication overhead.
The paper demonstrates up to 7x pre-training speedups and improved training stability by leveraging bfloat16 precision in sparse models.
The paper confirms that multilingual and fine-tuning results are maintained across diverse NLP benchmarks and 101 languages, showcasing scalability.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Introduction

The paper entitled "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" (2101.03961) addresses fundamental limitations in the scalability of Mixture of Experts (MoE) models by introducing the Switch Transformer. While MoE models have achieved significant successes, their adoption is constrained by complexity, high communication costs, and training instability. The Switch Transformer simplifies the MoE routing algorithm, reduces computational and communication costs, and achieves training stability. This paper demonstrates the ability to train large sparse models with reduced precision using bfloat16 formats, leveraging the benefits of sparsely-activated expert models without incurring additional computational costs relative to densely-activated alternatives.

Switch Transformer Architecture

The Switch Transformer extends upon the Transformer architecture by incorporating sparsely-activated layers, known as Switch layers, which activate only a subset of weights for each incoming example. This design philosophy prioritizes a large parameter count while maintaining a constant computational cost, effectively scaling neural LLMs without additional computational burden.

Figure 1: Scaling and sample efficiency of Switch Transformers. Left Plot: Scaling properties for increasingly sparse (more experts) Switch Transformers. Right Plot: Negative log perplexity comparing Switch Transformers to T5 models using the same compute budget.

Simplified Sparse Routing

The Switch Transformer implements a simplified routing mechanism, contrasting with previous top- $k$ expert selection strategies that compute non-trivial gradients across multiple experts. The Switch routing strategy uses a simplified mechanism by assigning each token to a single expert, thereby reducing computational overhead and enhancing training stability.

Figure 2: Illustration of a Switch Transformer encoder block. We replace the dense feed forward network layer present in the Transformer with a sparse Switch FFN layer. The layer operates independently on the tokens in the sequence.

Efficient Sparse Routing Implementation

Utilizing Mesh-Tensorflow for efficient sparse routing, the Switch Transformer ensures stability in routing decisions through dynamically set expert capacities and an auxiliary load balancing loss. The implementation maintains computational efficiency, even when faced with uneven token dispatch across multiple experts.

Figure 3: Illustration of token routing dynamics. Each expert processes a fixed batch-size of tokens modulated by the capacity factor.

Experimental Results

The Switch Transformer has demonstrated significant improvements in training speed and efficiency, with reported speedups of up to 7x in pre-training compared to dense models such as T5-XXL. Key improvements are observed in multilingual contexts, where Switch Transformers achieve consistent gains across all 101 languages, recording speedups of 4x or greater for 91% of languages.

Figure 4: Scaling properties of the Switch Transformer. Despite all models using an equal computational budget, we observe consistent improvements scaling the number of experts.

Figure 5: Speed advantage of Switch Transformer. For a fixed amount of computation and training time, Switch Transformers significantly outperform the dense Transformer baseline.

Fine-Tuning and Distillation

The Switch Transformer exhibits substantial improvements during fine-tuning across diverse NLP tasks, including GLUE, SuperGLUE, SQuAD, and TriviaQA. The architecture supports effective model distillation, compressing large sparse models into dense variants while preserving a significant fraction of the quality improvements.

Multilingual and Scaling Design

In multilingual training, Switch Transformers demonstrate effective scaling properties, displaying mean step speed-ups against mT5-Base. These models incorporate a blend of data, model, and expert-parallelism, achieving parameter scales up to 1.6 trillion, showcasing enhanced sample efficiency and reduced training duration compared to large-scale dense models.

Figure 6: Multilingual pre-training on 101 languages. Improvements of Switch T5 Base model over dense baseline when multi-task training on 101 languages.

Figure 7: Multilingual pre-training on 101 languages. We histogram for each language, the step speedup of Switch Transformers over the FLOP matched T5 dense baseline to reach the same quality.

Conclusion

The Switch Transformer establishes a paradigm shift in the design and scalability of large-scale LLMs through its focus on sparse activation and computational efficiency. Its adoption may catalyze future advancements in sparsely-activated architectures, embracing both multilingual contexts and expansive parameter scales.

Markdown Report Issue