
Foundation Transformers

Published 12 Oct 2022 in cs.LG, cs.CL, and cs.CV | (2210.06423v2)

Abstract: A big convergence of model architectures across language, vision, speech, and multimodal is emerging. However, under the same name "Transformers", the above areas use different implementations for better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of Foundation Transformer for true general-purpose modeling, which serves as a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill the goal. Specifically, we propose Sub-LayerNorm for good expressivity, and the initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate its superior performance and better stability than the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT, and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).

Summary

  • The paper introduces Magneto, a unified Transformer architecture that harmonizes Pre-LN, Post-LN, and the novel Sub-LN normalization.
  • It demonstrates superior performance on language, vision, speech, and multimodal tasks, with notable gains in few-shot learning and machine-translation BLEU, along with reduced word error rates in speech recognition.
  • The paper’s stable initialization strategy and comprehensive evaluations suggest Magneto's potential to streamline model training and deployment across diverse AI applications.

Foundation Transformers: An Overview

The paper "Foundation Transformers" introduces a new approach to unify the implementation of Transformer models across multiple domains such as language, vision, speech, and multimodal tasks. Recognizing the current disparity in Transformer configurations—like the Pre-LayerNorm (Pre-LN) for GPT and vision models or Post-LayerNorm (Post-LN) for BERT—the authors propose a single architecture adaptable to diverse applications, named Magneto.
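The Pre-LN/Post-LN disparity the paper starts from comes down to where layer normalization sits relative to the residual connection. A minimal NumPy sketch of the two orderings (here `sublayer` stands in for either attention or the feed-forward network; all names are illustrative, not from the paper's code):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last dimension to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN (BERT-style): normalize AFTER the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN (GPT/ViT-style): normalize the sublayer input;
    # the residual path itself stays unnormalized.
    return x + sublayer(layer_norm(x))
```

Post-LN tends to give better expressivity but becomes hard to train at depth; Pre-LN trains more stably but leaves the residual stream unnormalized. Sub-LN is the paper's attempt to get both properties.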

Key Contributions

The authors aim to address several significant problems in current Transformer models:

  1. Unified Architecture: Magneto serves as a general-purpose backbone, reconciling the divergent Transformer implementations currently used across different modalities.
  2. Sub-LayerNorm (Sub-LN): Proposed as an enhancement on existing LayerNorm strategies, Sub-LN incorporates an additional LayerNorm in each sublayer, promoting better model expressivity.
  3. Stable Initialization: Following the theoretical insights from DeepNet, they deploy a novel initialization strategy intended to enhance training stability, thus supporting better scalability and reducing the model development burden.
  4. Comprehensive Evaluation: Through extensive experiments across commonly used models and tasks, Magneto consistently exceeds the performance of existing Transformer variants.
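Contributions 2 and 3 can be sketched together: Sub-LN places a second LayerNorm inside each sublayer (one on the input, one just before the output projection), and the DeepNet-derived initialization scales selected weights by a depth-dependent factor at init. The exact scaling formula from the paper is not reproduced here; `gamma` and all class/parameter names below are illustrative assumptions, shown for a feed-forward sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class SubLNFeedForward:
    """Feed-forward sublayer with Sub-LN: a LayerNorm on the input and a
    second LayerNorm before the output projection. The output-projection
    weights are scaled by a depth-dependent gamma at initialization
    (DeepNet-style); gamma here is a placeholder, not the paper's formula."""

    def __init__(self, d_model, d_ff, gamma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
        # gamma-scaled init of the output projection for training stability.
        self.w2 = gamma * rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)

    def __call__(self, x):
        h = np.maximum(layer_norm(x) @ self.w1, 0.0)  # first LN, then ReLU
        return x + layer_norm(h) @ self.w2            # second LN before out-proj
```

The same pattern applies to the attention sublayer, where the second LayerNorm sits before the attention output projection.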

Experimental Insights

Magneto's performance was evaluated through tasks including language modeling (BERT, GPT), vision modeling (ViT/BEiT), speech recognition, and multimodal integration (BEiT-3). Notable results from the experiments include:

  • Causal Language Modeling: Magneto demonstrated significant improvements in in-context learning tasks, outperforming both the standard Pre-LN models used in GPT and Normformer, particularly in zero-shot and few-shot settings.
  • Masked Language Modeling (MLM): It surpassed Post-LN and Pre-LN versions of BERT in the standard GLUE benchmarks, reflecting superior performance in fine-tuning tasks.
  • Machine Translation: On the OPUS-100 benchmark, Magneto delivered improved BLEU scores compared to Pre-LN and Normformer configurations.
  • Vision and Vision-Language Tasks: In the domain of computer vision, Magneto achieved higher accuracy and robustness on ImageNet and its variants, as well as improved semantic segmentation results on ADE20k. Furthermore, vision-language pretraining yielded better outcomes on VQA and NLVR2 benchmarks.
  • Speech Recognition: Across different model sizes, Magneto exhibited a noticeable reduction in word error rates (WER) on the LibriSpeech dataset compared to the Transformer baselines.

Implications and Future Directions

The introduction of Magneto advances the prospect of a singular, versatile Transformer architecture that could effectively cater to a variety of tasks without the necessity for task-specific adaptations. Such a model simplifies hardware optimization, potentially making pretrained models more reusable and adaptable across different applications.

Theoretically grounded training stability promises a more predictable path to scaling Transformer models, reducing the overhead of hyperparameter tuning and training-run supervision. This lays groundwork for further research into scaling Transformers efficiently.

Future explorations could involve refining the Sub-LayerNorm mechanism, extending Magneto to larger integrated datasets or more complex multimodal tasks, and further validating its claimed advantages in diverse real-world scenarios. A unified Transformer architecture could significantly streamline both machine learning research and practical deployment across AI domains.
