MatFormer: Nested Transformer for Elastic Inference
Abstract: Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, the substantial costs associated with training these models often limit the number of unique model sizes that can be offered. Consequently, practitioners are compelled to select a model that may not be optimally aligned with their specific latency and cost requirements. We present MatFormer, a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints. MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model. During training, we optimize the parameters of multiple nested FFN blocks of varying sizes, enabling the extraction of hundreds of accurate smaller models without incurring additional computational cost. We empirically validate the efficacy of MatFormer across model classes (decoders and encoders) and modalities (language and vision), demonstrating its potential for real-world deployment. We show that an 850M decoder-only MatFormer LLM (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters, each exhibiting better validation loss and one-shot downstream evaluations than its independently trained counterpart. Furthermore, we observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval. Finally, we show that speculative decoding with the accurate and consistent submodels extracted from MatFormer can lead to a significant reduction in inference latency. Project website: https://devvrit.github.io/matformer/
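To make the nested FFN structure concrete, below is a minimal PyTorch sketch. The names (`NestedFFN`, `granularities`, the toy `training_step`) are illustrative assumptions, not the paper's released code. It captures the core idea the abstract describes: one shared pair of projection matrices, with each smaller FFN taken as a prefix slice of the full one, so training the nested sub-blocks jointly yields many extractable submodels from a single set of weights.

```python
# Minimal sketch of a MatFormer-style nested FFN block (illustrative, not the
# authors' implementation). Sub-block g uses only the first m_g hidden units,
# so every smaller FFN is contained inside the larger ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedFFN(nn.Module):
    """One FFN whose smaller variants are prefix slices of the full weights."""

    def __init__(self, d_model: int, d_ff: int, granularities=(0.25, 0.5, 1.0)):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)  # up-projection, shared by all sub-blocks
        self.w2 = nn.Linear(d_ff, d_model)  # down-projection, shared by all sub-blocks
        # Nested hidden sizes m_1 < m_2 < ... <= d_ff.
        self.sizes = [int(f * d_ff) for f in granularities]

    def forward(self, x: torch.Tensor, g: int) -> torch.Tensor:
        m = self.sizes[g]
        # Slice the first m rows of w1 and the first m columns of w2 to get the
        # g-th nested sub-block; g = len(self.sizes) - 1 recovers the full FFN.
        h = F.gelu(F.linear(x, self.w1.weight[:m], self.w1.bias[:m]))
        return F.linear(h, self.w2.weight[:, :m], self.w2.bias)

def training_step(ffn: NestedFFN, x: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # One plausible joint-training scheme (an assumption; the paper details the
    # exact recipe): sample a granularity per step with a toy regression loss
    # standing in for the model's real objective, so all nested sub-blocks are
    # optimized over the course of training.
    g = torch.randint(len(ffn.sizes), ()).item()
    return F.mse_loss(ffn(x, g), target)

if __name__ == "__main__":
    ffn = NestedFFN(d_model=64, d_ff=256)
    x = torch.randn(8, 64)
    small, full = ffn(x, 0), ffn(x, 2)  # 25% and 100% of the FFN width
```

After training, extracting a smaller model amounts to materializing the prefix slices (e.g., `w1.weight[:m]`) into a standalone FFN, which is what allows many submodels to be read off one trained network.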