GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values
Abstract: Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing model size. Our experiments with various GQKVA variants reveal a clear trade-off between performance and model size, allowing practitioners to choose a variant suited to their resource and time constraints. Our findings also indicate that conventional multi-head attention is not always the best choice, since lighter and faster alternatives are available. Applied to ViT on image classification, our method achieved an approximately 0.3% increase in accuracy while reducing model size by about 4%. Our most aggressive size-reduction experiment shrank the model by approximately 15% with only around a 1% drop in accuracy.
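As a rough illustration (not the authors' implementation), the grouping idea behind GQKVA can be sketched as attention in which queries are partitioned into g_q groups and keys/values into g_kv groups, with every query group attending against every key/value group; the function name and tensor shapes below are assumptions for this sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqkva_attention(x, Wq, Wk, Wv):
    """Illustrative grouped query/key/value attention (hypothetical shapes).

    x : (seq, d_model) input sequence
    Wq: (g_q, d_model, d_head)  one query projection per query group
    Wk: (g_kv, d_model, d_head) one key projection per key/value group
    Wv: (g_kv, d_model, d_head) one value projection per key/value group

    Each of the g_q query groups attends with each of the g_kv
    key/value groups, producing g_q * g_kv outputs that are
    concatenated, analogous to concatenating heads in MHA.
    """
    g_q, g_kv = Wq.shape[0], Wk.shape[0]
    outs = []
    for i in range(g_q):
        q = x @ Wq[i]                                   # (seq, d_head)
        for j in range(g_kv):
            k = x @ Wk[j]                               # (seq, d_head)
            v = x @ Wv[j]                               # (seq, d_head)
            scores = q @ k.T / np.sqrt(q.shape[-1])     # scaled dot-product
            outs.append(softmax(scores) @ v)            # (seq, d_head)
    return np.concatenate(outs, axis=-1)                # (seq, g_q*g_kv*d_head)
```

Setting g_q = g_kv = 1 recovers single-head-style attention, while sharing a single key/value group across many query groups resembles multi-query/grouped-query attention; fewer projection matrices is what yields the parameter savings.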