GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values

Published 6 Nov 2023 in cs.LG, cs.AI, cs.CV, and cs.CL (arXiv:2311.03426v2)

Abstract: Massive transformer-based models face several challenges, including slow, computationally intensive pre-training and over-parametrization. This paper addresses these challenges with a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing model size. Our experiments with various GQKVA variants reveal a clear trade-off between performance and model size, allowing practitioners to choose a configuration suited to their resource and time constraints. Our findings also indicate that conventional multi-head attention is not always the best choice, as lighter and faster alternatives are available. Applied to ViT on image classification, our method achieved an approximately 0.3% increase in accuracy while reducing model size by about 4%. Our most aggressive variant reduced model size by approximately 15% with only around a 1% drop in accuracy.
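The abstract does not spell out the mechanics, but the grouping idea it generalizes can be sketched as follows: partition the attention heads' query projections into g_q groups and the key/value projections into g_kv groups, computing one scaled dot-product attention per (query-group, key/value-group) pair. Setting g_kv = 1 recovers multi-query attention and 1 < g_kv < h recovers GQA. The function name, dimension choices, and weight initialization below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_qkv_attention(x, g_q, g_kv, d_k, seed=0):
    """Sketch of grouped query/key/value attention.

    x: (n, d_model) token embeddings.
    g_q, g_kv: number of query groups and key/value groups.
    One attention map is computed per (query-group, kv-group) pair,
    so g_q * g_kv plays the role of the head count.
    """
    n, d_model = x.shape
    n_pairs = g_q * g_kv
    assert d_model % n_pairs == 0
    d_v = d_model // n_pairs  # value dim chosen so concatenation restores d_model

    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((g_q, d_model, d_k)) / np.sqrt(d_model)
    Wk = rng.standard_normal((g_kv, d_model, d_k)) / np.sqrt(d_model)
    Wv = rng.standard_normal((g_kv, d_model, d_v)) / np.sqrt(d_model)

    outs = []
    for i in range(g_q):
        Q = x @ Wq[i]                      # shared by all kv groups -> fewer params
        for j in range(g_kv):
            K, V = x @ Wk[j], x @ Wv[j]
            A = softmax(Q @ K.T / np.sqrt(d_k))
            outs.append(A @ V)             # (n, d_v)
    return np.concatenate(outs, axis=-1)   # (n, d_model)
```

With g_q = g_kv groups the number of Q and K/V projection matrices drops from h each (as in multi-head attention) to g_q and g_kv, which is where the model-size savings in the abstract come from.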
