Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention

Published 4 Nov 2024 in cs.CL, cs.AI, and cs.LG | arXiv:2411.02063v1

Abstract: Improving the effectiveness and efficiency of LLMs simultaneously is a critical yet challenging research goal. In this paper, we find that low-rank pre-training, normally considered an efficient method that compromises performance, can be scalably effective when the reduced parameters are precisely targeted. Specifically, applying the low-dimensional module only to the attention layer resolves this issue and enhances both effectiveness and efficiency. We refer to this structure as Low-dimensional Projected Attention (LPA) and provide an explanatory analysis. Through extensive experiments at parameter scales of 130M and 370M, and scaling up to 3B, we validate the effectiveness and scalability of LPA. Our results show that the LPA model can save up to 12.4% of training time while achieving an approximately 5% improvement in test perplexity (ppl) and on downstream tasks compared with the vanilla Transformer.
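The core idea described in the abstract, factorizing only the attention projections through a low-dimensional bottleneck while leaving the rest of the Transformer block dense, can be illustrated with a minimal PyTorch sketch. The module names, the rank parameter `r`, and the exact placement of the low-rank factors are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankLinear(nn.Module):
    """Replace a dense d_in x d_out projection with two low-dimensional maps (rank r)."""
    def __init__(self, d_in: int, d_out: int, r: int):
        super().__init__()
        self.down = nn.Linear(d_in, r, bias=False)   # d_in -> r
        self.up = nn.Linear(r, d_out, bias=False)    # r  -> d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class LowDimProjectedAttention(nn.Module):
    """Multi-head self-attention whose Q/K/V projections are low-rank factorized.

    Only the attention layer is modified; feed-forward layers would stay dense,
    which is the targeting the paper argues makes low-rank pre-training effective.
    """
    def __init__(self, d_model: int, n_heads: int, r: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = LowRankLinear(d_model, d_model, r)
        self.k_proj = LowRankLinear(d_model, d_model, r)
        self.v_proj = LowRankLinear(d_model, d_model, r)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # Project through the low-dimensional bottleneck, then split into heads.
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, d))


# Usage example (hypothetical sizes): a 768-dim model with 12 heads and rank 128.
attn = LowDimProjectedAttention(d_model=768, n_heads=12, r=128)
y = attn(torch.randn(2, 16, 768))
print(y.shape)  # torch.Size([2, 16, 768])
```

With rank r well below d_model, each factorized projection stores 2*d_model*r parameters instead of d_model^2, which is where the parameter and training-time savings in the abstract would come from under this reading.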

