Wavelet GPT: Wavelet Inspired Large Language Models

Published 4 Sep 2024 in eess.SP, cs.AI, cs.CL, cs.LG, cs.SD, and eess.AS | (2409.12924v4)

Abstract: LLMs have ushered in a new wave of artificial intelligence advancements impacting every scientific field and discipline. We live in a world where most of the data around us, e.g., text, audio, and music, has a multi-scale structure. This paper infuses LLMs with a traditional signal processing idea, namely wavelets, during pre-training to take advantage of the structure. Without adding any extra parameters to a GPT-style LLM architecture in an academic setup, we achieve the same pre-training performance almost twice as fast in text, audio, and images. This is done by imposing a structure on intermediate embeddings. When trained for the same number of training steps, we achieve significant gains in performance, which is comparable to pre-training a larger neural architecture. Further, we show this extends to the Long Range Arena benchmark and several input representations such as characters, BPE tokens, bytes, waveform, math expression, and image pixels. Our architecture allows every next token prediction access to intermediate embeddings at different temporal resolutions in every decoder block. We hope this will pave the way for incorporating multi-rate signal processing into pre-training.


Summary

  • The paper introduces WaveletGPT, integrating wavelet-based multi-scale representations into transformer embeddings to accelerate large language model pre-training.
  • Experiments show WaveletGPT reaches equivalent pre-training performance 40-60% faster than conventional training across text, symbolic music, and raw audio.
  • This approach offers a more resource-efficient and scalable way to train large language models, benefiting academic and industry settings with limited resources.

Overview of WaveletGPT: Wavelet Inspired LLMs

The paper "WaveletGPT: Wavelet Inspired LLMs" offers a novel approach to accelerating the pre-training of LLMs by integrating multi-scale representation concepts through wavelets. Traditional LLM architectures, such as those based on the GPT (Generative Pretrained Transformer) framework, have been primarily focused on scaling parameters to achieve improved performance across various modalities, including text, audio, and music. However, this study introduces a departure from conventional methods by employing hierarchical signal processing techniques—specifically, wavelets—within the transformer architecture, which significantly enhances pre-training efficiency without adding additional parameters.

Methodology and Core Innovations

The central innovation of this work lies in manipulating intermediate transformer embeddings with a wavelet-based hierarchical structure. Fixed, non-learnable Haar wavelet kernels impose a multi-scale structure on every decoder block's embeddings, so that each next-token prediction has access to these multi-resolution intermediate representations. The operation respects the causality constraint of autoregressive decoding, distinguishing it from non-causal methods that compute over the full sequence context.
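
As a concrete illustration, here is a minimal sketch of how such a causal, Haar-style multi-scale structure could be imposed on intermediate embeddings. This is an illustrative reconstruction, not the authors' code: the function name, the level-to-dimension split, and the window sizes are assumptions.

```python
import torch

def impose_multiscale_structure(x: torch.Tensor, n_levels: int = 4) -> torch.Tensor:
    """Sketch: give intermediate decoder embeddings (batch, seq_len, dim) a
    Haar-style multi-scale structure without adding any parameters.

    The first slice of the embedding dimension keeps its original resolution;
    each subsequent slice is replaced by a causal moving average over the last
    2**level tokens, i.e. a progressively coarser temporal view. Only past
    tokens are used, so autoregressive masking is preserved.
    """
    batch, seq_len, dim = x.shape
    out = x.clone()
    chunk = dim // (n_levels + 1)                     # slice width per level
    for level in range(1, n_levels + 1):
        win = 2 ** level                              # Haar approximation window
        lo, hi = level * chunk, (level + 1) * chunk
        csum = torch.cumsum(x[:, :, lo:hi], dim=1)    # running sums over time
        lagged = torch.zeros_like(csum)
        lagged[:, win:] = csum[:, :-win]              # sums ending `win` steps back
        counts = torch.arange(1, seq_len + 1, device=x.device).clamp(max=win)
        out[:, :, lo:hi] = (csum - lagged) / counts.view(1, -1, 1)
    return out
```

In a GPT-style decoder this operation could sit between blocks, so that the next block, and ultimately every next-token prediction, sees both the full-resolution embedding and its coarser summaries.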

The generalized method restructures the embeddings so that part of the model dimension keeps its original resolution while the remaining dimensions carry Haar-style averages at progressively coarser scales. This transformation exploits the intrinsic hierarchical nature of the data, letting the model capture different levels of abstraction in the input sequence. The study also explores an extension with learnable wavelet kernels, which consistently yields additional performance improvements.
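
Under the same illustrative assumptions, the learnable-kernel extension could be sketched as a causal depthwise convolution per level, initialized to the uniform Haar average and then trained; the module name and dimension split below are hypothetical, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableCausalWaveletMix(nn.Module):
    """Sketch of the learnable variant: each level's fixed Haar average is
    replaced by a trainable causal filter, initialized to the uniform kernel."""

    def __init__(self, dim: int, n_levels: int = 4):
        super().__init__()
        self.chunk = dim // (n_levels + 1)            # first chunk keeps full resolution
        self.kernels = nn.ParameterList(
            [nn.Parameter(torch.full((2 ** level,), 1.0 / 2 ** level))
             for level in range(1, n_levels + 1)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        out = x.clone()
        for level, kernel in enumerate(self.kernels, start=1):
            win = kernel.numel()
            lo, hi = level * self.chunk, (level + 1) * self.chunk
            seg = x[:, :, lo:hi].transpose(1, 2)      # (batch, channels, time)
            seg = F.pad(seg, (win - 1, 0))            # left padding keeps it causal
            weight = kernel.view(1, 1, win).expand(hi - lo, 1, win).contiguous()
            mixed = F.conv1d(seg, weight, groups=hi - lo)   # depthwise causal filter
            out[:, :, lo:hi] = mixed.transpose(1, 2)
        return out
```

Because the filters start at the Haar average, this variant matches the fixed-kernel behaviour at initialization and departs from it only if training finds a better multi-scale mixing.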

Performance and Results

Experimental results demonstrate significant reductions in pre-training time: the same level of performance is reached 40-60% faster than with conventional setups across text, symbolic music, and raw audio. When models are scaled down to typical academic configurations, with reduced model dimension or depth, similar gains in efficiency and performance persist. These results underscore the efficacy of wavelet-infused intermediate embeddings under the limited-resource settings typical of academic environments.

The study quantifies these improvements through extensive benchmarks on datasets such as text8, reporting notable speedups and reductions in negative log-likelihood (NLL), indicative of improved pre-training dynamics. The approach also applies broadly across data modalities, underscoring the versatility of the wavelet-based structure in different contexts.
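
For reference, the NLL reported on character-level benchmarks such as text8 is the average negative log-probability assigned to the next symbol, often converted to bits per character when the logarithm is natural:

```latex
\mathrm{NLL} \;=\; -\frac{1}{T}\sum_{t=1}^{T}\log p_\theta\!\left(x_t \mid x_{<t}\right),
\qquad
\mathrm{bpc} \;=\; \frac{\mathrm{NLL}}{\ln 2}.
```

Lower NLL at the same number of training steps is what the reported speedups translate into.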

Implications and Future Research Directions

This research opens new pathways for incorporating classical signal processing techniques into contemporary AI architectures, specifically targeting the pre-training phase of large-scale models. By optimizing the internal structure of intermediate representations rather than simply scaling parameter counts, WaveletGPT contributes to a more resource-efficient LLM training paradigm.

The implications extend to more sustainable AI development, reduced computational overhead, and a scalable framework for academia and industry operating under resource constraints. The authors also highlight the approach's potential across architectures and contexts, from simple linear transformer baselines on the Long Range Arena (LRA) tasks to more sophisticated hybrid models.

Future work may deepen the integration of multi-rate signal processing and investigate how wavelet-inspired representations can support real-time processing tasks such as adaptive inference or fine-tuning in dynamic environments. Expanding the learnable wavelet kernels could also give finer control over the representation hierarchy, optimizing it for specific tasks or datasets.
