WaveletGPT: Wavelet Inspired Large Language Models
Abstract: LLMs have ushered in a new wave of artificial intelligence advances impacting every scientific field and discipline. Much of the data around us, e.g., text, audio, and music, has a multi-scale structure. This paper infuses LLMs with a traditional signal processing idea, namely wavelets, during pre-training to take advantage of that structure. Without adding any extra parameters to a GPT-style LLM architecture in an academic setup, we achieve the same pre-training performance almost twice as fast on text, audio, and images. This is done by imposing a structure on the intermediate embeddings. When trained for the same number of training steps, we achieve significant gains in performance, comparable to pre-training a larger neural architecture. We further show that this extends to the Long Range Arena benchmark and to several input representations: characters, BPE tokens, bytes, waveforms, math expressions, and image pixels. Our architecture gives every next-token prediction access to intermediate embeddings at different temporal resolutions in every decoder block. We hope this will pave the way for incorporating multi-rate signal processing ideas into pre-training.
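The abstract's central mechanism, giving every next-token prediction access to intermediate embeddings at several temporal resolutions without adding parameters, can be sketched as a causal, Haar-wavelet-style moving average applied between decoder blocks. The function below is a minimal illustration only, assuming a (batch, seq_len, dim) embedding tensor and an even split of embedding dimensions across resolution levels; the `haar_multiscale` name, the dimension split, and the number of levels are our assumptions for exposition, not the authors' released code.

```python
import torch
import torch.nn.functional as F


def haar_multiscale(x: torch.Tensor, num_levels: int = 4) -> torch.Tensor:
    """Impose a Haar-style multi-resolution structure on intermediate embeddings.

    x: (batch, seq_len, dim) output of a decoder block.
    Each contiguous slice of embedding dimensions is replaced by a *causal*
    moving average of length 2**level (the Haar approximation at that level),
    so the next-token prediction sees the sequence at several temporal
    resolutions while the layer itself gains no new parameters.
    """
    B, T, D = x.shape
    chunk = D // num_levels  # assumption: dimensions split evenly across levels
    out = x.clone()          # level 0 (dims [0:chunk]) keeps full resolution
    for level in range(1, num_levels):
        k = 2 ** level                        # averaging window at this level
        lo, hi = level * chunk, (level + 1) * chunk
        seg = x[:, :, lo:hi].transpose(1, 2)  # (B, C, T) for 1-D pooling
        seg = F.pad(seg, (k - 1, 0))          # left-pad only: stays causal
        seg = F.avg_pool1d(seg, kernel_size=k, stride=1)
        out[:, :, lo:hi] = seg.transpose(1, 2)
    return out


# Usage: drop between decoder blocks of a GPT-style model.
h = torch.randn(2, 128, 512)      # (batch, seq_len, dim)
h = haar_multiscale(h)            # same shape, multi-scale structured
```

Because the averaging uses only past positions, the modification preserves the autoregressive property required for next-token prediction; how the paper assigns dimensions to resolutions may differ from this even split.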