Wavelet GPT: Wavelet Inspired Large Language Models

Published 4 Sep 2024 in eess.SP, cs.AI, cs.CL, cs.LG, cs.SD, and eess.AS | (2409.12924v4)

Abstract: LLMs have ushered in a new wave of artificial intelligence advancements impacting every scientific field and discipline. We live in a world where most of the data around us, e.g., text, audio, and music, has a multi-scale structure. This paper infuses LLMs with a traditional signal processing idea, namely wavelets, during pre-training to take advantage of the structure. Without adding any extra parameters to a GPT-style LLM architecture in an academic setup, we achieve the same pre-training performance almost twice as fast in text, audio, and images. This is done by imposing a structure on intermediate embeddings. When trained for the same number of training steps, we achieve significant gains in performance, which is comparable to pre-training a larger neural architecture. Further, we show this extends to the Long Range Arena benchmark and several input representations such as characters, BPE tokens, bytes, waveform, math expression, and image pixels. Our architecture allows every next token prediction access to intermediate embeddings at different temporal resolutions in every decoder block. We hope this will pave the way for incorporating multi-rate signal processing into pre-training.


Summary

  • The paper introduces WaveletGPT, integrating wavelet-based multi-scale representations into transformer embeddings to accelerate large language model pre-training.
  • Experiments show WaveletGPT reaches equivalent pre-training performance 40-60% faster than conventional training across text, symbolic music, and raw audio.
  • This approach offers a more resource-efficient and scalable way to train large language models, benefiting academic and industry settings with limited resources.

Overview of WaveletGPT: Wavelet Inspired LLMs

The paper "WaveletGPT: Wavelet Inspired LLMs" offers a novel approach to accelerating the pre-training of LLMs by integrating multi-scale representation concepts through wavelets. Traditional LLM architectures, such as those based on the GPT (Generative Pretrained Transformer) framework, have been primarily focused on scaling parameters to achieve improved performance across various modalities, including text, audio, and music. However, this study introduces a departure from conventional methods by employing hierarchical signal processing techniques—specifically, wavelets—within the transformer architecture, which significantly enhances pre-training efficiency without adding additional parameters.

Methodology and Core Innovations

The central innovation of this work lies in manipulating intermediate transformer embeddings with a wavelet-based hierarchical structure. Fixed, non-learnable Haar wavelet kernels impose a multi-scale structure on every decoder block's embeddings, so that each next-token prediction has access to these multi-resolution intermediate representations. The operation respects the causality constraint of autoregressive decoding, distinguishing it from non-causal methods that compute over the full sequence context.
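
As a concrete illustration, here is a minimal sketch of how such a causal, Haar-style multi-scale structure could be imposed on intermediate embeddings. This is an illustrative reconstruction, not the authors' code: the function name, the level-to-dimension split, and the window sizes are assumptions.

```python
import torch

def impose_multiscale_structure(x: torch.Tensor, n_levels: int = 4) -> torch.Tensor:
    """Sketch: give intermediate decoder embeddings (batch, seq_len, dim) a
    Haar-style multi-scale structure without adding any parameters.

    The first slice of the embedding dimension keeps its original resolution;
    each subsequent slice is replaced by a causal moving average over the last
    2**level tokens, i.e. a progressively coarser temporal view. Only past
    tokens are used, so autoregressive masking is preserved.
    """
    batch, seq_len, dim = x.shape
    out = x.clone()
    chunk = dim // (n_levels + 1)                     # slice width per level
    for level in range(1, n_levels + 1):
        win = 2 ** level                              # Haar approximation window
        lo, hi = level * chunk, (level + 1) * chunk
        csum = torch.cumsum(x[:, :, lo:hi], dim=1)    # running sums over time
        lagged = torch.zeros_like(csum)
        lagged[:, win:] = csum[:, :-win]              # sums ending `win` steps back
        counts = torch.arange(1, seq_len + 1, device=x.device).clamp(max=win)
        out[:, :, lo:hi] = (csum - lagged) / counts.view(1, -1, 1)
    return out
```

In a GPT-style decoder this operation could sit between blocks, so that the next block, and ultimately every next-token prediction, sees both the full-resolution embedding and its coarser summaries.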

The generalized method restructures the embeddings so that part of the model dimension keeps its original resolution while the remaining dimensions carry Haar-style averages at progressively coarser scales. This transformation exploits the intrinsic hierarchical nature of the data, letting the model capture different levels of abstraction in the input sequence. The study also explores an extension with learnable wavelet kernels, which consistently yields additional performance improvements.
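
Under the same illustrative assumptions, the learnable-kernel extension could be sketched as a causal depthwise convolution per level, initialized to the uniform Haar average and then trained; the module name and dimension split below are hypothetical, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableCausalWaveletMix(nn.Module):
    """Sketch of the learnable variant: each level's fixed Haar average is
    replaced by a trainable causal filter, initialized to the uniform kernel."""

    def __init__(self, dim: int, n_levels: int = 4):
        super().__init__()
        self.chunk = dim // (n_levels + 1)            # first chunk keeps full resolution
        self.kernels = nn.ParameterList(
            [nn.Parameter(torch.full((2 ** level,), 1.0 / 2 ** level))
             for level in range(1, n_levels + 1)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        out = x.clone()
        for level, kernel in enumerate(self.kernels, start=1):
            win = kernel.numel()
            lo, hi = level * self.chunk, (level + 1) * self.chunk
            seg = x[:, :, lo:hi].transpose(1, 2)      # (batch, channels, time)
            seg = F.pad(seg, (win - 1, 0))            # left padding keeps it causal
            weight = kernel.view(1, 1, win).expand(hi - lo, 1, win).contiguous()
            mixed = F.conv1d(seg, weight, groups=hi - lo)   # depthwise causal filter
            out[:, :, lo:hi] = mixed.transpose(1, 2)
        return out
```

Because the filters start at the Haar average, this variant matches the fixed-kernel behaviour at initialization and departs from it only if training finds a better multi-scale mixing.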

Performance and Results

Experimental results demonstrate significant reductions in pre-training time: the same level of performance is reached 40-60% faster than with conventional setups across text, symbolic music, and raw audio. When models are scaled down to typical academic configurations, with reduced model dimension or depth, similar gains in efficiency and performance persist. These results underscore the efficacy of wavelet-infused intermediate embeddings under the limited-resource settings typical of academic environments.

The study quantifies these improvements through extensive benchmarks on datasets such as text8, reporting notable speedups and reductions in negative log-likelihood (NLL), indicative of improved pre-training dynamics. The approach also applies broadly across data modalities, underscoring the versatility of the wavelet-based structure in different contexts.
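
For reference, the NLL reported on character-level benchmarks such as text8 is the average negative log-probability assigned to the next symbol, often converted to bits per character when the logarithm is natural:

```latex
\mathrm{NLL} \;=\; -\frac{1}{T}\sum_{t=1}^{T}\log p_\theta\!\left(x_t \mid x_{<t}\right),
\qquad
\mathrm{bpc} \;=\; \frac{\mathrm{NLL}}{\ln 2}.
```

Lower NLL at the same number of training steps is what the reported speedups translate into.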

Implications and Future Research Directions

This research opens new pathways for incorporating classical signal processing techniques into contemporary AI architectures, specifically targeting the pre-training phase of large-scale models. By optimizing the internal structure of intermediate representations rather than simply scaling parameter counts, WaveletGPT contributes to a more resource-efficient LLM training paradigm.

The implications extend to more sustainable AI development, reduced computational overhead, and a scalable framework for academia and industry operating under resource constraints. The authors also highlight the approach's potential across architectures and contexts, from simple linear transformer baselines on the Long Range Arena (LRA) tasks to more sophisticated hybrid models.

Future work may deepen the integration of multi-rate signal processing and investigate how wavelet-inspired representations can support real-time processing tasks such as adaptive inference or fine-tuning in dynamic environments. Expanding the learnable wavelet kernels could also give finer control over the representation hierarchy, optimizing it for specific tasks or datasets.
