Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

Published 11 Jul 2022 in cs.CV and cs.LG | arXiv:2207.04978v1

Abstract: Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks, while the self-attention computation in Transformer scales quadratically w.r.t. the input patch number. Thus, existing solutions commonly employ down-sampling operations (e.g., average pooling) over keys/values to dramatically reduce the computational cost. In this work, we argue that such over-aggressive down-sampling design is not invertible and inevitably causes information dropping especially for high-frequency components in objects (e.g., texture details). Motivated by the wavelet theory, we construct a new Wavelet Vision Transformer (\textbf{Wave-ViT}) that formulates the invertible down-sampling with wavelet transforms and self-attention learning in a unified way. This proposal enables self-attention learning with lossless down-sampling over keys/values, facilitating the pursuing of a better efficiency-vs-accuracy trade-off. Furthermore, inverse wavelet transforms are leveraged to strengthen self-attention outputs by aggregating local contexts with enlarged receptive field. We validate the superiority of Wave-ViT through extensive experiments over multiple vision tasks (e.g., image recognition, object detection and instance segmentation). Its performances surpass state-of-the-art ViT backbones with comparable FLOPs. Source code is available at \url{https://github.com/YehLi/ImageNetModel}.

Citations (114)

Summary

  • The paper introduces Wave-ViT that integrates discrete wavelet transforms into Vision Transformers for lossless down-sampling and enhanced feature resolution.
  • The paper demonstrates that this hybrid design reduces computational cost while preserving high-frequency details, achieving 85.5% top-1 accuracy on ImageNet.
  • The paper shows that the wavelet-based approach improves performance in object detection and segmentation on COCO while using fewer parameters.

An Expert Overview of "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning"

The paper "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning" presents a novel approach to enhancing the efficiency of Vision Transformers (ViTs) by integrating wavelet transforms. This work addresses a fundamental computational challenge with Transformers in visual representation learning—the quadratic scaling of self-attention computation with respect to the number of input patches. The proposed solution, Wavelet Vision Transformer (Wave-ViT), leverages wavelet theory to introduce invertible down-sampling within the Transformer framework, ensuring information is preserved while reducing computational costs.

Theoretical and Architectural Contributions

Wave-ViT introduces a new architectural component termed the Wavelets block, which effectively integrates Discrete Wavelet Transform (DWT) with Transformer self-attention mechanisms. This integration is built on the premise that typical down-sampling operations, such as average pooling, result in information loss—particularly affecting high-frequency components crucial for capturing textural details. The paper posits that the use of wavelet transforms allows for lossless down-sampling, preserving essential image details while decreasing the number of operations required.
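The lossless claim rests on the invertibility of the DWT. A minimal numpy sketch with the Haar wavelet (the simplest orthogonal wavelet; the filters used in the paper may differ) shows a feature map round-tripping exactly through its four subbands:

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT: split an (H, W) map into four
    (H/2, W/2) subbands LL, LH, HL, HH (orthonormal scaling)."""
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    return ((a + b + c + d) / 2,   # LL: local average (low frequency)
            (a + b - c - d) / 2,   # LH: detail subband
            (a - b + c - d) / 2,   # HL: detail subband
            (a - b - c + d) / 2)   # HH: detail subband

def haar_idwt2(ll, lh, hl, hh):
    """Inverse Haar DWT: reconstructs the original map exactly."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

feat = np.random.default_rng(0).standard_normal((8, 8))
recon = haar_idwt2(*haar_dwt2(feat))
print(np.allclose(recon, feat))  # True: the down-sampling drops no information
```

Average pooling, by contrast, keeps only the LL subband and discards the three high-frequency subbands, which is exactly the information loss the paper targets.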

The Wavelets block operates by transforming input keys and values into four wavelet subbands using DWT, which are subsequently processed with convolution operations to impose spatial locality. Following self-attention computation, inverse DWT is applied to obtain a high-resolution feature map, which is combined with the self-attention output. This methodology not only maintains image details but also provides enhanced feature contextualization through enlarged receptive fields.
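The efficiency gain comes from attending over wavelet-down-sampled keys/values while queries keep full resolution. The toy numpy sketch below illustrates only that shape arithmetic; it is not the authors' implementation, and the per-subband convolutions, multi-head structure, and inverse-DWT branch are omitted (`proj` is a hypothetical stand-in for a learned linear layer):

```python
import numpy as np

def haar_subbands(x):
    """Haar DWT over an (H, W, C) map -> four (H/2, W/2, C) subbands."""
    a, b, c, d = x[0::2, 0::2], x[0::2, 1::2], x[1::2, 0::2], x[1::2, 1::2]
    return (a + b + c + d) / 2, (a + b - c - d) / 2, \
           (a - b + c - d) / 2, (a - b - c + d) / 2

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

H = W = 8
C = 16
rng = np.random.default_rng(0)
feat = rng.standard_normal((H, W, C))

# Queries keep full resolution: H*W tokens.
q = feat.reshape(H * W, C)

# Keys/values come from the wavelet-down-sampled map: the four subbands
# are concatenated along channels, giving (H/2)*(W/2) tokens of width 4C,
# then mapped back to C by a (hypothetical) learned projection.
ll, lh, hl, hh = haar_subbands(feat)
kv = np.concatenate([ll, lh, hl, hh], axis=-1).reshape((H // 2) * (W // 2), 4 * C)
proj = rng.standard_normal((4 * C, C)) / np.sqrt(4 * C)
k = v = kv @ proj

attn = softmax(q @ k.T / np.sqrt(C))  # (64, 16): 4x fewer columns than full 64x64
out = attn @ v                        # (64, 16): output stays at full resolution
print(attn.shape, out.shape)
```

The attention matrix shrinks from H·W × H·W to H·W × (H/2)·(W/2), a 4x saving per level, without the subbands discarding any of the original signal.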

Empirical Results

The empirical evaluation of Wave-ViT demonstrates significant advancements over state-of-the-art ViT backbones across diverse vision tasks, including image recognition, object detection, and instance segmentation. Specifically, Wave-ViT achieves a top-1 accuracy of 85.5% on ImageNet for image recognition, surpassing Pyramid Vision Transformer (PVT) with an absolute improvement of 1.7%. In the domain of object detection and segmentation on the COCO dataset, Wave-ViT also exhibits superior performance with 1.3% and 0.5% mAP increases, respectively, while utilizing 25.9% fewer parameters compared to PVT.

The paper makes strong claims regarding the trade-off between computational efficiency and model accuracy enabled by the wavelet-transformed, multi-scale vision architecture. These claims are substantiated through comprehensive experimental validation across different model sizes and numerous benchmark tasks.

Implications and Future Directions

The integration of wavelet transforms in Transformer architectures opens new avenues for further research into efficient visual representation learning. The immediate practicality of Wave-ViT lies in its potential to enhance the representational capacity of vision models while conserving computational resources, making it particularly useful for high-resolution image processing tasks.

Future developments could explore the application of wavelet-augmented Transformers in other modalities, such as audio and time-series data, where multi-scale and frequency-preserving analyses are crucial. Moreover, extending this framework to support even finer granularity of multi-scale feature extraction and attention could drive further improvements in visual understanding tasks.

In conclusion, the "Wave-ViT" paper enriches the field of visual representation learning by unifying wavelet transforms with Transformer architectures, presenting a compelling case for this hybrid approach both theoretically and empirically. Its contributions not only demonstrate strong numerical results but also lay the groundwork for future innovations in AI model design and efficiency.
