
Vision Transformers are Parameter-Efficient Audio-Visual Learners

Published 15 Dec 2022 in cs.CV, cs.CL, cs.LG, cs.SD, and eess.AS | (2212.07983v2)

Abstract: Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained only on visual data, to generalize to audio-visual data without finetuning any of its original parameters. To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT. To efficiently fuse visual and audio cues, our LAVISH adapter uses a small set of latent tokens, which form an attention bottleneck, thus, eliminating the quadratic cost of standard cross-attention. Compared to the existing modality-specific audio-visual methods, our approach achieves competitive or even better performance on various audio-visual tasks while using fewer tunable parameters and without relying on costly audio pretraining or external audio encoders. Our code is available at https://genjib.github.io/project_page/LAVISH/

Citations (58)

Summary

  • The paper introduces the LAVISH adapter, which injects a minimal number of trainable parameters into frozen ViTs to enable efficient audio-visual learning.
  • It achieves competitive performance with only 10.1 million trainable parameters, reaching 81.1% accuracy on audio-visual event localization.
  • The approach leverages bi-directional cross-modal fusion to improve audio-visual reasoning in tasks such as segmentation and question answering.

Vision Transformers as Parameter-Efficient Audio-Visual Learners

The paper "Vision Transformers are Parameter-Efficient Audio-Visual Learners" studies whether frozen Vision Transformers (ViTs), pretrained solely on visual data, can generalize to audio-visual tasks without finetuning any of their original parameters. To this end, the authors develop a latent audio-visual hybrid (LAVISH) adapter, which equips a pretrained ViT for audio-visual tasks by injecting a small number of trainable parameters into each layer of the frozen backbone.
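The paragraph above describes the core recipe: freeze every original ViT weight and train only small modules injected alongside each layer. A minimal PyTorch sketch of that general pattern follows; the `BottleneckAdapter` class and the timm-style `vit.blocks` attribute are illustrative assumptions, not the paper's actual LAVISH code.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Small trainable module injected alongside a frozen transformer layer.

    Down-projects features to a low-rank bottleneck and back up, with a
    residual connection. Illustrative sketch only, not the authors'
    exact LAVISH module.
    """

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's features intact.
        return x + self.up(self.act(self.down(x)))


def freeze_backbone_and_attach_adapters(vit: nn.Module, dim: int) -> nn.ModuleList:
    """Freeze all original ViT parameters; return one trainable adapter per block.

    Assumes a timm-style ViT that exposes its transformer layers as `vit.blocks`.
    """
    for p in vit.parameters():
        p.requires_grad = False  # original weights stay fixed
    return nn.ModuleList(BottleneckAdapter(dim) for _ in vit.blocks)
```

Only the adapters (plus a task head) would be handed to the optimizer, which is what keeps the trainable-parameter count small.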

Key Contributions

  1. LAVISH Adapter: The paper introduces a latent audio-visual hybrid adapter that adapts pretrained ViTs to audio-visual tasks. A small set of learnable latent tokens mediates the fusion of visual and audio cues, avoiding the quadratic cost of standard cross-attention.
  2. Parameter Efficiency: Unlike modality-specific audio-visual models, which often require extensive audio pretraining or external audio encoders, the proposed approach achieves competitive, and in some cases superior, performance with far fewer tunable parameters, yielding substantial savings in computation and memory.
  3. Cross-Modal Fusion: The bi-directional design of the LAVISH adapter lets information flow in both directions between the audio and visual streams, improving joint audio-visual reasoning on tasks that require integrating both modalities.
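The latent-token bottleneck in item 1 can be pictured as two-step attention: a fixed number K of learnable latents first summarize the source modality, and the target modality then attends only to those K summaries, so the cost grows linearly with the token count instead of quadratically. The PyTorch sketch below illustrates this general idea under those assumptions; the module and its parameter names are hypothetical, not the paper's exact LAVISH layer.

```python
import torch
import torch.nn as nn


class LatentCrossAttention(nn.Module):
    """Attention bottleneck through a small set of learnable latent tokens.

    Full cross-attention between N_t target and N_s source tokens costs
    ~N_t * N_s. Routing through K latents costs ~K * N_s + N_t * K, which
    is linear in the token counts for fixed K. Illustrative sketch only.
    """

    def __init__(self, dim: int, num_latents: int = 8, num_heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.compress = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.expand = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (B, N_t, dim) tokens to update (e.g. visual)
        # source: (B, N_s, dim) tokens providing context (e.g. audio)
        batch = source.shape[0]
        lat = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # 1) K latents attend to all source tokens: cost ~K * N_s.
        summary, _ = self.compress(lat, source, source)
        # 2) Target tokens attend to the K summaries: cost ~N_t * K.
        fused, _ = self.expand(target, summary, summary)
        return target + fused
```

Applying the same module in both directions (audio conditioned on vision, and vice versa) gives the bi-directional fusion described in item 3.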

Experimental Validation

The LAVISH adapter was evaluated on a diverse set of audio-visual tasks and delivered strong performance throughout:

  • Audio-Visual Event Localization: The method achieves high classification accuracy while remaining efficient, and does so without any additional audio pretraining.
  • Audio-Visual Segmentation and Question Answering: The approach performs strongly on segmentation and question-answering benchmarks, underscoring its ability to exploit cross-modal interactions without relying on a separate audio encoder.

Numerical Results

The paper reports favorable comparisons against state-of-the-art audio-visual models, notably 81.1% accuracy on the audio-visual event localization task with the Swin-V2-L backbone. This result is achieved with only 10.1 million trainable parameters, far fewer than prior methods, which often require substantial computational resources.
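Parameter-efficiency figures like the 10.1 million trainable parameters cited above can be checked with a standard count of parameters that have `requires_grad` set. A generic helper, not taken from the paper's codebase:

```python
import torch.nn as nn


def trainable_stats(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total
```

For an adapter-based model, `trainable` would count only the adapter and head weights, while `total` includes the frozen backbone.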

Theoretical and Practical Implications

The findings suggest that vision transformers, commonly used in computer vision, can be effectively adapted for audio-visual tasks, broadening their application scope. This capability, enabled through parameter-efficient adaptations, highlights the potential of leveraging pretrained models across multiple domains without extensive retraining or additional resource allocation.

Future Directions

Given the promising results, future work could extend the LAVISH adapter to other modalities, such as text, or investigate its application in more interactive and complex settings that require nuanced multi-modal reasoning. Further refinement could also focus on scaling the approach, while preserving its efficiency, to larger datasets and real-time processing requirements.

Overall, this paper provides a comprehensive account of adapting ViTs for efficient audio-visual learning, showcasing a method that balances performance and computational efficiency. This work lays a foundation for future research in cross-modal learning using vision transformer architectures.
