Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

Published 12 Nov 2022 in cs.SD and eess.AS | (2211.06687v4)

Abstract: Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-630K and the proposed model are both available to the public.

Summary

The paper presents a novel contrastive learning pipeline that integrates audio and language modalities using feature fusion and keyword-to-caption augmentation.
Trained on the large-scale LAION-Audio-630K dataset, the model achieves superior text-to-audio retrieval, state-of-the-art zero-shot audio classification, and supervised classification results comparable to existing models.
The model handles variable-length audio inputs by fusing global and local audio features.

This paper addresses the challenge of developing effective audio representations by utilizing contrastive learning in a multimodal context involving audio data and natural language descriptions. The study introduces a novel pipeline for contrastive language-audio pretraining that incorporates feature fusion and keyword-to-caption augmentation to enhance model performance across various tasks.

Introduction

Audio data is ubiquitous, but learning good representations from it is held back by labor-intensive annotation, which makes large labeled datasets hard to build. Contrastive learning in the style of CLIP offers a way around this: paired inputs from two modalities are embedded in a shared latent space, so natural-language descriptions can stand in for manual class labels. This work applies that approach to audio and language, aligning natural-language descriptions with the corresponding audio.

Dataset and Methodology

LAION-Audio-630K

The paper releases LAION-Audio-630K, a collection of 633,526 audio-text pairs drawn from multiple data sources. It serves as a large-scale training corpus for contrastive pretraining, where the paired text supervises the audio representation in place of manual class labels. Keyword-to-caption augmentation complements this data by turning keyword-style labels into natural-language captions, so that audio annotated only with tags can also be used as audio-text pairs.
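The idea behind keyword-to-caption augmentation can be illustrated with a deliberately simple stand-in. The sketch below uses a fixed sentence template; this is an assumption made for illustration only, and the summary does not say how the paper generates its captions. The function name `keywords_to_caption` and the template wording are hypothetical.

```python
def keywords_to_caption(keywords):
    """Turn a list of keyword labels into one caption sentence.

    Template-based stand-in (an assumption for illustration); it is not
    the caption generator used in the paper.
    """
    words = [k.strip().lower() for k in keywords if k.strip()]
    if not words:
        raise ValueError("at least one non-empty keyword is required")
    if len(words) == 1:
        body = words[0]
    elif len(words) == 2:
        body = " and ".join(words)
    else:
        # Comma-separated list with a final "and".
        body = ", ".join(words[:-1]) + ", and " + words[-1]
    return f"The sound of {body}."


print(keywords_to_caption(["Rain", "Thunder", "Wind"]))
```

A template like this yields grammatical but repetitive captions, which is one reason a learned caption generator may be preferable in practice.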

Figure 1: The audio length distribution of Epidemic Sound.

Model Architecture

The proposed architecture pairs an audio encoder with a text encoder and trains them jointly on audio-text pairs. The audio encoder options are PANN and HTSAT; the text encoder options are BERT, RoBERTa, and the CLIP transformer. Each encoder projects its modality into a shared embedding space, where matching audio-text pairs are pulled together and mismatched pairs are pushed apart by the contrastive objective.
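A contrastive objective over a shared space is commonly written as a symmetric cross-entropy over the similarity matrix of a batch, as in CLIP. The NumPy sketch below shows that general form; the function names, the fixed `temperature` value, and the use of precomputed embeddings are assumptions for illustration, not the paper's implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair.

    temperature is a fixed placeholder here (an assumption); models of
    this kind often learn it during training.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature           # shape (N, N)
    idx = np.arange(logits.shape[0])
    # Audio-to-text: each row should put its mass on the diagonal entry.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio: same requirement along columns.
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)
```

The loss is small when each audio embedding is most similar to its own caption embedding and grows when pairs are misaligned.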

Figure 2: The architecture of our proposed model, including audio/text encoders, feature fusion, and keyword-to-caption augmentation.

Feature Fusion

To accommodate variable-length audio, the feature fusion mechanism combines global features, which summarize the whole clip, with local features taken from shorter segments. The model can then process inputs of different lengths at a fixed input size, which matters in practice because real-world recordings range from brief clips to long files.
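One way to build global and local views of a variable-length input is sketched below. Every specific in it is an assumption for illustration rather than a detail confirmed by this summary: the global view is a frame-subsampled copy of the whole spectrogram, the local views are three random fixed-length crops with start positions drawn from the front, middle, and back of the clip, short inputs are repeated and zero-padded, and `fuse` uses a fixed weight `alpha` and a plain average where the model would use learned components (the figure caption for the fusion module refers to attentional feature fusion).

```python
import numpy as np


def make_views(mel, d, rng):
    """Return (global_view, local_views) for a (T, F) spectrogram.

    global_view has shape (d, F); local_views has shape (3, d, F).
    """
    T, F = mel.shape
    if T <= d:
        # Short input: repeat whole copies, then zero-pad up to d frames.
        repeated = np.tile(mel, (d // T, 1))
        padded = np.zeros((d, F), dtype=mel.dtype)
        padded[: repeated.shape[0]] = repeated
        return padded, np.stack([padded] * 3)
    # Global view: d evenly spaced frames covering the whole clip.
    idx = np.linspace(0, T - 1, num=d).round().astype(int)
    global_view = mel[idx]
    # Local views: one d-frame crop per third of the valid start positions.
    crops = []
    for starts in np.array_split(np.arange(T - d + 1), 3):
        start = int(rng.choice(starts)) if len(starts) else 0
        crops.append(mel[start:start + d])
    return global_view, np.stack(crops)


def fuse(global_feat, local_feats, alpha):
    """Convex combination of global and local features.

    Averaging the crops and using a fixed alpha are stand-ins (assumptions)
    for learned layers.
    """
    local_feat = local_feats.mean(axis=0)
    return alpha * global_feat + (1.0 - alpha) * local_feat
```

With this layout, every input reaches the encoder at the same size of `d` frames regardless of the original duration.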

Experimental Results

Text-to-Audio Retrieval

The proposed model, using the HTSAT audio encoder with the RoBERTa text encoder, achieves superior performance in text-to-audio retrieval compared with prior methods. The results indicate that both the large-scale training data and the training techniques, namely feature fusion and keyword-to-caption augmentation, contribute to this performance.
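Retrieval quality in a shared embedding space is typically reported as recall at K: the fraction of text queries whose matching audio appears among the K most similar candidates. The sketch below computes it from precomputed embeddings; the function name and the convention that text `i` matches audio `i` are assumptions for illustration, and the summary does not state which metrics the paper reports.

```python
import numpy as np


def recall_at_k(text_emb, audio_emb, k):
    """Fraction of text queries whose paired audio ranks in the top k.

    Assumes row i of text_emb describes row i of audio_emb.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    sims = t @ a.T                               # (num_texts, num_audios)
    # Indices of the k most similar audio clips for each text query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(sims.shape[0])[:, None]
    hits = (topk == targets).any(axis=1)
    return float(hits.mean())
```

Audio-to-text retrieval can be scored the same way by swapping the two arguments.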

Audio Classification

In audio classification, the model achieves state-of-the-art performance in the zero-shot setting and performance comparable to existing models in the supervised (non-zero-shot) setting.
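Zero-shot classification with a contrastive model usually works by embedding one text prompt per class and assigning each audio clip to the class whose prompt embedding is most similar. The sketch below assumes the embeddings have already been computed by the two encoders; the prompt template and function names are hypothetical and chosen for illustration.

```python
import numpy as np


def make_prompts(class_names, template="This is a sound of {}."):
    """Build one text prompt per class (the template is an assumption)."""
    return [template.format(name) for name in class_names]


def zero_shot_classify(audio_emb, class_text_emb, class_names):
    """Assign each audio embedding to the class with the closest prompt.

    audio_emb: (N, D) audio embeddings.
    class_text_emb: (C, D) embeddings of the class prompts, in the same
    order as class_names.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    sims = a @ t.T                               # cosine similarities (N, C)
    return [class_names[i] for i in sims.argmax(axis=1)]
```

No classifier head is trained in this setting; new classes can be added simply by writing and embedding new prompts.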

Figure 3: The attentional feature fusion (AFF) architecture, taken from the AFF work cited in the paper.

Conclusion

The study advances audio-text representation learning by combining large-scale contrastive pretraining with feature fusion and keyword-to-caption augmentation. The augmentation enriches the available text supervision, which supports better generalization. Future research will explore broader dataset applications and refine the model for more downstream audio tasks.

In summary, the paper shows that large-scale contrastive language-audio pretraining can bridge the audio and language modalities, with clear gains in text-to-audio retrieval and zero-shot classification and competitive results in supervised classification.
