Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation

Published 20 Nov 2025 in eess.AS and cs.AI | (2511.16757v1)

Abstract: Audio-language pretraining holds promise for general-purpose audio understanding, yet remains underexplored compared to its vision counterpart. While vision-LLMs like CLIP serve as widely adopted foundations, existing audio-LLMs primarily excel at retrieval tasks with limited adoption as general-purpose encoders. We identify three key barriers: limited large-scale audio-text corpora, insufficient caption diversity, and lack of systematic exploration and evaluation. To this end, we introduce CaptionStew, a 10.7M caption dataset aggregating diverse open-source audio-text corpora across multiple domains and captioning styles. Using this resource, we conduct the first comprehensive evaluation comparing contrastive and captioning objectives for audio representation learning across speech, music, and environmental sound tasks. Our results demonstrate that audio-language pretraining yields competitive, transferable representations. Through systematic data-scaling experiments, we reveal complementary objective strengths: contrastive learning achieves superior data efficiency at smaller scales, while captioning demonstrates better scalability on language-involved audio understanding tasks. We also find that common supervised initialization practices provide diminishing returns at scale, challenging current approaches. These findings establish audio-language pretraining as a viable pathway toward general-purpose audio representations, guiding future research. To accelerate progress, we release data preparation recipes, training protocols, and pretrained models, paving the way toward universal audio understanding.

Abstract PDF Upgrade to Chat

Summary

The paper presents a novel audio-language pretraining framework using the diverse CaptionStew dataset to overcome limitations in data scale and caption diversity.
It systematically compares contrastive and captioning objectives, showing that contrastive methods excel at linear probing while captioning scales better for intricate language tasks.
The research reaffirms audio-language pretraining as a competitive approach for general-purpose audio representation, with broad applicability across audio domains.

"Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation" (2511.16757)

Introduction

Audio-language pretraining represents a promising yet underexplored avenue for general-purpose audio understanding. This study identifies significant limitations in current audio-LLMs, primarily focusing on their restricted scope as efficient general-purpose encoders compared to their vision-language counterparts like CLIP. The paper introduces three key challenges in audio-language pretraining: limited large-scale audio-text corpora, a lack of caption diversity, and insufficient systematic exploration and evaluation. To address these, the authors present CaptionStew, a comprehensive 10.7 million caption dataset sourced from diverse audio-text corpora, and provide a thorough evaluation of contrastive and captioning objectives for audio representation learning. The key contributions of this paper are systematic data-scaling experiments and analyzing performance across varied audio-related tasks, establishing that audio-language pretraining is competitive in this domain.

Language-Audio Pretraining

The paper explores two principal methodologies for audio-language pretraining: contrastive and captioning objectives (Figure 1). Contrastive pretraining aligns audio and textual representations in a shared embedding space using a symmetric InfoNCE loss. It optimizes model training by maximizing similarity between paired samples and minimizing it between non-paired samples. This method enhances data efficiency, demonstrating superior performance, particularly at smaller scales. Conversely, the captioning objective takes a generative approach, focusing on generating textual descriptions from audio. It offers dense token-level supervision, aligning well with developing trends in comprehensive audio understanding systems and scaling better in language-involved audio tasks.

Figure 1: Audio-language pretraining objective studied in this work: contrastive and captioning.

CaptionStew Dataset

The authors address the limitations of existing audio-caption datasets by introducing CaptionStew, an extensive dataset compiling 9.3 million audio samples with 10.7 million captions. This dataset spans multiple domains, such as speech, music, and environmental sounds, enhancing both the scale and diversity necessary for robust model training. CaptionStew emphasizes the importance of caption diversity and semantic richness, drawing from various established open-source corpora. These include Audiocaps and WavCaps, among others, ensuring broad coverage of audio-related semantics. This diverse dataset serves as a foundation for evaluating model performance across different pretraining objectives.

Experimental Setup

The experimental framework leverages the Zipformer-M architecture (Figure 2) for audio encoding, chosen due to its efficiency with long sequences and rapid convergence. The model evaluation spans three primary protocols: linear probing, audio-language alignment, and open-form question answering. This setup enables an analysis of representation capabilities across audio domains, ensuring a comprehensive evaluation of pretrained models. The paper systematically compares pretrained models against several benchmarks using these evaluation protocols.

Figure 2: Model diagram of Zipformer.

Main Results

The study demonstrates that contrastive pretraining tends to outperform the captioning approach on linear probing tasks, especially on discriminative tasks like audio event classification and speaker identification. However, captioning objectives show better scalability for tasks requiring intricate language understanding, especially when paired with generative tasks. The paper also sheds light on the diminishing returns of initializing models with supervised pretraining, challenging established practices in the field. Notably, CaptionStew-trained models show strong cross-domain applicability, suggesting their potential in developing general-purpose audio representations.

Figure 3: Data scaling behavior of contrastive vs. captioning objectives across representative tasks.

Conclusion

The research re-establishes audio-language pretraining as a viable pathway for general-purpose audio representation learning. By demonstrating that diverse linguistic sources can enhance semantic richness and representation robustness, the study sets a new direction for future research. This paper's insights into the scaling behavior of contrastive versus captioning objectives and the impact of supervised initialization offer practical guidance for audio representation learning frameworks. Overall, the authors provide essential tools, models, and datasets that pave the way for future advancements in universal audio understanding.

Markdown Report Issue