SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training

Published 18 Jan 2026 in eess.AS, cs.AI, and cs.SD | (2601.12594v1)

Abstract: Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for various audio-related tasks. However, current CLAP models face several key limitations. First, they are typically trained on relatively small datasets, often comprising a few million audio samples. Second, existing CLAP models are restricted to short and fixed duration, which constrains their usage in real-world scenarios with variable-duration audio. Third, the standard contrastive training objective operates on global representations, which may hinder the learning of dense, fine-grained audio features. To address these challenges, we introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations and incorporates multiple training objectives. SLAP unifies contrastive loss with additional self-supervised and captioning losses in a single-stage training, facilitating the learning of richer dense audio representations. The proposed SLAP model achieves new state-of-the-art performance on audio-text retrieval and zero-shot audio classification tasks, demonstrating its effectiveness across diverse benchmarks.


Summary

The paper introduces SLAP, a unified single-stage framework that trains on a 109M audio-text pair corpus with multiple objectives and handles variable-duration audio of up to 30 seconds.
It employs a redesigned Transformer-based audio encoder that alternates local and global attention to learn dense audio representations.
Experiments report state-of-the-art results on audio-text retrieval and zero-shot audio classification, with competitive results on captioning and tagging.

Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training: An Expert Review

Introduction and Motivation

SLAP (Scalable Language-Audio Pretraining) presents a substantial evolution over prior CLAP architectures, primarily by overcoming limitations in dataset scale, audio duration flexibility, and training objective granularity. Existing CLAP variants often operate on restricted datasets (millions of samples, compared to CLIP's billion-scale), employ audio encoders pretrained on limited-duration inputs, and are confined to global contrastive loss, thus impeding fine-grained feature representation. SLAP addresses these deficiencies by leveraging a 109M audio-text pair corpus, supporting variable-duration audio up to 30 seconds, and integrating a unified single-stage multi-objective regime. This results in robust, dense, generalizable audio representations suitable for both unimodal and multimodal tasks.

SLAP Model Architecture and Training Framework

Transformer-Based Audio Encoder Redesign

The audio encoder in SLAP builds on the Vision Transformer (ViT) but is extensively reengineered for audio. Key modifications include 2D Rotary Positional Embedding (RoPE) to encode position along the time and frequency axes, RMSNorm, bias-free linear layers, and SwiGLU activations. To process long sequences efficiently, SLAP alternates between two attention block types: sliding-window local attention blocks, which keep computation low and capture local features, are interleaved with periodic global attention blocks, which model cross-segment dependencies and global context. This hybrid design supports variable durations while preserving both fine-grained and long-range structure.
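The alternating pattern can be illustrated with attention masks. The following is a minimal sketch, not the authors' implementation: it treats tokens as a 1D sequence (the encoder itself works on 2D time-frequency patches), and the window size, the number of layers, and the rule that every third layer is global are all assumptions chosen for illustration.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j (|i - j| <= window)."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def global_mask(seq_len):
    """Every query may attend to every key."""
    return [[True] * seq_len for _ in range(seq_len)]


def layer_schedule(num_layers, global_every):
    """Hypothetical schedule: every `global_every`-th layer is global, the rest local."""
    return ["global" if (k + 1) % global_every == 0 else "local"
            for k in range(num_layers)]


def build_masks(seq_len, num_layers, window, global_every):
    """One attention mask per layer, following the schedule above."""
    local = sliding_window_mask(seq_len, window)
    full = global_mask(seq_len)
    return [full if kind == "global" else local
            for kind in layer_schedule(num_layers, global_every)]


schedule = layer_schedule(num_layers=6, global_every=3)
masks = build_masks(seq_len=8, num_layers=6, window=1, global_every=3)
attended_local = sum(masks[0][4])   # layer 0 is local: token 4 sees tokens 3, 4, 5
attended_global = sum(masks[2][4])  # layer 2 is global: token 4 sees all 8 tokens
```

In a local layer the number of attended keys per query is bounded by the window, so cost grows linearly with sequence length; only the periodic global layers pay the full pairwise cost.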

Efficient Handling of Variable-Duration Audio

SLAP uses a sequence packing strategy: each mel-spectrogram is partitioned into non-overlapping patches, and the patch sequences of all samples in a batch are concatenated without padding into a single packed input. The packed input is processed as one long sequence with Flash Attention, which keeps attention memory manageable on long sequences and eliminates inter-sample padding. Compared to approaches such as ElasticAST, this yields improved flexibility, computational efficiency, and scale.
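A minimal sketch of packing, under stated assumptions: the patch size, the toy spectrogram shapes, and the choice to drop trailing frames that do not fill a patch are illustrative and not taken from the paper. The cumulative-length list `cu_seqlens` is the boundary format that variable-length attention kernels commonly accept; whether SLAP uses exactly this format is an assumption.

```python
def patchify(spec, patch_t, patch_f):
    """Split a [time][freq] mel-spectrogram into non-overlapping, flattened patches.

    Trailing frames or bins that do not fill a whole patch are dropped
    (an assumption made to keep the sketch short).
    """
    n_t = len(spec) // patch_t
    n_f = len(spec[0]) // patch_f
    patches = []
    for ti in range(n_t):
        for fi in range(n_f):
            patches.append([spec[ti * patch_t + dt][fi * patch_f + df]
                            for dt in range(patch_t) for df in range(patch_f)])
    return patches


def pack(specs, patch_t=2, patch_f=2):
    """Concatenate the patches of all samples without padding.

    Returns the packed token list and cumulative sequence lengths, so that
    sample i occupies packed[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    packed, cu_seqlens = [], [0]
    for spec in specs:
        patches = patchify(spec, patch_t, patch_f)
        packed.extend(patches)
        cu_seqlens.append(cu_seqlens[-1] + len(patches))
    return packed, cu_seqlens


def make_spec(n_frames, n_mels=4):
    """Toy spectrogram whose values encode their own position."""
    return [[float(t * n_mels + f) for f in range(n_mels)] for t in range(n_frames)]


specs = [make_spec(4), make_spec(10), make_spec(6)]  # three clips of different length
packed, cu_seqlens = pack(specs)
padded_tokens = len(specs) * max(b - a for a, b in zip(cu_seqlens, cu_seqlens[1:]))
```

Here the packed batch holds 20 tokens, whereas padding every clip to the longest one would require 30.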

Multi-Objective Unified Training

Unlike prior multi-stage pipelines, SLAP unifies contrastive, self-supervised, and captioning objectives in a single training stage (illustrated in Figure 1).

Figure 1: High-level overview of the unified SLAP framework, with explicit annotation of the self-supervised learning flow (in red).

Contrastive Loss: Adopts CLAP’s bidirectional alignment of global audio and text representations, using multi-head attention pooling in place of class-token pooling for more expressive aggregation.
Self-Supervised Loss: Implements masked audio modeling with a teacher-student architecture in which the teacher is an exponential moving average (EMA) of the student. Random patches are masked, and student features are matched to teacher prototypes via cross-entropy on softmax-normalized outputs, supporting dense representation learning.
Captioning Loss: Uses a shallow Transformer decoder, trained from scratch with teacher forcing, to maximize the conditional likelihood of each caption token given the audio encoder output. This auxiliary loss adds semantic supervision to patch-level features.
The loss weights ($\alpha$, $\beta$, $\gamma$) are selected empirically to balance the three objectives.
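The bidirectional contrastive term and the weighted combination can be sketched as follows. This is an illustration under assumptions rather than the paper's code: the temperature value is arbitrary, the pairing of $\alpha$, $\beta$, $\gamma$ with the contrastive, self-supervised, and captioning losses respectively is assumed, and the self-supervised and captioning losses are passed in as plain numbers.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities; pair i matches pair i."""
    a = [normalize(v) for v in audio_emb]
    t = [normalize(v) for v in text_emb]
    n = len(a)
    sim = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
           for i in range(n)]
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]  # transpose
    audio_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_audio = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (audio_to_text + text_to_audio)


def total_loss(l_con, l_ssl, l_cap, alpha, beta, gamma):
    """Weighted sum of the three objectives (weight-to-loss pairing is assumed)."""
    return alpha * l_con + beta * l_ssl + gamma * l_cap


audio = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]])     # aligned pairs
mismatched = contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]])  # swapped pairs
total = total_loss(1.0, 2.0, 3.0, alpha=1.0, beta=0.5, gamma=0.1)
```

Aligned pairs give a loss near zero while swapped pairs give a large loss, which is the behavior the contrastive objective rewards.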

Experimental Results

Audio-Text Retrieval

SLAP attains SOTA retrieval scores on AudioCaps (R@1: 47.5%, R@5: 79.8%, R@10: 89.2%) and Clotho (text-to-audio R@1: 27.2%, R@5: 55.9%, R@10: 68.3%; audio-to-text scores are higher still), outperforming prior CLAP and multi-objective methods. The gains are attributed to both dataset scale and the model’s ability to ingest variable-length audio without truncation or cropping, which preserves context.
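For readers unfamiliar with the R@K metric, the sketch below shows one common way to compute it from a query-by-candidate similarity matrix. The toy similarities and the use of a set of relevant indices per query (useful when a clip has several reference captions) are illustrative assumptions, not the paper's evaluation code.

```python
def recall_at_k(sim, relevant, k):
    """Fraction of queries with at least one relevant candidate in the top k.

    sim[q][c] is the similarity between query q and candidate c;
    relevant[q] is the set of candidate indices that count as correct for q.
    """
    hits = 0
    for q, scores in enumerate(sim):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if any(c in relevant[q] for c in ranked[:k]):
            hits += 1
    return hits / len(sim)


# Toy example: 3 text queries scored against 4 audio candidates.
sim = [
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct candidate 1 is ranked second
    [0.5, 0.6, 0.3, 0.7],  # correct candidate 2 is ranked last
]
relevant = [{0}, {1}, {2}]
r1 = recall_at_k(sim, relevant, 1)
r2 = recall_at_k(sim, relevant, 2)
r4 = recall_at_k(sim, relevant, 4)
```

With these values, one of three queries succeeds at K=1, two at K=2, and all three at K=4.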

Zero-Shot Audio Classification

Performance is benchmarked on ESC-50, UrbanSound8K (US8K), CREMA-D, RAVDESS, and GTZAN. After fine-tuning on WavCaps, SLAP$_{\text{WavCaps}}$ achieves top-1 accuracy of 95.5% (ESC-50), 83.5% (US8K), 32.2% (CREMA-D), 29.8% (RAVDESS), and 80.5% (GTZAN), demonstrating superior transferability, especially when additional weakly labeled data is available for fine-tuning. Notably, SLAP without this fine-tuning slightly underperforms models trained on cleaner captions, highlighting the impact of annotation quality in pretraining data.
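Zero-shot classification with a language-audio model is typically done by embedding one text prompt per class and picking the class whose prompt is most similar to the audio embedding. The sketch below assumes that setup; the three-dimensional embeddings are hand-made stand-ins for encoder outputs, and the class names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def zero_shot_classify(audio_emb, class_text_embs):
    """Return the class whose prompt embedding is closest to the audio embedding."""
    return max(class_text_embs, key=lambda name: cosine(audio_emb, class_text_embs[name]))


def top1_accuracy(audio_embs, labels, class_text_embs):
    """Share of clips whose predicted class equals the ground-truth label."""
    correct = sum(zero_shot_classify(a, class_text_embs) == y
                  for a, y in zip(audio_embs, labels))
    return correct / len(labels)


# Stand-ins for text-encoder outputs of one prompt per class (hypothetical values).
class_text_embs = {
    "dog bark": [1.0, 0.1, 0.0],
    "rain": [0.0, 1.0, 0.2],
    "siren": [0.1, 0.0, 1.0],
}
# Stand-ins for audio-encoder outputs of four clips.
audio_embs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.6, 0.5, 0.1], [0.0, 0.2, 0.9]]
labels = ["dog bark", "rain", "rain", "siren"]

predictions = [zero_shot_classify(a, class_text_embs) for a in audio_embs]
accuracy = top1_accuracy(audio_embs, labels, class_text_embs)
```

No classifier is trained at any point: the class set can be changed at inference time simply by changing the prompts.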

Audio Captioning

Even with a lightweight decoder and a low captioning loss weight, SLAP matches or surpasses previous CLAP-based models on the METEOR, CIDEr, and SPICE metrics. On AudioCaps, SLAP scores 24.9 (METEOR), 75.1 (CIDEr), and 18.1 (SPICE); on Clotho, it scores 18.1, 43.7, and 13.1, respectively. This supports the efficacy of multi-objective training for semantic feature enrichment.

Audio Tagging

SLAP achieves competitive results in mean average precision and accuracy across AudioSet, ESC-50, and SPC-2, surpassing BLAT and closely rivaling M2D2 (with minor variation due to self-supervised pretraining differences).
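Mean average precision, the usual metric for multi-label tagging, can be computed as sketched below. This is a generic illustration, not the paper's evaluation script; the toy scores are invented, and returning 0 for a class with no positive clips is one convention among several.

```python
def average_precision(scores, targets):
    """Average of precision values measured at the rank of each positive item."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if targets[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(score_matrix, target_matrix):
    """Rows are clips, columns are classes; AP is computed per class, then averaged."""
    n_classes = len(score_matrix[0])
    aps = []
    for c in range(n_classes):
        class_scores = [row[c] for row in score_matrix]
        class_targets = [row[c] for row in target_matrix]
        aps.append(average_precision(class_scores, class_targets))
    return sum(aps) / n_classes


# Toy example: 4 clips, 2 classes, multi-label targets.
scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6], [0.3, 0.1]]
targets = [[1, 0], [0, 1], [1, 1], [0, 0]]
ap_class0 = average_precision([r[0] for r in scores], [r[0] for r in targets])
ap_class1 = average_precision([r[1] for r in scores], [r[1] for r in targets])
m_ap = mean_average_precision(scores, targets)
```

Class 0 has one positive ranked first and one ranked last, giving an AP of 0.75, while class 1 ranks both positives on top, giving 1.0.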

Ablation Analysis

Removal of self-supervised or captioning losses significantly degrades performance on retrieval benchmarks, confirming that dense and semantic objectives are crucial. Excluding local attention causes a modest drop, validating the attention mechanism choices.

Practical and Theoretical Implications

SLAP’s architecture sets a precedent for highly scalable multimodal pretraining in the audio domain, comparable to advances in vision-language models. The ability to process variable-length inputs removes a longstanding barrier to real-world deployment, especially for applications involving unstructured audio streams. Dense, unified representations offer improved performance for retrieval, classification, tagging, and captioning tasks, and simplify downstream adaptation. The results highlight the importance of large-scale, flexible, multi-objective pretraining and suggest that further gains may be realized by scaling both dataset size and annotation quality.

Future Directions

Possible avenues for extension include:

Incorporating external knowledge or domain-specific metadata during pretraining, potentially via joint training with expert-labeled subsets.
Developing more sophisticated annotation pipelines (e.g., leveraging generative LLMs with audio context for higher caption fidelity).
Exploring hierarchical temporal modeling to better capture long-range dependencies in audio.
Adapting the framework for other modalities (e.g., video-audio-language triple alignment).
Investigating robustness to noisy or adversarial inputs, given the scale and diversity of pretraining data.

Conclusion

SLAP advances the state of the art in scalable language-audio representation learning by combining a large-scale dataset, variable-duration support, and unified multi-objective training. Its performance across diverse benchmarks demonstrates its effectiveness and transferability, making it a strong foundation for future research and practical adoption.

