CLAP Pseudo-Labels in Audio-Text Modeling
- The paper introduces pseudo-labels from dual-encoder CLAP models that enhance supervised and semi-supervised learning across audio and multimodal tasks.
- CLAP pseudo-labels are generated by projecting audio and text into a shared embedding space using normalized cosine similarity for dense ranking and regression objectives.
- These pseudo-labels integrate via ranking, binary cross-entropy, and L2 distillation losses to drive significant performance improvements in classification, parsing, and masked audio modeling.
Contrastive Language-Audio Pretraining (CLAP) pseudo-labels are supervision signals derived from a pretrained dual-encoder CLAP model, designed to capture the semantic alignment between audio and natural language. Generated by projecting audio-text or audio-label pairs into a shared embedding space and quantifying their association via normalized cosine similarity, these pseudo-labels enable supervised, semi-supervised, and weakly supervised learning on unlabeled or sparsely labeled audio, video, and multimodal datasets. As CLAP models have become foundational for cross-modal representation and zero-shot transfer, their outputs now serve as "teachers," facilitating downstream tasks ranging from audio classification and masked audio modeling to semantic alignment and video parsing.
1. Mathematical Principles of CLAP Pseudo-Labeling
CLAP employs a dual-encoder architecture: an audio encoder $f_a$ and a text encoder $f_t$ (or a label-prompt encoder), both mapping inputs into a $d$-dimensional shared space. For audio input $x_a$ and textual input $x_t$, the encodings are $\mathbf{z}_a = f_a(x_a)$ and $\mathbf{z}_t = f_t(x_t)$. Both are $\ell_2$-normalized, $\hat{\mathbf{z}} = \mathbf{z} / \|\mathbf{z}\|_2$. The core pseudo-label is the cosine similarity $s(x_a, x_t) = \hat{\mathbf{z}}_a^{\top} \hat{\mathbf{z}}_t$. For multi-label or multi-class scenarios, the similarity can be computed against each class-prototype text embedding, enabling assignment of a vector of pseudo-labels per sample (Tsutsumi et al., 31 Jan 2026, Braun et al., 2024, Zhou et al., 2024).
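The normalize-then-dot-product step above can be sketched in NumPy, assuming raw encoder outputs are already available (the function and variable names are illustrative, not a specific CLAP API):

```python
import numpy as np

def l2_normalize(z, axis=-1, eps=1e-12):
    """Project embeddings onto the unit sphere."""
    return z / (np.linalg.norm(z, axis=axis, keepdims=True) + eps)

def clap_pseudo_labels(audio_emb, text_embs):
    """Cosine-similarity pseudo-labels for one audio clip against a
    bank of class-prototype text embeddings.

    audio_emb: (d,) raw audio-encoder output
    text_embs: (C, d) raw text-encoder outputs, one per class prompt
    returns:   (C,) vector of similarities in [-1, 1]
    """
    za = l2_normalize(audio_emb)
    zt = l2_normalize(text_embs)
    return zt @ za  # dot product of unit vectors == cosine similarity

# Toy example with random embeddings (d=8, C=3 hypothetical classes)
rng = np.random.default_rng(0)
scores = clap_pseudo_labels(rng.normal(size=8), rng.normal(size=(3, 8)))
```

Because both vectors are unit-normalized, the dot product is exactly the cosine similarity, so each entry of `scores` lies in [-1, 1].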
2. Pseudo-Label Generation Strategies
Audio-Text Alignment and Dense Ranking
For audio-text alignment (e.g., the XACLE Challenge), pseudo-labels are continuous similarity scores computed for both "positive" (matched) and "negative" (mismatched) pairs. Pseudo-labeling pipelines frequently involve:
- Assembling large pools of candidate pairs (e.g., 1M audio-text pairs, 50% matched and 50% random negatives).
- Using a CLAP variant fine-tuned on a small, human-aligned set (e.g., HumanCLAP-M2D) to generate a continuous similarity score for each pair, retaining the raw, order-preserving values (Tsutsumi et al., 31 Jan 2026).
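The pair-assembly and scoring steps above can be sketched as follows, assuming audio and caption embeddings are precomputed; the 50/50 matched/negative split mirrors the pipeline described, and all names are illustrative:

```python
import numpy as np

def assemble_pairs(audio_embs, text_embs, frac_negative=0.5, seed=0):
    """Build (audio_idx, text_idx, is_matched) candidate pairs:
    matched pairs keep the original alignment; negatives pair each
    audio with a randomly drawn non-matching caption."""
    rng = np.random.default_rng(seed)
    n = len(audio_embs)
    n_neg = int(n * frac_negative)
    idx = rng.permutation(n)
    pos_idx, neg_idx = idx[n_neg:], idx[:n_neg]
    pairs = [(int(i), int(i), True) for i in pos_idx]
    for i in neg_idx:
        j = int((i + rng.integers(1, n)) % n)  # guaranteed j != i
        pairs.append((int(i), j, False))
    return pairs

def score_pairs(pairs, audio_embs, text_embs):
    """Continuous CLAP-style pseudo-labels: raw cosine similarities,
    kept unthresholded so the teacher ordering is preserved."""
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return np.array([a[i] @ t[j] for i, j, _ in pairs])
```

Keeping the raw scores (rather than binarizing) is what allows the downstream listwise ranking objective to exploit the teacher's full ordering.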
Segment-wise Audio and Video Pseudo-Labels
In parsing tasks, audio or video streams are temporally segmented. For each segment $t$ and class $c$, CLAP yields a per-class similarity $s_{t,c} = \cos(\mathbf{z}_{a,t}, \mathbf{z}_c)$ against the class-prototype text embedding. Thresholding these scores at a similarity cutoff $\tau$ assigns sparse segment-wise binary pseudo-labels. Aggregation over segments recovers video-level pseudo-labels, and specialized loss functions exploit the "richness" (fractional occurrence of classes and segments) (Zhou et al., 2024).
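A minimal sketch of the segment-wise thresholding and aggregation described above, in NumPy; the cutoff `tau` is an illustrative value, not one reported in the paper:

```python
import numpy as np

def segment_pseudo_labels(seg_embs, class_embs, tau=0.5):
    """Segment-wise binary pseudo-labels via thresholded per-class
    cosine similarity.

    seg_embs:   (T, d) one embedding per temporal segment
    class_embs: (C, d) class-prototype text embeddings
    returns: (T, C) binary labels, (C,) video-level labels,
             (C,) per-class richness (fraction of active segments)
    """
    s = seg_embs / np.linalg.norm(seg_embs, axis=1, keepdims=True)
    c = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    sim = s @ c.T                             # (T, C) cosine similarities
    seg_labels = (sim > tau).astype(float)
    video_labels = seg_labels.max(axis=0)     # class present in any segment
    richness = seg_labels.mean(axis=0)        # fractional occurrence
    return seg_labels, video_labels, richness
```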
Patch-level Pseudo-Labels for Masked Audio Modeling
For masked audio modeling, CLAP pseudo-labels are derived by feeding the full unmasked spectrogram to the pretrained CLAP audio encoder, producing patchwise embeddings $\{\mathbf{c}_i\}$, then splitting these into masked and unmasked sets. Each masked-patch embedding serves as the regression target for the corresponding reconstructed feature $\hat{\mathbf{h}}_i$ from an audio MAE backbone, with the distillation loss $\mathcal{L}_{\text{distill}} = \frac{1}{|M|} \sum_{i \in M} \|\hat{\mathbf{h}}_i - \mathbf{c}_i\|_2^2$ over the masked set $M$. This replaces the standard spectrogram-MSE loss, leading to more semantically grounded representations (Xin et al., 2024).
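The patch-level distillation objective can be sketched as a masked mean-squared regression between student and frozen-teacher patch features (a simple sketch; array names are illustrative):

```python
import numpy as np

def clap_distill_loss(pred_patches, teacher_patches, mask):
    """Mean L2 regression loss between MAE-reconstructed patch features
    and frozen-CLAP patch embeddings, restricted to masked positions.

    pred_patches, teacher_patches: (N, d) patch feature arrays
    mask: (N,) boolean array, True where a patch was masked
    """
    diff = pred_patches[mask] - teacher_patches[mask]
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```

Restricting the loss to masked patches mirrors the MAE setup: the student only receives a learning signal where it had to reconstruct from context.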
3. Integration of CLAP Pseudo-Labels in Training Objectives
CLAP pseudo-labels are integrated via several objective functions, tailored to the nature of the downstream task:
a) Listwise ranking (ListNet): For tasks evaluated by rank correlation (e.g., Spearman's ρ), models minimize a cross-entropy between the model-predicted ordering and the teacher CLAP ordering: $\mathcal{L}_{\text{ListNet}} = -\sum_i \mathrm{softmax}(s^{\text{CLAP}})_i \log \mathrm{softmax}(s^{\text{model}})_i$
(Tsutsumi et al., 31 Jan 2026)
b) Binary cross-entropy on hard or soft pseudo-labels: Used for multi-label classification and video parsing. Hard pseudo-labels are assigned via thresholding; soft pseudo-labels can be retained as regression targets. The combined loss may weight real and pseudo-labels (Braun et al., 2024, Zhou et al., 2024).
c) L2 distillation loss: The model regresses to the CLAP embedding at a fine granularity (e.g., per patch in masked audio modeling), providing a semantic learning target (Xin et al., 2024).
d) Hybrid/auxiliary objectives: Segment-wise richness terms, label denoising via forward-loss flipping, and dual-branch multi-objective training (e.g., distillation + supervised classification) further regularize training and mitigate noise (Zhou et al., 2024, Xin et al., 2024).
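Objective (a) above can be sketched as a top-1 ListNet cross-entropy between teacher (CLAP) and student score distributions over one list of candidates; this NumPy version is illustrative (a training implementation would typically use an autodiff framework):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score list."""
    e = np.exp(x - x.max())
    return e / e.sum()

def listnet_loss(student_scores, teacher_scores):
    """Top-1 ListNet: cross-entropy between the teacher's and the
    student's softmax distributions over one candidate list. Minimized
    when the student reproduces the teacher's ordering and margins."""
    p_teacher = softmax(np.asarray(teacher_scores, dtype=float))
    p_student = softmax(np.asarray(student_scores, dtype=float))
    return float(-np.sum(p_teacher * np.log(p_student + 1e-12)))
```

Since cross-entropy $H(p, q) \ge H(p)$ with equality iff $q = p$, the loss is smallest when the student's scores match the teacher's, which is what drives rank-correlation metrics like Spearman's ρ upward.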
4. Empirical Effects and Ablation Studies
Explicit incorporation of CLAP pseudo-labels drives consistent and significant downstream improvements:
| Task | Pseudo-Label Application | Gain over Baseline | Reference |
|---|---|---|---|
| Audio-text alignment (XACLE) | ListNet over CLAP scores | SRCC: 0.352 → 0.598 (+0.246) | (Tsutsumi et al., 31 Jan 2026) |
| Audio classification (FSD50k) | BCE on CLAP pseudo-labels | mAP: 0.741 → 0.75; with label correction: 0.767 | (Braun et al., 2024) |
| Video parsing (LLP, audio event-F) | Segment CLAP pseudo-labels + richness | F: 51.3 → 55.7 (+4.4) | (Zhou et al., 2024) |
| Masked audio modeling (AudioSet-20K) | CLAP L2 distill vs. spec-MSE | mAP: 37.1 → 38.2 | (Xin et al., 2024) |
Pseudo-label ablations reveal that pretraining with CLAP pseudo-labels is the principal performance driver, often rendering additional supervised pretraining steps marginal in their impact (Tsutsumi et al., 31 Jan 2026, Xin et al., 2024).
5. Handling Noise and Label Quality
CLAP pseudo-labels are approximations to human judgments and often noisy, particularly outside domains precisely matching their pretraining data. Mitigation techniques include:
- Incorporation of negative samples to ensure dynamic range coverage.
- Focus on ranking or order (rather than absolute values), e.g., listwise ranking loss.
- Downstream fine-tuning with limited human labels.
- Self-label correction and masking: high-confidence contradictions between teacher pseudo-labels and available labels are resolved or masked (Braun et al., 2024).
- Label denoising via loss-based flipping, based on anomalously high forward loss statistics (Zhou et al., 2024).
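The loss-based flipping idea in the last bullet can be sketched as an outlier test on per-sample forward losses; the z-score criterion and threshold here are assumptions for illustration, not the exact statistic used by Zhou et al.:

```python
import numpy as np

def flip_noisy_labels(losses, labels, z_thresh=3.0):
    """Label-denoising sketch: flag binary pseudo-labels whose
    per-sample forward loss is anomalously high (z-score above
    z_thresh, an illustrative cutoff) and flip them.

    losses: (N,) forward loss per sample under the current model
    labels: (N,) binary pseudo-labels (0/1)
    returns: (corrected labels, boolean flip mask)
    """
    losses = np.asarray(losses, dtype=float)
    z = (losses - losses.mean()) / (losses.std() + 1e-12)
    flip = z > z_thresh
    out = np.asarray(labels).copy()
    out[flip] = 1 - out[flip]
    return out, flip
```

An alternative to flipping is masking: excluding the flagged samples from the loss entirely, as in the self-label correction scheme of Braun et al.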
6. Broad Applications of CLAP Pseudo-Labels
CLAP pseudo-labeling underpins a range of weakly and semi-supervised workflows:
- Large-scale audio-language modeling: As in the XACLE challenge, CLAP pseudo-labels bootstrap large audio-LLMs, enabling semantic alignment at scale with minimal human supervision (Tsutsumi et al., 31 Jan 2026).
- Weakly-supervised event localization and parsing: Segment-level CLAP pseudo-labels allow inference of temporal boundaries in audiovisual streams and fine-grained parsing of events (Zhou et al., 2024).
- Masked audio modeling: Patch-level pseudo-labeling enhances semantic abstraction and downstream discriminative power compared to classic reconstruction losses (Xin et al., 2024).
- Label enrichment in classification: CLAP pseudo-labels generated under a zero-shot regime supplement or clean weak or noisy labels, improving the effectiveness of small student models for mobile or low-resource applications (Braun et al., 2024).
- Federated and privacy-sensitive unsupervised learning: In a broader sense, the principles of CLAP pseudo-labeling are mirrored in federated anomaly detectionāthough the acronym here refers to a privacy-preserving collaborative learning protocol and not the language-audio model (Al-lahham et al., 2024).
7. Implementation Details and Practical Considerations
Implementation pipelines commonly involve:
- Offline pseudo-label computation: Batched embedding extraction and similarity calculation using a frozen CLAP teacher.
- Storage of continuous or thresholded pseudo-labels for integration with data loaders.
- Training objectives implemented in standard deep learning frameworks (e.g., ListNet loss for ranking in PyTorch).
- Hyperparameter tuning, including learning rates for pretraining with AdamW, batch sizes (e.g., 16–128), similarity thresholds for label assignment, and loss-weighting coefficients.
- Open-sourced codebases (e.g., https://github.com/shiotalab-tmu/tmu-xacle2026) facilitate reproducibility and adoption (Tsutsumi et al., 31 Jan 2026).
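The offline pseudo-label pass in the first two bullets can be sketched as follows; `embed_audio` and `embed_text` are stand-ins for a frozen CLAP teacher's encoders, and the output path is illustrative (not a specific codebase's API):

```python
import numpy as np

def precompute_pseudo_labels(embed_audio, embed_text, clips, captions,
                             batch_size=128, out_path="pseudo_labels.npy"):
    """Offline pseudo-label computation with a frozen CLAP teacher:
    batch the paired data, embed each side, score with cosine
    similarity, and store the continuous labels for later use in a
    data loader."""
    scores = []
    for i in range(0, len(clips), batch_size):
        za = embed_audio(clips[i:i + batch_size])
        zt = embed_text(captions[i:i + batch_size])
        za = za / np.linalg.norm(za, axis=1, keepdims=True)
        zt = zt / np.linalg.norm(zt, axis=1, keepdims=True)
        scores.append(np.sum(za * zt, axis=1))  # per-pair cosine similarity
    scores = np.concatenate(scores)
    np.save(out_path, scores)  # thresholding, if any, can happen at load time
    return scores
```

Computing the labels once offline keeps the (large) teacher out of the training loop; the student only ever sees the stored scores.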
The need to address pseudo-label noise, select appropriate architectures (e.g., ViT for masked audio modeling, LLMs with special heads for score regression), and align training objectives with evaluation metrics is emphasized throughout the literature.
CLAP pseudo-labels have emerged as a central paradigm for leveraging the open-set semantic knowledge of large-scale pretrained models, acting as both an alternative and complement to limited human annotation. Their integration into ranking, classification, and reconstruction frameworks has demonstrably advanced the benchmark state of the art across audio, multimodal, and weakly supervised learning tasks (Tsutsumi et al., 31 Jan 2026, Braun et al., 2024, Xin et al., 2024, Zhou et al., 2024).