GLCLAP: Global-Local Contrastive Audio-Language Model

Updated 30 December 2025

The paper demonstrates that integrating fine-grained supervision with a shared codebook and locality-aware encoder improves cross-modal alignment, evidenced by significant metric gains (e.g., +14.3 pts in grounding tasks).
The model employs a dual-tower design where audio and text features are jointly aggregated via a modality-shared codebook, ensuring robust semantic correspondence between fine-grained and global representations.
The architecture leverages a Hard-Negative Guided Loss and a Locality-Aware Block that retain temporal patterns, thereby enhancing performance across retrieval, classification, tagging, detection, and grounding tasks.

The Global-Local Contrastive Language-Audio Pre-trained Model (GLCLAP), specifically instantiated as MGA-CLAP, represents a significant progression in contrastive learning across audio and language modalities. GLCLAP aims to address limitations in prior models, such as CLAP, which primarily focus on coarse-grained, global alignment but often neglect fine-grained, frame-level correspondence. By integrating both coarse- and fine-grained supervision mechanisms, leveraging a shared codebook architecture, introducing locality preservation in the audio encoder, and employing a hard-negative guided loss for more robust alignment, MGA-CLAP achieves superior or competitive performance across a suite of zero-shot tasks, enhancing explainability and alignment fidelity in multimodal learning (Li et al., 2024).

1. Architectural Foundations and Representation Granularities

GLCLAP, as instantiated in MGA-CLAP, is architecturally defined by a dual-tower encoder system—one for audio and one for text—coordinated via three specialized modules:

Locality-Aware Encoder Block (LA-Block): Integrated into the last transformer block of the audio tower, this module refines frame-level features by removing query-key soft-attention and solely propagating value projections, preserving local (temporal) structure essential for fine-grained audio-text alignment.
Modality-Shared Codebook: Both global audio and text representations are aggregated via a shared set of learnable codewords, facilitating cross-modal bases and enforcing semantic correspondence.
Hard-Negative Guided Loss: Augments the vanilla contrastive loss with hard-negative reweighting, increasing the pressure to distinguish challenging (@@@@10@@@@ negative) audio-caption pairs within the same batch.

Local representations are captured as $P_i\in\mathbb{R}^{T\times D}$ for audio frames and $Q_i\in\mathbb{R}^{N\times D}$ for text tokens, while global representations $\tilde p_i, \tilde q_i\in\mathbb{R}^D$ for clips and captions are codebook-aggregated and serve as the basis for contrastive alignment.

2. Shared Codebook Aggregation and Semantic Alignment

The codebook consists of $M$ learnable vectors $\{z_k\in\mathbb{R}^D\}_{k=1}^M$ , trained end-to-end. For each codeword, affinity scores between codewords and local frame or word vectors are computed:

Audio Aggregation:

$s_{i,k}^{(a)} = \max_{1\leq j\leq T}\frac{\langle P_{i,j}, z_k \rangle}{\eta}, \quad w_{i,k}^{(a)} = \mathrm{Sparsemax}(s_{i,:,}^{(a)})_k,\quad \tilde p_i = \sum_{k=1}^M w_{i,k}^{(a)}z_k$

Text Aggregation:

$s_{i,k}^{(t)} = \max_{1\leq \ell\leq N}\frac{\langle Q_{i,\ell}, z_k \rangle}{\eta}, \quad w_{i,k}^{(t)} = \mathrm{Sparsemax}(s_{i,:,}^{(t)})_k,\quad \tilde q_i = \sum_{k} w_{i,k}^{(t)}z_k$

Sparsemax-induced sparsity ensures that only a small subset of codewords are activated per sample. During contrastive alignment, both modalities are compelled to “vote” for overlapping codeword activations, bridging the granularity gap and establishing implicit frame-to-word semantic alignment. Visualizations demonstrate that individual codewords correspond to discrete event categories (e.g., “dog barking”), with temporal activation only when the event occurs.

Transformer-based encoders traditionally diffuse contextual information globally, which can dilute precise temporal patterns vital for fine-grained event detection. The LA-Block addresses this by:

Removing the query-key attention from the last transformer block of the audio encoder, retaining only the value-projection:

$V = W_vU, \quad U' = V$

Conventional feed-forward, residual, and normalization operations follow.

This design maintains the v–v similarity structure within event boundaries, crucial for event localization, while earlier blocks retain global context. Ablation studies show that the LA-Block alone boosts fine-grained PSDS1 scores by approximately 8.1 points.

4. Hard-Negative Guided Contrastive Loss

Standard symmetric CLAP loss is applied over batchwise global representations:

$\mathcal L_{\rm CLAP}=-\sum_i\log\frac{e^{\langle\tilde p_i,\tilde q_i\rangle/\tau}}{\sum_j e^{\langle\tilde p_i,\tilde q_j\rangle/\tau}} \;\;-\;\;\sum_i\log\frac{e^{\langle\tilde q_i,\tilde p_i\rangle/\tau}}{\sum_j e^{\langle\tilde q_i,\tilde p_j\rangle/\tau}}$

To enhance discrimination, difficulty-based weights for hard negatives are computed:

$\alpha_{i,j} = \frac{B\,\exp(\gamma\,\langle\tilde p_i,\tilde q_j\rangle/\tau)}{\sum_{k=1}^{B}\exp(\gamma\,\langle\tilde p_i,\tilde q_k\rangle/\tau)}$

The loss becomes:

$\mathcal L_{\rm HN}=-\sum_i\log\frac{e^{\langle\tilde p_i,\tilde q_i\rangle/\tau}}{e^{\langle\tilde p_i,\tilde q_i\rangle/\tau}+\sum_{j\neq i}\alpha_{i,j}e^{\langle\tilde p_i,\tilde q_j\rangle/\tau}}$

Symmetry is maintained with analogous terms for text-to-audio direction.

Tuning $\gamma=0.15$ was found most effective; excessive values over-emphasize the most difficult negatives and degrade overall alignment.

5. Training Setup and Optimization

The aggregate objective combines the hard-negative contrastive loss and codeword-norm regularization:

$\mathcal L = \mathcal L_{\rm HN} + \lambda_{\rm reg}\sum_k(\|z_k\|_2^2-1)^2$

with $\lambda_{\rm reg}=0.01$ preventing codeword collapse. Sparsemax and contrastive pressure suffice; explicit quantization is unnecessary.

Key hyperparameters:

Batch size: 128
Learning rate: $5\times10^{-5}$
Epochs: 10 (on 450K audio–caption pairs)
Learnable $\tau$ (init 0.07) and codebook size $M=4096$
One LA-Block included (last transformer block replaced)
Pretrained audio encoder (AST or HTS-AT backbone)
Uniform LR schedule; early stopping via validation.

Model sizes: AST-based ≈ 200M parameters (110M text, 86M audio, plus codebook and heads); HTS-AT ≈ 140M parameters. Typical training epoch requires ~1.5 days on 8×A100 GPUs.

6. Empirical Performance and Task Coverage

GLCLAP demonstrates state-of-the-art or competitive results on eleven zero-shot benchmarks:

Task Type	Dataset(s)	Baseline CLAP	MGA-CLAP (HTS-AT)	Δ
Retrieval	AudioCaps R@1	39.7%	41.8%	+2.1 pts
Classification	VGGSound accuracy	28.6%	31.8%	+3.2 pts
Tagging	FSD50K mAP	52.4%	54.5%	+2.1 pts
Detection	DESED PSDS1	13.1%	26.4%	+13.3 pts
Grounding	TAG PSDSm	34.4%	48.7%	+14.3 pts

Ablations reveal:

Codebook alone boosts AudioCaps R@1 and DESED PSDS1 by 1.3 and 7.0 points, respectively.
LA-Block alone lifts PSDS1 by 8.1 points.
Combining all modules achieves full gains.

Codebook size $M=4096$ is optimal: smaller impairs detection, larger impairs retrieval. A single LA-Block replacement is optimal; further replacements reduce global context and degrade coarse-grained performance.

7. Limitations and Practical Implementation

Observed limitations include occasional confusions among acoustically similar events (e.g., blender vs. vacuum), persistent challenges in polyphonic scenes, and the requirement for task-specific fine-tuning for extremely short or long audio clips.

The training paradigm is scalable, but notable resource requirements persist (∼1.5 days training per epoch on 450K pairs, 8×A100 GPUs). The model structure allows for broad zero-shot applicability; however, complexities in optimizing the balance between codebook size and LA-Block deployments remain open for further research.

8. Significance and Theoretical Implications

By unifying global and local cross-modal representations via a shared codebook, preserving locality in frame features, and mining hard negatives in contrastive learning, GLCLAP exemplifies multi-grained alignment strategies for audio-language modeling. Individual codewords are shown to correlate directly with semantic event categories and establish frame-to-word localization, advancing explainability in alignment and supporting new paradigms in cross-modal retrieval, classification, tagging, detection, and grounding.

A plausible implication is that shared codebook-based architectures may be extensible to other multi-modal foundations, providing robust foundations for explainable, fine-grained alignment in future contrastive pre-training research (Li et al., 2024).

Markdown Report Issue Upgrade to Chat

References (1)

Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training (2024)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Global-Local Contrastive Language-Audio Pre-trained Model (GLCLAP).

GLCLAP: Global-Local Contrastive Audio-Language Model

1. Architectural Foundations and Representation Granularities

2. Shared Codebook Aggregation and Semantic Alignment

3. Locality Preservation and Refinement: The LA-Block

4. Hard-Negative Guided Contrastive Loss

5. Training Setup and Optimization

6. Empirical Performance and Task Coverage

7. Limitations and Practical Implementation

8. Significance and Theoretical Implications

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

GLCLAP: Global-Local Contrastive Audio-Language Model

1. Architectural Foundations and Representation Granularities

2. Shared Codebook Aggregation and Semantic Alignment

3. Locality Preservation and Refinement: The LA-Block

4. Hard-Negative Guided Contrastive Loss

5. Training Setup and Optimization

6. Empirical Performance and Task Coverage

7. Limitations and Practical Implementation

8. Significance and Theoretical Implications

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research