
GLCLAP: Global-Local Contrastive Audio-Language Model

Updated 30 December 2025
  • The paper demonstrates that integrating fine-grained supervision with a shared codebook and locality-aware encoder improves cross-modal alignment, evidenced by significant metric gains (e.g., +14.3 pts in grounding tasks).
  • The model employs a dual-tower design where audio and text features are jointly aggregated via a modality-shared codebook, ensuring robust semantic correspondence between fine-grained and global representations.
  • The architecture leverages a Hard-Negative Guided Loss and a Locality-Aware Block that retain temporal patterns, thereby enhancing performance across retrieval, classification, tagging, detection, and grounding tasks.

The Global-Local Contrastive Language-Audio Pre-trained Model (GLCLAP), specifically instantiated as MGA-CLAP, represents a significant progression in contrastive learning across audio and language modalities. GLCLAP addresses limitations of prior models such as CLAP, which focus primarily on coarse-grained, global alignment and often neglect fine-grained, frame-level correspondence. By combining coarse- and fine-grained supervision, a shared codebook architecture, locality preservation in the audio encoder, and a hard-negative guided loss, MGA-CLAP achieves superior or competitive performance across a suite of zero-shot tasks while enhancing explainability and alignment fidelity in multimodal learning (Li et al., 2024).

1. Architectural Foundations and Representation Granularities

GLCLAP, as instantiated in MGA-CLAP, is architecturally defined by a dual-tower encoder system—one for audio and one for text—coordinated via three specialized modules:

  • Locality-Aware Encoder Block (LA-Block): Integrated into the last transformer block of the audio tower, this module refines frame-level features by removing query-key soft-attention and solely propagating value projections, preserving local (temporal) structure essential for fine-grained audio-text alignment.
  • Modality-Shared Codebook: Both global audio and text representations are aggregated via a shared set of learnable codewords, facilitating cross-modal bases and enforcing semantic correspondence.
  • Hard-Negative Guided Loss: Augments the vanilla contrastive loss with hard-negative reweighting, increasing the pressure to distinguish challenging negative audio-caption pairs within the same batch.

Local representations are captured as $P_i \in \mathbb{R}^{T\times D}$ for audio frames and $Q_i \in \mathbb{R}^{N\times D}$ for text tokens, while global representations $\tilde p_i, \tilde q_i \in \mathbb{R}^D$ for clips and captions are codebook-aggregated and serve as the basis for contrastive alignment.

2. Shared Codebook Aggregation and Semantic Alignment

The codebook consists of $M$ learnable vectors $\{z_k \in \mathbb{R}^D\}_{k=1}^{M}$, trained end-to-end. For each codeword, affinity scores against the local frame or word vectors are computed:

  • Audio Aggregation:

$$s_{i,k}^{(a)} = \max_{1\leq j\leq T}\frac{\langle P_{i,j},\, z_k\rangle}{\eta}, \qquad w_{i,k}^{(a)} = \mathrm{Sparsemax}\big(s_{i,:}^{(a)}\big)_k, \qquad \tilde p_i = \sum_{k=1}^{M} w_{i,k}^{(a)}\, z_k$$

  • Text Aggregation:

$$s_{i,k}^{(t)} = \max_{1\leq \ell\leq N}\frac{\langle Q_{i,\ell},\, z_k\rangle}{\eta}, \qquad w_{i,k}^{(t)} = \mathrm{Sparsemax}\big(s_{i,:}^{(t)}\big)_k, \qquad \tilde q_i = \sum_{k=1}^{M} w_{i,k}^{(t)}\, z_k$$

Sparsemax-induced sparsity ensures that only a small subset of codewords are activated per sample. During contrastive alignment, both modalities are compelled to “vote” for overlapping codeword activations, bridging the granularity gap and establishing implicit frame-to-word semantic alignment. Visualizations demonstrate that individual codewords correspond to discrete event categories (e.g., “dog barking”), with temporal activation only when the event occurs.
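The aggregation recipe above can be sketched in numpy. This is a minimal illustration, not the paper's implementation: the sparsemax projection follows the standard simplex-projection algorithm, and the shapes, function names, and temperature value are assumptions for the example.

```python
import numpy as np

def sparsemax(s):
    """Sparsemax: Euclidean projection of a score vector onto the
    probability simplex, yielding sparse (exactly-zero) weights."""
    z = np.sort(s)[::-1]                   # scores in descending order
    k = np.arange(1, len(z) + 1)
    cssv = np.cumsum(z)
    support = z * k > cssv - 1             # codewords that remain active
    k_max = k[support][-1]
    tau = (cssv[k_max - 1] - 1.0) / k_max  # threshold subtracted from scores
    return np.maximum(s - tau, 0.0)

def aggregate(local_feats, codebook, eta=0.1):
    """Codebook aggregation: max-pool codeword affinities over local
    vectors, sparsemax the pooled scores, then take the weighted sum of
    codewords. The same recipe serves audio frames P_i and text tokens Q_i."""
    sims = local_feats @ codebook.T / eta  # (T, M) frame-codeword affinities
    s = sims.max(axis=0)                   # max over frames -> (M,)
    w = sparsemax(s)                       # sparse weights over codewords
    return w @ codebook                    # global embedding in R^D

rng = np.random.default_rng(0)
P = rng.standard_normal((100, 16))         # T=100 frames, D=16
Z = rng.standard_normal((64, 16))          # M=64 codewords
g = aggregate(P, Z)
print(g.shape)                             # (16,)
```

Because sparsemax zeroes out most codewords, only a handful of entries of `w` are active per clip, which is what lets both modalities "vote" for the same small set of semantic bases.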

3. Locality Preservation and Refinement: The LA-Block

Transformer-based encoders traditionally diffuse contextual information globally, which can dilute precise temporal patterns vital for fine-grained event detection. The LA-Block addresses this by:

  • Removing the query-key attention from the last transformer block of the audio encoder, retaining only the value-projection:

$$V = W_v U, \qquad U' = V$$

  • Conventional feed-forward, residual, and normalization operations follow.

This design maintains the v–v similarity structure within event boundaries, crucial for event localization, while earlier blocks retain global context. Ablation studies show that the LA-Block alone boosts fine-grained PSDS1 scores by approximately 8.1 points.
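A toy numpy sketch of the value-only block may make the design concrete. The residual/normalization layout, weight names, and dimensions here are illustrative assumptions; only the removal of query-key attention comes from the source.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-frame layer normalization over the feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def la_block(U, W_v, W_o, W_ff1, W_ff2):
    """Locality-aware block: query-key soft-attention is dropped, so each
    frame propagates only its own value projection (U' = V) and no
    cross-frame mixing blurs event boundaries. Feed-forward, residual,
    and normalization follow as in a conventional transformer block."""
    V = U @ W_v                                        # value projection only
    U = layer_norm(U + V @ W_o)                        # residual + norm
    F = np.maximum(layer_norm(U) @ W_ff1, 0) @ W_ff2   # feed-forward (ReLU)
    return layer_norm(U + F)
```

Since every operation acts per-frame, permuting the input frames permutes the outputs identically, i.e., the block cannot leak information across time, which is exactly the locality property the LA-Block is designed to preserve.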

4. Hard-Negative Guided Contrastive Loss

Standard symmetric CLAP loss is applied over batchwise global representations:

$$\mathcal L_{\rm CLAP} = -\sum_i \log\frac{e^{\langle\tilde p_i,\tilde q_i\rangle/\tau}}{\sum_j e^{\langle\tilde p_i,\tilde q_j\rangle/\tau}} \;-\; \sum_i \log\frac{e^{\langle\tilde q_i,\tilde p_i\rangle/\tau}}{\sum_j e^{\langle\tilde q_i,\tilde p_j\rangle/\tau}}$$

To enhance discrimination, difficulty-based weights for hard negatives are computed:

$$\alpha_{i,j} = \frac{B\,\exp\!\big(\gamma\,\langle\tilde p_i,\tilde q_j\rangle/\tau\big)}{\sum_{k=1}^{B}\exp\!\big(\gamma\,\langle\tilde p_i,\tilde q_k\rangle/\tau\big)}$$

The loss becomes:

$$\mathcal L_{\rm HN} = -\sum_i \log\frac{e^{\langle\tilde p_i,\tilde q_i\rangle/\tau}}{e^{\langle\tilde p_i,\tilde q_i\rangle/\tau} + \sum_{j\neq i}\alpha_{i,j}\, e^{\langle\tilde p_i,\tilde q_j\rangle/\tau}}$$

Symmetry is maintained with analogous terms for text-to-audio direction.

Tuning $\gamma = 0.15$ was found most effective; larger values over-emphasize the most difficult negatives and degrade overall alignment.
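The audio-to-text direction of the hard-negative loss can be written directly from the two equations above. A small numpy sketch, assuming L2-normalized batch embeddings as inputs (the text-to-audio term is symmetric and omitted):

```python
import numpy as np

def hard_negative_loss(p, q, tau=0.07, gamma=0.15):
    """Hard-negative guided contrastive loss, audio->text direction.
    p, q: (B, D) L2-normalised global clip / caption embeddings."""
    B = p.shape[0]
    sim = p @ q.T / tau                            # (B, B) similarity logits
    w = np.exp(gamma * sim)
    alpha = B * w / w.sum(axis=1, keepdims=True)   # difficulty weights alpha_ij
    pos = np.exp(np.diag(sim))                     # positive-pair terms
    neg = alpha * np.exp(sim)                      # reweighted negatives
    np.fill_diagonal(neg, 0.0)                     # exclude j = i
    return -np.log(pos / (pos + neg.sum(axis=1))).sum()
```

Setting $\gamma = 0$ makes every $\alpha_{i,j} = 1$, so the expression collapses back to the vanilla CLAP term; increasing $\gamma$ shifts weight onto the negatives most similar to the anchor.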

5. Training Setup and Optimization

The aggregate objective combines the hard-negative contrastive loss and codeword-norm regularization:

$$\mathcal L = \mathcal L_{\rm HN} + \lambda_{\rm reg}\sum_k \big(\|z_k\|_2^2 - 1\big)^2$$

with $\lambda_{\rm reg} = 0.01$ preventing codeword collapse. Sparsemax and contrastive pressure suffice; explicit quantization is unnecessary.
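The regularizer term is a one-liner; a sketch under the same assumptions as the snippets above (codebook as an `(M, D)` array, hypothetical function name):

```python
import numpy as np

def codeword_norm_reg(Z, lam=0.01):
    """lam * sum_k (||z_k||_2^2 - 1)^2: pushes every codeword toward unit
    L2 norm so the codebook cannot collapse toward zero or blow up."""
    norms_sq = (Z ** 2).sum(axis=1)        # squared norm of each codeword
    return lam * ((norms_sq - 1.0) ** 2).sum()
```

For a codebook already on the unit sphere the penalty is exactly zero, so it only acts when codewords drift away from unit norm.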

Key hyperparameters:

  • Batch size: 128
  • Learning rate: $5\times10^{-5}$
  • Epochs: 10 (on 450K audio–caption pairs)
  • Learnable temperature $\tau$ (init 0.07) and codebook size $M = 4096$
  • One LA-Block included (last transformer block replaced)
  • Pretrained audio encoder (AST or HTS-AT backbone)
  • Uniform LR schedule; early stopping via validation.

Model sizes: AST-based ≈ 200M parameters (110M text, 86M audio, plus codebook and heads); HTS-AT ≈ 140M parameters. Typical training epoch requires ~1.5 days on 8×A100 GPUs.

6. Empirical Performance and Task Coverage

GLCLAP demonstrates state-of-the-art or competitive results on eleven zero-shot benchmarks:

| Task Type | Dataset (metric) | Baseline CLAP | MGA-CLAP (HTS-AT) | Δ |
|---|---|---|---|---|
| Retrieval | AudioCaps (R@1) | 39.7% | 41.8% | +2.1 pts |
| Classification | VGGSound (accuracy) | 28.6% | 31.8% | +3.2 pts |
| Tagging | FSD50K (mAP) | 52.4% | 54.5% | +2.1 pts |
| Detection | DESED (PSDS1) | 13.1% | 26.4% | +13.3 pts |
| Grounding | TAG (PSDSm) | 34.4% | 48.7% | +14.3 pts |

Ablations reveal:

  • Codebook alone boosts AudioCaps R@1 and DESED PSDS1 by 1.3 and 7.0 points, respectively.
  • LA-Block alone lifts PSDS1 by 8.1 points.
  • Combining all modules achieves full gains.

Codebook size $M = 4096$ is optimal: smaller codebooks impair detection, larger ones impair retrieval. A single LA-Block replacement is optimal; further replacements reduce global context and degrade coarse-grained performance.

7. Limitations and Practical Implementation

Observed limitations include occasional confusions among acoustically similar events (e.g., blender vs. vacuum), persistent challenges in polyphonic scenes, and the requirement for task-specific fine-tuning for extremely short or long audio clips.

The training paradigm is scalable, but notable resource requirements persist (∼1.5 days training per epoch on 450K pairs, 8×A100 GPUs). The model structure allows for broad zero-shot applicability; however, complexities in optimizing the balance between codebook size and LA-Block deployments remain open for further research.

8. Significance and Theoretical Implications

By unifying global and local cross-modal representations via a shared codebook, preserving locality in frame features, and mining hard negatives in contrastive learning, GLCLAP exemplifies multi-grained alignment strategies for audio-language modeling. Individual codewords are shown to correlate directly with semantic event categories and establish frame-to-word localization, advancing explainability in alignment and supporting new paradigms in cross-modal retrieval, classification, tagging, detection, and grounding.

A plausible implication is that shared codebook-based architectures may be extensible to other multi-modal foundations, providing robust foundations for explainable, fine-grained alignment in future contrastive pre-training research (Li et al., 2024).
