
AVT Compositional Fusion

Updated 6 February 2026
  • AVT Compositional Fusion is a transformer-based model that dynamically fuses audio, video, and textual embeddings using query-guided soft gating.
  • It employs separate encoders and a gated fusion transformer to selectively align modality-specific features for robust retrieval and zero-shot recognition.
  • Empirical results demonstrate significant improvements in retrieval accuracy and zero-shot performance compared to traditional averaging or concatenation methods.

AVT Compositional Fusion (AVT) encompasses a family of compositional transformer architectures designed for the integration of multimodal information. Originating in the context of audio-visual content retrieval and zero-shot image recognition, AVT mechanisms provide query- or attribute-guided compositional fusion by dynamically modulating the influence of modality- or attribute-specific embeddings according to high-level semantic cues. The method is characterized by selective alignment and fusion of visual, auditory, and structured linguistic features, enabling robust cross-modal reasoning and fine-grained retrieval or recognition in multi-component scenarios (Han et al., 30 Jan 2026, Chen et al., 2021).

1. Architectural Overview

In audio-visual retrieval, AVT Compositional Fusion is the final stage of the CoVA framework, operating atop a backbone that extracts and fuses video and audio information. The architecture comprises three main modules (Han et al., 30 Jan 2026):

  • Feature Extraction: Separate encoders for each modality—video frames via the CLIP image encoder (ViT-B/32), audio via an Audio Spectrogram Transformer (AST), and text queries via the CLIP text encoder.
  • Gated Fusion Transformer (GFT): Cross-attention transformer that integrates audio and visual streams, refining spatial-temporal representations.
  • AVT Compositional Fusion Module: Implements a query-aware gating mechanism over the fused audio-video embedding and a set of structured textual sub-embeddings (corresponding to object, action, attribute, and audio-modifier fields). The system produces a single $D$-dimensional retrieval vector $f_{avt}$ for nearest-neighbor search against a database of candidate video embeddings.

In the context of zero-shot learning within the TransZero++ architecture, AVT refers to an “Attribute→Visual Transformer” sub-module that composes semantic attribute information with CNN-extracted grid features, yielding semantic-augmented visual embeddings (Chen et al., 2021).

2. Modality- and Attribute-Specific Encoding

Video Encoding: Uniformly sampled video frames are passed through a frozen CLIP image encoder (ViT-B/32), yielding $N$ per-frame embeddings $f_n \in \mathbb{R}^D$ that are stacked into $f \in \mathbb{R}^{N \times D}$ (Han et al., 30 Jan 2026).

Audio Encoding: Raw audio waveforms are converted to log-Mel spectrograms and encoded using a frozen AST pre-trained on AudioSet/ImageNet. A token resampling mechanism produces $M$ audio token embeddings $a \in \mathbb{R}^{M \times D}$ (Han et al., 30 Jan 2026).

Text Encoding (Structured Query): Modification queries are structured into four distinct linguistic fields: object, action, attribute, and audio modifier. Each field is independently embedded by passing the token sequence through the CLIP text encoder and extracting the final [EOS] token output, giving $t_{\text{obj}}, t_{\text{act}}, t_{\text{att}}, t_{\text{audm}} \in \mathbb{R}^D$ (Han et al., 30 Jan 2026).

Attribute Embedding (ZSL): In TransZero++, attribute embeddings $v_a \in \mathbb{R}^d$ (e.g., GloVe-encoded) serve as queries to a cross-attention module, enabling attribute-specific region localization over CNN feature maps $U(x) \in \mathbb{R}^{C \times (HW)}$ (Chen et al., 2021).

3. Mathematical Formulation of Compositional Fusion

In CoVA’s AVT fusion, the approach is characterized by the following operations (Han et al., 30 Jan 2026):

  • The GFT operates as:

$$f^{(\ell+1)} = \mathrm{LayerNorm}\left(f^{(\ell)} + \mathrm{MultiHeadAttn}\left(Q=f^{(\ell)},\; K=a^{(\ell)},\; V=a^{(\ell)}\right)\right)$$

for $L$ layers, with mean-pooling over the $N$ refined visual tokens to yield $f_{av} \in \mathbb{R}^D$.

  • The five vectors $f_{av}$, $t_{\text{obj}}$, $t_{\text{act}}$, $t_{\text{att}}$, and $t_{\text{audm}}$ are concatenated and passed through an MLP, producing five scalar gates $w_i$ after sigmoid activation:

$$[\, f_{av} \,\|\, t_{\text{obj}} \,\|\, t_{\text{act}} \,\|\, t_{\text{att}} \,\|\, t_{\text{audm}} \,] \;\rightarrow\; \mathrm{MLP} \;\rightarrow\; \mathrm{sigmoid} \;\rightarrow\; (w_1, \dots, w_5)$$

$$f_{avt} = \sum_{i=1}^{5} w_i f_i$$

followed by L2 normalization. This yields a single retrieval query vector whose composition is dynamically conditioned on the structured text query.
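The gating computation above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: the function name avt_fusion is ours, and the MLP weights W1, W2 are random stand-ins for trained parameters.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x) + eps)

def avt_fusion(f_av, t_obj, t_act, t_att, t_audm, W1, b1, W2, b2):
    """Query-aware soft gating: concatenate the five D-dim vectors,
    run a small MLP, and squash its five outputs into gates w_i."""
    feats = [f_av, t_obj, t_act, t_att, t_audm]
    h = np.concatenate(feats)                          # (5D,)
    hidden = np.maximum(0.0, W1 @ h + b1)              # ReLU hidden layer
    w = 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))      # five gates in (0, 1)
    f_avt = sum(wi * fi for wi, fi in zip(w, feats))   # gated sum of the five vectors
    return l2_normalize(f_avt), w

# Toy usage with random weights (shapes only; D=8, hidden width 16).
rng = np.random.default_rng(0)
D, H = 8, 16
vecs = [rng.normal(size=D) for _ in range(5)]
W1, b1 = rng.normal(size=(H, 5 * D)), np.zeros(H)
W2, b2 = rng.normal(size=(5, H)), np.zeros(5)
f_avt, w = avt_fusion(*vecs, W1, b1, W2, b2)
```

Because the gates come from a sigmoid rather than an argmax, every modality keeps a nonzero (if small) contribution, which is what keeps the routing fully differentiable.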

In TransZero++, AVT operates as follows (Chen et al., 2021):

  • A Feature Augmentation Encoder (FAE) decorrelates grid features via geometry-bias subtraction in attention, producing $U_{\mathrm{aug}}(x)$.
  • The Attribute→Visual Decoder performs cross-attention between $A$ attribute query vectors and region features, resulting in per-attribute visual feature vectors $F_a$.
  • These are linearly mapped back to the attribute embedding space:

$$\psi(x) = \mathcal{V}_A^\top W_3 F$$

yielding $A$-dimensional semantic-augmented logits.
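The attribute-as-query cross-attention can be sketched in NumPy as a single head (a simplified reconstruction: the FAE, multi-head structure, and geometry-bias terms are omitted, and the projection matrices W_q, W_k are illustrative placeholders):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def avt_decoder(V_A, U, W_q, W_k, W3):
    """Single-head sketch of the Attribute->Visual Decoder.

    V_A : (d, A)   attribute embeddings (e.g. GloVe), used as queries
    U   : (C, HW)  grid features from the CNN backbone
    W_q : (dk, d), W_k : (dk, C)  illustrative projection matrices
    W3  : (d, C)   maps visual features back to the attribute space
    """
    Q = W_q @ V_A                                   # (dk, A) attribute queries
    K = W_k @ U                                     # (dk, HW) region keys
    attn = softmax(Q.T @ K / np.sqrt(Q.shape[0]))   # (A, HW) localization over regions
    F = U @ attn.T                                  # (C, A) per-attribute visual features
    # psi_a = v_a^T W3 F_a : one semantic-augmented logit per attribute
    psi = np.einsum('da,dc,ca->a', V_A, W3, F)
    return psi, attn

# Toy shapes: d=6 attribute dims, A=4 attributes, C=5 channels, HW=9 regions.
rng = np.random.default_rng(2)
V_A = rng.normal(size=(6, 4))
U = rng.normal(size=(5, 9))
psi, attn = avt_decoder(V_A, U, rng.normal(size=(7, 6)),
                        rng.normal(size=(7, 5)), rng.normal(size=(6, 5)))
```

Each row of attn is a distribution over the HW grid cells, which is what makes the per-attribute localization inspectable.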

4. Query-Guided and Attribute-Conditioned Modality Alignment

The AVT mechanism implements soft, query- or attribute-guided gating/routing of multimodal signals (Han et al., 30 Jan 2026, Chen et al., 2021):

  • In AVT Compositional Fusion, the gating MLP produces weights $w_i$ conditioned on the concatenated multimodal features, enabling "dynamic routing" of the query to the most relevant modalities. For example, if the text mentions only an audio modification, $w_{\mathrm{audm}}$ becomes large and the other $w_i$ diminish; the process is fully differentiable and realized by soft weights (no hard routing).
  • In TransZero++’s AVT, the use of attributes as queries in multi-head cross-attention enables spatially localized compositional fusion, aligning discriminative attribute regions to semantic attribute tokens.

This design supports cross-modal and attribute-based compositional reasoning and allows the adaptation of the global representation in response to the semantic structure of the input query or task.

5. Training Objectives and Optimization

CoVA AVT Fusion: Training employs a symmetric InfoNCE contrastive loss that aligns each composed query embedding $f_{avt}^q$ (from reference video and text) with its corresponding annotated target embedding $f_{av}^t$, while treating within-batch distractors as negatives. The loss (with temperature parameter $\tau$) is

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{2N} \sum_{i=1}^N \left[ \log \frac{\exp(q_i^\top t_i/\tau)}{\sum_{j=1}^N \exp(q_i^\top t_j/\tau)} + \log \frac{\exp(t_i^\top q_i/\tau)}{\sum_{j=1}^N \exp(t_i^\top q_j/\tau)} \right]$$

(Han et al., 30 Jan 2026).
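The symmetric InfoNCE objective above can be written compactly in NumPy. This is a self-contained sketch with random embeddings, not the CoVA training code:

```python
import numpy as np

def symmetric_info_nce(Q, T, tau=0.07):
    """Symmetric InfoNCE over a batch: Q, T are (N, D) query/target
    embeddings; each row's positive is the same-index row of the other
    matrix, and the remaining N-1 rows act as in-batch negatives."""
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)   # L2-normalize rows
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    logits = Q @ T.T / tau                             # (N, N) scaled similarities

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    idx = np.arange(len(Q))
    loss_q2t = -log_softmax(logits, axis=1)[idx, idx]  # query -> target direction
    loss_t2q = -log_softmax(logits, axis=0)[idx, idx]  # target -> query direction
    return 0.5 * (loss_q2t.mean() + loss_t2q.mean())

# Matched pairs should score a much lower loss than mismatched ones.
rng = np.random.default_rng(1)
Q = rng.normal(size=(16, 8))
loss_matched = symmetric_info_nce(Q, Q)
loss_mismatched = symmetric_info_nce(Q, np.roll(Q, 1, axis=0))
```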

TransZero++ AVT: The training objective is a weighted sum of attribute regression, attribute-based cross-entropy, self-calibration (for unseen-class generalization), and two semantic collaborative losses (feature- and prediction-alignment between AVT and its VAT sibling). The AVT-only loss is:

$$\mathcal{L}_{\rm AVT} = \mathcal{L}_{\rm ACE}^{\rm AVT} + \lambda_{\rm AR}\,\mathcal{L}_{\rm AR}^{\rm AVT} + \lambda_{\rm SC}\,\mathcal{L}_{\rm SC}^{\rm AVT}$$

with the full system trained by optimizing AVT, VAT, and collaborative terms jointly (Chen et al., 2021).

Optimization: In CoVA, all backbone encoders are kept frozen, and only the fusion (GFT, AVT) modules are trained using AdamW with batch size 256, learning rate $1\times 10^{-4}$, and a learnable temperature $\tau$ initialized to 0.07 (Han et al., 30 Jan 2026).

6. Empirical Evaluation and Ablation Analysis

CoVA/AV-Comp Benchmark Results: AVT Compositional Fusion demonstrates substantial gains in retrieval accuracy:

Modality Fusion               R@1     R@5     R@10    MnR
Text only                     19.7%   44.9%   60.5%   19.9
Video only                    21.5%   49.7%   65.3%   21.4
Audio only                     1.0%    1.8%    3.9%   542.8
Video+Audio (GFT)             22.3%   –       –       –
Video+Text (avg)              28.8%   –       –       –
Audio+Text (avg)              22.2%   –       –       –
GFT AV + avg Text             30.4%   –       –       –
GFT AV + AVT (full, CoVA)     31.4%   73.7%   86.4%   6.2
CoVA (end-to-end)             35.9%   73.7%   86.4%   6.2
ImageBind + GFT+AVT           20.2%   –       –       –
LanguageBind + GFT+AVT        27.2%   –       –       –

(R@K: recall at rank K, higher is better; MnR: mean rank, lower is better.)

Ablation experiments confirm that omitting any of the four textual components results in a consistent performance drop (e.g., without $t_{\text{obj}}$, R@1 decreases to 26.8%) (Han et al., 30 Jan 2026).

TransZero++: The AVT module, coupled with its VAT counterpart and collaborative learning, achieves state-of-the-art recognition accuracy on three standard zero-shot learning benchmarks. Attribute disentanglement and localized attention are highlighted as key contributors (Chen et al., 2021).

AVT Compositional Fusion distinguishes itself from prior average- or concatenation-based multimodal fusion by implementing query-conditioned, learnable soft gating over modality- or attribute-specific embeddings. Unlike hard routing, AVT employs a lightweight MLP to dynamically assign contribution weights within the fused representation, providing adaptability to query semantics and enhanced discriminability (Han et al., 30 Jan 2026). In zero-shot tasks, AVT’s cross-modal transformers excel by explicitly localizing attribute-based cues via geometry-disentangled attention (Chen et al., 2021).

Relative performance against alternative backbones (LanguageBind, ImageBind) and ablation studies underline the centrality of compositional modulation and structured representation in the AVT paradigm. A plausible implication is that such compositional soft-gating frameworks may generalize to other domains where query-driven or compositional retrieval and semantic zero-shot transfer are required.

