AVT Compositional Fusion
- AVT Compositional Fusion is a transformer-based model that dynamically fuses audio, video, and textual embeddings using query-guided soft gating.
- It employs separate encoders and a gated fusion transformer to selectively align modality-specific features for robust retrieval and zero-shot recognition.
- Empirical results demonstrate significant improvements in retrieval accuracy and zero-shot performance compared to traditional averaging or concatenation methods.
AVT Compositional Fusion (AVT) encompasses a family of compositional transformer architectures designed for the integration of multimodal information. Originating in the context of audio-visual content retrieval and zero-shot image recognition, AVT mechanisms provide query- or attribute-guided compositional fusion by dynamically modulating the influence of modality- or attribute-specific embeddings according to high-level semantic cues. The method is characterized by selective alignment and fusion of visual, auditory, and structured linguistic features, enabling robust cross-modal reasoning and fine-grained retrieval or recognition in multi-component scenarios (Han et al., 30 Jan 2026, Chen et al., 2021).
1. Architectural Overview
In audio-visual retrieval, AVT Compositional Fusion is the final stage of the CoVA framework, operating atop a backbone that extracts and fuses video and audio information. The architecture comprises three main modules (Han et al., 30 Jan 2026):
- Feature Extraction: Separate encoders for each modality—video frames via the CLIP image encoder (ViT-B/32), audio via an Audio Spectrogram Transformer (AST), and text queries via the CLIP text encoder.
- Gated Fusion Transformer (GFT): Cross-attention transformer that integrates audio and visual streams, refining spatial-temporal representations.
- AVT Compositional Fusion Module: Implements a query-aware gating mechanism over the fused audio-video embedding and a set of structured textual sub-embeddings (corresponding to object, action, attribute, and audio-modifier fields). The system produces a single retrieval vector in the shared embedding space for nearest-neighbor search against a database of candidate video embeddings.
In the context of zero-shot learning within the TransZero++ architecture, AVT refers to an “Attribute→Visual Transformer” sub-module that composes semantic attribute information with CNN-extracted grid features, yielding semantic-augmented visual embeddings (Chen et al., 2021).
2. Modality- and Attribute-Specific Encoding
Video Encoding: Uniformly sampled video frames are passed through a frozen CLIP image encoder (ViT-B/32), yielding per-frame embeddings that are stacked into a temporal sequence of visual tokens (Han et al., 30 Jan 2026).
Audio Encoding: Raw audio waveforms are converted to log-Mel spectrograms and encoded using a frozen AST pre-trained on AudioSet/ImageNet. A token resampling mechanism produces a fixed-length sequence of audio token embeddings (Han et al., 30 Jan 2026).
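The token resampling step can be sketched as linear interpolation along the token axis. The AST output length (101 tokens) and target length (8) below are illustrative choices, not values from the source:

```python
import numpy as np

def resample_tokens(tokens: np.ndarray, target_len: int) -> np.ndarray:
    """Linearly interpolate a (T, d) token sequence down to (target_len, d)."""
    T, d = tokens.shape
    src = np.linspace(0.0, T - 1.0, num=target_len)  # fractional source positions
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (src - lo)[:, None]                          # interpolation weights
    return (1.0 - w) * tokens[lo] + w * tokens[hi]

# Stand-in for a variable-length AST token sequence (101 tokens of width 768).
ast_out = np.random.default_rng(0).standard_normal((101, 768))
fixed = resample_tokens(ast_out, 8)
print(fixed.shape)  # (8, 768)
```

The endpoints of the resampled sequence coincide with the first and last source tokens, so no boundary information is dropped.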
Text Encoding (Structured Query): Modification queries are structured into four distinct linguistic fields: object, action, attribute, and audio modifier. Each field is independently embedded by passing its token sequence through the CLIP text encoder and extracting the final [EOS] token output, giving one text embedding per field (Han et al., 30 Jan 2026).
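A toy version of this four-field encoding, with a stub standing in for the CLIP text encoder; the embedding table, token ids, and field token sequences below are illustrative, while the per-field [EOS]-pooling pattern follows the description above:

```python
import numpy as np

d = 512
rng = np.random.default_rng(1)
table = rng.standard_normal((1000, d))  # stub token-embedding table

def clip_text_stub(tokens: list) -> np.ndarray:
    """Stand-in for the CLIP text encoder: look up per-token embeddings and
    return the output at the final [EOS] position, as described above."""
    per_token = table[np.asarray(tokens) % 1000]
    return per_token[-1]

# One structured query, split into the four fields (toy token ids; 2 = [EOS]).
query = {
    "object": [12, 7, 2],
    "action": [55, 2],
    "attribute": [80, 31, 2],
    "audio_mod": [91, 4, 2],
}
field_embeddings = {field: clip_text_stub(toks) for field, toks in query.items()}
print(sorted(field_embeddings), field_embeddings["object"].shape)
```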
Attribute Embedding (ZSL): In TransZero++, attributes (e.g., GloVe-encoded) serve as queries to a cross-attention module, enabling attribute-specific region localization over CNN feature maps (Chen et al., 2021).
3. Mathematical Formulation of Compositional Fusion
In CoVA’s AVT fusion, the approach is characterized by the following operations (Han et al., 30 Jan 2026):
- The GFT applies $L$ stacked layers in which visual tokens cross-attend over audio tokens,
$$V^{(\ell+1)} = V^{(\ell)} + \mathrm{CrossAttn}\big(V^{(\ell)}, A, A\big), \qquad \ell = 0, \dots, L-1,$$
with mean-pooling over the refined visual tokens to yield the fused audio-visual embedding $f_{av}$.
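A minimal numpy sketch of such a cross-attention stack, assuming single-head attention and illustrative token counts (the actual GFT is a trained multi-head transformer with learned projections):

```python
import numpy as np

def cross_attention(queries: np.ndarray, context: np.ndarray, scale: float) -> np.ndarray:
    """One cross-attention step: each query token attends over the context tokens."""
    scores = queries @ context.T * scale
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ context

rng = np.random.default_rng(0)
d, T, Na, L = 512, 8, 8, 2                        # illustrative sizes
V = rng.standard_normal((T, d))                   # visual frame tokens
A = rng.standard_normal((Na, d))                  # resampled audio tokens

for _ in range(L):                                # L stacked layers, residual updates
    V = V + cross_attention(V, A, scale=1.0 / np.sqrt(d))

f_av = V.mean(axis=0)                             # mean-pool refined visual tokens
print(f_av.shape)  # (512,)
```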
- All five vectors ($f_{av}$, $t_{obj}$, $t_{act}$, $t_{attr}$, $t_{aud}$) are concatenated and passed through an MLP, producing five scalar gates after sigmoid activation:
$$[g_{av}, g_{obj}, g_{act}, g_{attr}, g_{aud}] = \sigma\big(\mathrm{MLP}([f_{av}; t_{obj}; t_{act}; t_{attr}; t_{aud}])\big),$$
$$q = g_{av} f_{av} + g_{obj} t_{obj} + g_{act} t_{act} + g_{attr} t_{attr} + g_{aud} t_{aud},$$
followed by L2 normalization of $q$. This yields a single retrieval query vector whose composition is dynamically conditioned on the structured text query.
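The gating computation can be sketched directly; the MLP sizes and random weights below are placeholders for the trained module:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
f_av, t_obj, t_act, t_attr, t_aud = (rng.standard_normal(d) for _ in range(5))

# Gating MLP (illustrative sizes): concatenation of the five vectors -> hidden -> 5 logits.
W1 = rng.standard_normal((5 * d, 128)) * 0.02
W2 = rng.standard_normal((128, 5)) * 0.02

x = np.concatenate([f_av, t_obj, t_act, t_attr, t_aud])   # (2560,)
h = np.maximum(x @ W1, 0.0)                                # ReLU hidden layer
gates = 1.0 / (1.0 + np.exp(-(h @ W2)))                    # five sigmoid gates in (0, 1)

# Gated sum of the five component vectors, then L2 normalization.
components = np.stack([f_av, t_obj, t_act, t_attr, t_aud])  # (5, d)
q = gates @ components
q /= np.linalg.norm(q)
print(q.shape)  # (512,)
```

Because the gates are sigmoid outputs rather than a hard argmax, every component retains a differentiable path into the final query vector.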
In TransZero++, AVT operates as follows (Chen et al., 2021):
- A Feature Augmentation Encoder (FAE) decorrelates grid features via geometry-bias subtraction in attention, producing augmented region features.
- The Attribute→Visual Decoder performs cross-attention between attribute query vectors and region features, resulting in one visual feature vector per attribute.
- These are linearly mapped back to the attribute embedding space, yielding a vector of semantic-augmented logits with one entry per attribute.
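A sketch of attribute-as-query cross-attention under assumed dimensions (GloVe-300 attribute vectors, a 7×7 CNN grid with 2048 channels); the projection matrices are untrained placeholders, and the final linear map to one logit per attribute is a simplification of the decoder head:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_attr, R, d_vis = 6, 300, 49, 2048   # attributes, GloVe dim, 7x7 regions, CNN channels

attr_queries = rng.standard_normal((K, d_attr))   # e.g. GloVe-encoded attributes
regions = rng.standard_normal((R, d_vis))         # grid features (after the FAE)

# Project both sides into a shared attention width, attributes acting as queries.
dk = 64
Wq = rng.standard_normal((d_attr, dk)) * 0.02
Wk = rng.standard_normal((d_vis, dk)) * 0.02
Wv = rng.standard_normal((d_vis, dk)) * 0.02

scores = (attr_queries @ Wq) @ (regions @ Wk).T / np.sqrt(dk)
scores -= scores.max(axis=1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)     # per-attribute attention over regions
H = weights @ (regions @ Wv)                      # (K, dk) per-attribute visual features

# Linear map back toward the attribute space: one semantic-augmented logit per attribute.
Wout = rng.standard_normal((dk, 1)) * 0.02
logits = (H @ Wout).squeeze(-1)
print(logits.shape)  # (6,)
```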
4. Query-Guided and Attribute-Conditioned Modality Alignment
The AVT mechanism implements soft, query- or attribute-guided gating/routing of multimodal signals (Han et al., 30 Jan 2026, Chen et al., 2021):
- In AVT Compositional Fusion, the gating MLP produces weights conditioned on the concatenated multimodal features, enabling “dynamic routing” of the query to the most relevant modalities. For example, if the text mentions only an audio modification, the gate on the audio-modifier embedding grows large while the remaining gates diminish; the process is fully differentiable and realized by soft weights (no hard routing).
- In TransZero++’s AVT, the use of attributes as queries in multi-head cross-attention enables spatially localized compositional fusion, aligning discriminative attribute regions to semantic attribute tokens.
This design supports cross-modal and attribute-based compositional reasoning and allows the adaptation of the global representation in response to the semantic structure of the input query or task.
5. Training Objectives and Optimization
CoVA AVT Fusion: Training employs a symmetric InfoNCE contrastive loss that aligns each composed query embedding $q_i$ (from reference video and text) with its corresponding annotated target video embedding $v_i$, while treating within-batch distractors as negatives. The loss (with temperature parameter $\tau$ and batch size $B$) is
$$\mathcal{L} = -\frac{1}{2B}\sum_{i=1}^{B}\left[\log\frac{\exp(q_i^{\top} v_i/\tau)}{\sum_{j=1}^{B}\exp(q_i^{\top} v_j/\tau)} + \log\frac{\exp(q_i^{\top} v_i/\tau)}{\sum_{j=1}^{B}\exp(q_j^{\top} v_i/\tau)}\right].$$
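A reference implementation of a symmetric InfoNCE of this form, with in-batch negatives and a plain dot-product similarity on L2-normalized rows (a generic sketch, not the paper's training code):

```python
import numpy as np

def symmetric_infonce(Q: np.ndarray, V: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric InfoNCE over row-aligned query/target batches.

    Q[i] is a composed (reference video + text) query; V[i] is its target video.
    All other in-batch rows serve as negatives."""
    sims = Q @ V.T / tau                                   # (B, B) similarity logits
    B = sims.shape[0]
    labels = np.arange(B)                                  # the match for row i is column i

    def ce(logits):
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()                # cross-entropy on the diagonal

    return 0.5 * (ce(sims) + ce(sims.T))                   # query->video + video->query

rng = np.random.default_rng(0)
B, d = 4, 32
Q = rng.standard_normal((B, d))
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
loss_random = symmetric_infonce(Q, rng.standard_normal((B, d)))
loss_aligned = symmetric_infonce(Q, Q)                     # perfectly aligned pairs
print(loss_aligned < loss_random)
```

With perfectly aligned pairs the diagonal dominates at temperature 0.07, so the loss collapses toward zero, while random targets give a loss near $\log B$.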
TransZero++ AVT: The training objective is a weighted sum of attribute regression, attribute-based cross-entropy, self-calibration (for unseen-class generalization), and two semantic collaborative losses (feature- and prediction-alignment between AVT and its VAT sibling). The AVT-only loss combines the first three terms,
$$\mathcal{L}_{AVT} = \mathcal{L}_{reg} + \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{sc}\,\mathcal{L}_{sc},$$
with the full system trained by optimizing the AVT, VAT, and collaborative terms jointly (Chen et al., 2021).
Optimization: In CoVA, all backbone encoders are kept frozen, and only the fusion (GFT, AVT) modules are trained using AdamW with batch size 256 and a learnable temperature initialized to 0.07 (Han et al., 30 Jan 2026).
6. Empirical Evaluation and Ablation Analysis
CoVA/AV-Comp Benchmark Results: AVT Compositional Fusion demonstrates substantial gains in retrieval accuracy:
| Modality Fusion | R@1 | R@5 | R@10 | MnR |
|---|---|---|---|---|
| Text only | 19.7% | 44.9% | 60.5% | 19.9 |
| Video only | 21.5% | 49.7% | 65.3% | 21.4 |
| Audio only | 1.0% | 1.8% | 3.9% | 542.8 |
| Video+Audio (GFT) | 22.3% | — | — | — |
| Video+Text (avg) | 28.8% | — | — | — |
| Audio+Text (avg) | 22.2% | — | — | — |
| GFT AV + avg Text | 30.4% | — | — | — |
| GFT AV + AVT (full, CoVA) | 31.4% | 73.7% | 86.4% | 6.2 |
| CoVA (end-to-end) | 35.9% | 73.7% | 86.4% | 6.2 |
| ImageBind + GFT+AVT | 20.2% | — | — | — |
| LanguageBind + GFT+AVT | 27.2% | — | — | — |
Ablation experiments confirm that omitting any of the four textual components results in a consistent performance drop (e.g., removing one of the fields lowers R@1 to 26.8%) (Han et al., 30 Jan 2026).
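The table's metrics can be computed with a generic Recall@K / mean-rank routine (this is not the benchmark's official evaluation code; the synthetic queries below simply perturb their targets):

```python
import numpy as np

def retrieval_metrics(Q: np.ndarray, V: np.ndarray, ks=(1, 5, 10)):
    """Recall@K and mean rank (MnR) for row-aligned query/target embeddings."""
    sims = Q @ V.T                                    # cosine sims for L2-normalized rows
    order = np.argsort(-sims, axis=1)                 # candidates sorted by similarity
    # 1-based rank of the true target (row i's match is column i).
    ranks = np.argmax(order == np.arange(len(Q))[:, None], axis=1) + 1
    recalls = {k: float((ranks <= k).mean()) for k in ks}
    return recalls, float(ranks.mean())

rng = np.random.default_rng(0)
B, d = 100, 64
V = rng.standard_normal((B, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)
Q = V + 0.05 * rng.standard_normal((B, d))            # queries close to their targets
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
recalls, mnr = retrieval_metrics(Q, V)
print(recalls[1], mnr)
```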
TransZero++: The AVT module, coupled with its VAT counterpart and collaborative learning, achieves state-of-the-art recognition accuracy on three standard zero-shot learning benchmarks. Attribute disentanglement and localized attention are highlighted as key contributors (Chen et al., 2021).
7. Positioning and Comparison with Related Approaches
AVT Compositional Fusion distinguishes itself from prior average- or concatenation-based multimodal fusion by implementing query-conditioned, learnable soft gating over modality- or attribute-specific embeddings. Unlike hard routing, AVT employs a lightweight MLP to dynamically assign contribution weights within the fused representation, providing adaptability to query semantics and enhanced discriminability (Han et al., 30 Jan 2026). In zero-shot tasks, AVT’s cross-modal transformers excel by explicitly localizing attribute-based cues via geometry-disentangled attention (Chen et al., 2021).
Relative performance against alternative backbones (LanguageBind, ImageBind) and ablation studies underline the centrality of compositional modulation and structured representation in the AVT paradigm. A plausible implication is that such compositional soft-gating frameworks may generalize to other domains where query-driven or compositional retrieval and semantic zero-shot transfer are required.
Key References:
- "CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content" (Han et al., 30 Jan 2026)
- "TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning" (Chen et al., 2021)