OmniSIFT: Audio-Video Token Compression
- OmniSIFT is a modality-asymmetric token compression framework that compresses audio-video inputs using spatio-temporal saliency and cross-modal attention.
- It employs two key modules—Spatio-Temporal Video Pruning for salient visual feature selection and Vision-Guided Audio Selection to refine audio tokens.
- Empirical evaluations show that OmniSIFT reduces FLOPs and latency while maintaining or surpassing full-token model accuracy at high compression rates.
OmniSIFT is a modality-asymmetric token compression framework designed for efficient processing of audio-video inputs in omni-modal LLMs (Omni-LLMs). It addresses the computational bottleneck arising from extended multimodal token sequences by introducing an end-to-end trainable, two-stage selection pipeline that leverages the distinct statistical structures of video and audio streams to optimize context efficiency while preserving—and in some cases enhancing—downstream model performance (Ding et al., 4 Feb 2026).
1. Architectural Overview
OmniSIFT operates as an intermediary module between standard audio/video encoders and the Omni-LLM backbone. It processes input in small temporal chunks, each containing two consecutive video frames and a corresponding audio segment. For each chunk, let the visual patch token embeddings from frames 1 and 2 be $F_1, F_2 \in \mathbb{R}^{n_p \times d}$, where $n_p$ is the number of patches per frame, and let the audio token sequence be $Z_a \in \mathbb{R}^{n_a \times d}$.
The compression pipeline comprises:
- Spatio-Temporal Video Pruning (STVP): Selects salient video tokens using spatial and temporal saliency assessments. The output comprises pruned visual anchors $\hat{Z}_v$, where $|\hat{Z}_v| = 2\lfloor \alpha_v n_p \rfloor \ll 2 n_p$.
- Vision-Guided Audio Selection (VGAS): Applies lightweight cross-modal attention from audio tokens (queries) to the pruned visual anchors (keys/values) and scores audio tokens for selection. The output is the filtered set $\hat{Z}_a$, with $|\hat{Z}_a| = \lfloor \alpha_a n_a \rfloor$.
Both modules execute differentiable TopK token selection, supporting end-to-end optimization via a straight-through estimator.
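As a concrete sketch of the resulting per-chunk token budget (the patch and audio counts below are illustrative values, not figures from the paper):

```python
import math

def chunk_token_counts(n_p, n_a, alpha_v, alpha_a):
    """Tokens per chunk before and after OmniSIFT compression.

    A chunk holds two video frames (n_p patches each) plus n_a audio
    tokens; STVP keeps floor(alpha_v * n_p) patches per frame and VGAS
    keeps floor(alpha_a * n_a) audio tokens.
    """
    before = 2 * n_p + n_a
    after = 2 * math.floor(alpha_v * n_p) + math.floor(alpha_a * n_a)
    return before, after

# Hypothetical sizes: 196 patches/frame, 50 audio tokens, 35% retention.
before, after = chunk_token_counts(n_p=196, n_a=50, alpha_v=0.35, alpha_a=0.35)
print(before, after)  # 442 153
```

Because the floor is applied per frame and per modality, the realized retention can be slightly below the nominal ratio.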
2. Spatio-Temporal Video Pruning
STVP implements a two-pronged saliency measure:
- Spatial Saliency (within frame 1): For each patch embedding $F_1^{(i)}$, compute $s_1^{(i)} = 1 - \cos(F_1^{(i)}, \bar{F}_1)$, where $\bar{F}_1$ is the mean-pooled embedding across all patches in frame 1. This quantifies a region's distinctiveness relative to the global frame context.
- Temporal Saliency (across frames): For each patch location $i$, compute $s_2^{(i)} = 1 - \cos(F_2^{(i)}, F_1^{(i)})$. A high $s_2^{(i)}$ signifies strong temporal change.
- Token Selection: TopK scoring based on $s_1$ and $s_2$ retains the $\lfloor \alpha_v n_p \rfloor$ most salient regions per frame, yielding pruned sets $\hat{F}_1$ and $\hat{F}_2$. The selected video tokens are concatenated: $\hat{Z}_v = [\hat{F}_1; \hat{F}_2]$.
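The two saliency scores can be sketched in NumPy as follows; the helper names are illustrative, not taken from the paper's code:

```python
import numpy as np

def cosine_rows(a, b):
    """Row-wise cosine similarity between matrices a and b (broadcastable)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_n * b_n, axis=1)

def stvp_scores(F1, F2):
    """Spatial saliency s1 (distance to the frame-1 mean embedding) and
    temporal saliency s2 (distance between co-located patches)."""
    v1_mean = F1.mean(axis=0, keepdims=True)  # global frame-1 context
    s1 = 1.0 - cosine_rows(F1, v1_mean)
    s2 = 1.0 - cosine_rows(F2, F1)
    return s1, s2

# Toy check: a patch unchanged between frames has zero temporal saliency.
F1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
F2 = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
s1, s2 = stvp_scores(F1, F2)
print(np.round(s2, 3))  # first entry 0.0: no motion at that location
```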
3. Vision-Guided Audio Selection
VGAS employs pruned visual anchors to guide fine-grained audio token selection:
- Audio-Visual Cross-Attention: Audio queries $Q = W_q Z_a$ attend to visual keys/values $K = W_k \hat{Z}_v$, $V = W_v \hat{Z}_v$. The attended representations for the audio tokens are $H_a = \mathrm{softmax}(QK^\top / \sqrt{d})\, V$.
- Saliency Scoring: Each audio token is scored via $s_a = \sigma(\mathrm{MLP}(H_a))$, where $\sigma$ denotes the sigmoid activation and MLP is a multi-layer perceptron.
- Token Selection: TopK based on $s_a$ yields the $\lfloor \alpha_a n_a \rfloor$ retained audio tokens $\hat{Z}_a$.
Pseudocode for the per-chunk pipeline is as follows:
```
v1_mean = mean_pool(F1)
s1 = 1 - cosine(F1, v1_mean)          # spatial saliency
s2 = 1 - cosine(F2, F1)               # temporal saliency
keep = floor(alpha_v * n_p)
idx1 = TopK(s1, keep); F1_hat = F1[idx1]
idx2 = TopK(s2, keep); F2_hat = F2[idx2]
Z_v_hat = concat(F1_hat, F2_hat)      # pruned visual anchors

Q = W_q @ Z_a                         # audio queries
K = W_k @ Z_v_hat; V = W_v @ Z_v_hat  # visual keys/values
H_a = softmax(Q @ K.T / sqrt(d)) @ V  # cross-attended audio
s_a = sigmoid(MLP(H_a))               # audio saliency
keep_a = floor(alpha_a * n_a)
idx_a = TopK(s_a, keep_a)
Z_a_hat = Z_a[idx_a]                  # filtered audio tokens
```
4. End-to-End Optimization and Training
Token selection via TopK is inherently non-differentiable. OmniSIFT adopts a straight-through estimator: in the forward pass, a binary mask is generated to retain selected tokens; in the backward pass, surrogates permit gradient flow through the selection process. Loss signals from the LLM (either cross-entropy for QA/captioning or RL objectives) propagate into both the VGAS saliency scorer and STVP thresholds, enabling coordinated fine-tuning. For VGAS, supervised alignment uses 107K audio-visual caption pairs from the AVoCaDO SFT dataset.
5. Empirical Evaluation and Compression Impact
Benchmarked with Qwen2.5-Omni-7B and 3B backbones pre-aligned via modality projection, OmniSIFT was evaluated on:
- VideoMME (audio-video QA)
- DailyOmni (mixed-type QA)
- WorldSense (real-world QA)
- OmniVideoBench (open-ended QA)
- video-SALMONN-2 (GPT-judge captioning)
Baselines include OmniZip (modality-symmetric attention), DyCoke (independent compression), and random pruning.
| Method | WorldSense | OmniVideoBench | VideoMME (avg) | SALMONN-2 total |
|---|---|---|---|---|
| Full | 49.7 | 35.6 | 67.6 | 48.1 |
| OmniZip | 48.9 | 35.1 | 66.7 | 54.1 |
| DyCoke | 48.6 | 34.4 | 67.9 | 52.7 |
| OmniSIFT | 50.0 | 35.6 | 68.3 | 50.5 |
Notably, at 35% token retention, OmniSIFT achieves lower end-to-end latency (2.86 s vs. 4.94 s for the full-token model and 2.89 s for OmniZip), matches or improves GPU memory usage (22.91 GB), and reaches or exceeds full-token accuracy on core benchmarks. At even stricter compression (25% retention), OmniSIFT outperforms both baselines and the uncompressed model on 3 of 5 tasks.
In the WorldSense QA task, OmniSIFT reduced the total FLOPs by 54.9% at 25% context retention compared to the full-token approach.
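A back-of-the-envelope view of why the FLOPs saving need not equal the retention ratio: per-token LLM compute is roughly linear in context length, and only the multimodal portion of the context is compressed while text tokens are untouched. The multimodal fraction below is purely illustrative, chosen to reproduce the reported 54.9% figure, not a number from the paper:

```python
def flops_reduction(retention, multimodal_fraction):
    """Fractional FLOPs reduction under a linear-in-context cost model,
    where only the multimodal tokens are kept at `retention` and the
    rest of the context is unchanged."""
    kept = multimodal_fraction * retention + (1.0 - multimodal_fraction)
    return 1.0 - kept

# Illustrative: if ~73% of the context were audio-video tokens,
# 25% retention would give roughly the reported ~55% total reduction.
r = flops_reduction(retention=0.25, multimodal_fraction=0.732)
print(round(r, 3))  # 0.549
```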
6. Ablation Studies and Robustness
Ablative analyses confirm the contribution of each architectural component:
- Compression Ratios: When the video retention ratio $\alpha_v$ is varied, OmniSIFT's accuracy remains stable (49–50%), while OmniZip degrades below 48% at high compression. Similarly, with the video ratio held fixed, reducing the audio retention $\alpha_a$ leaves OmniSIFT robust, while OmniZip rapidly deteriorates.
- Structure Removal: Ablating spatial or temporal saliency leads to 2.3% and 1.8% accuracy drops, respectively; removing VGAS in favor of audio-only self-attention degrades performance by 2.9–3.9%.
- Selector Depth: Increasing VGAS selector depth from 1 to 3 layers shows negligible accuracy improvement and slightly increased memory use.
- Paradigm Comparison: The modality-asymmetric (visual→audio) arrangement yields 2–4% higher accuracy under strong compression than symmetric or audio-guided variants.
7. Strengths, Limitations, and Prospects
OmniSIFT introduces only 4.85M parameters (<0.1% of Qwen2.5-Omni-7B), remains compatible with FlashAttention and optimized inference kernels, and offers substantial computational and memory advantages (∼50% reduction in FLOPs, >40% in latency) without sacrificing accuracy.
Primary limitations include reliance on a fixed 2-frame temporal sliding window, potentially omitting longer-range dynamics, and dependency on supervised fine-tuning on an audio-visual caption dataset for optimal VGAS alignment. Future work may incorporate dynamic chunk sizing, adaptive frame pairings, block-wise spectrogram pruning for audio, integration with token-merging strategies, and extension of modality-asymmetric selection to other paired modalities (e.g., using text to guide vision selection).
Collectively, OmniSIFT establishes a scalable, end-to-end trainable approach for modality-asymmetric token compression, achieving efficient inference and state-of-the-art retention–performance tradeoffs in omni-modal language modeling (Ding et al., 4 Feb 2026).