
OmniSIFT: Audio-Video Token Compression

Updated 6 February 2026
  • OmniSIFT is a modality-asymmetric token compression framework that optimizes audio-video inputs using spatio-temporal and cross-modal attention techniques.
  • It employs two key modules—Spatio-Temporal Video Pruning for salient visual feature selection and Vision-Guided Audio Selection to refine audio tokens.
  • Empirical evaluations show that OmniSIFT reduces FLOPs and latency while maintaining or surpassing full-token model accuracy at high compression rates.

OmniSIFT is a modality-asymmetric token compression framework designed for efficient processing of audio-video inputs in omni-modal LLMs (Omni-LLMs). It addresses the computational bottleneck arising from extended multimodal token sequences by introducing an end-to-end trainable, two-stage selection pipeline that leverages the distinct statistical structures of video and audio streams to optimize context efficiency while preserving—and in some cases enhancing—downstream model performance (Ding et al., 4 Feb 2026).

1. Architectural Overview

OmniSIFT operates as an intermediary module between standard audio/video encoders and the Omni-LLM backbone. It processes input in small temporal chunks, each containing two consecutive video frames and a corresponding audio segment. For each chunk $t$, let the visual patch token embeddings from frames 1 and 2 be $F_1^{(t)}, F_2^{(t)} \in \mathbb{R}^{n_p \times D}$, where $n_p$ is the number of patches per frame, and let the audio token sequence be $Z_a^{(t)} \in \mathbb{R}^{n_a \times D}$.

The compression pipeline comprises:

  • Spatio-Temporal Video Pruning (STVP): Selects salient video tokens using spatial and temporal saliency assessments. The output comprises pruned visual anchors $\hat{Z}_v^{(t)} \in \mathbb{R}^{\hat{n}_v \times D}$, where $\hat{n}_v = 2\lfloor \alpha_v n_p \rfloor$.
  • Vision-Guided Audio Selection (VGAS): Applies lightweight cross-modal attention from audio tokens (queries) to the pruned visual anchors (keys/values) and scores audio tokens for selection. The output is the filtered set $\hat{Z}_a^{(t)} \in \mathbb{R}^{\hat{n}_a \times D}$, with $\hat{n}_a = \lfloor \alpha_a n_a \rfloor$.
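
The resulting token budget is easy to work out by hand. A short calculation under assumed per-chunk sizes (the values of $n_p$, $n_a$, and the retention ratios below are illustrative, not taken from the paper):

```python
import math

# Illustrative per-chunk sizes (hypothetical values, not from the paper)
n_p = 196        # visual patches per frame
n_a = 50         # audio tokens per chunk
alpha_v = 0.35   # video retention ratio
alpha_a = 0.35   # audio retention ratio

# Retained counts, following n_v_hat = 2*floor(alpha_v * n_p)
# and n_a_hat = floor(alpha_a * n_a)
n_v_hat = 2 * math.floor(alpha_v * n_p)   # two frames per chunk
n_a_hat = math.floor(alpha_a * n_a)

full = 2 * n_p + n_a                      # tokens without compression
kept = n_v_hat + n_a_hat
print(n_v_hat, n_a_hat, round(kept / full, 3))  # → 136 17 0.346
```

Because of the floor operations, the effective retention can land slightly below the nominal ratio, as in this example.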

Both modules execute differentiable TopK token selection, supporting end-to-end optimization via a straight-through estimator.

2. Spatio-Temporal Video Pruning

STVP implements a two-pronged saliency measure:

  • Spatial Saliency (within frame 1): For each patch embedding $v_{1,i}$, compute $s_{1,i} = 1 - \cos(v_{1,i}, \bar{v}_1)$, where $\bar{v}_1$ is the mean-pooled embedding across all patches in frame 1. This quantifies a region's distinctiveness relative to the global frame context.
  • Temporal Saliency (across frames): For each patch location $i$, compute $s_{2,i} = 1 - \cos(v_{2,i}, v_{1,i})$. High $s_{2,i}$ signifies strong temporal change.
  • Token Selection: TopK scoring based on $s_{1,i}$ and $s_{2,i}$ retains the most salient regions, yielding pruned sets $\hat{F}_1$ and $\hat{F}_2$. The selected video tokens are concatenated: $\hat{Z}_v^{(t)} = [\hat{F}_1; \hat{F}_2]$.
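
The two saliency measures and the selection step can be sketched in NumPy as follows (shapes, helper names, and the use of plain `argsort` in place of a differentiable TopK are simplifying assumptions):

```python
import numpy as np

def cosine_rows(A, B):
    # Row-wise cosine similarity between matching rows of A and B
    num = (A * B).sum(axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1) + 1e-8
    return num / den

def stvp(F1, F2, alpha_v):
    """Spatio-Temporal Video Pruning for one chunk (sketch).
    F1, F2: (n_p, D) patch embeddings of two consecutive frames."""
    n_p = F1.shape[0]
    keep = int(np.floor(alpha_v * n_p))
    # Spatial saliency: distinctiveness vs. the mean-pooled frame context
    v1_mean = F1.mean(axis=0, keepdims=True)
    s1 = 1.0 - cosine_rows(F1, np.repeat(v1_mean, n_p, axis=0))
    # Temporal saliency: change at each patch location across frames
    s2 = 1.0 - cosine_rows(F2, F1)
    # Keep the highest-scoring patches from each frame, then concatenate
    idx1 = np.argsort(-s1)[:keep]
    idx2 = np.argsort(-s2)[:keep]
    return np.concatenate([F1[idx1], F2[idx2]], axis=0)  # (2*keep, D)
```

Note that the two criteria serve different roles: $s_1$ keeps patches that stand out within a frame, while $s_2$ keeps locations that change between frames.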

3. Vision-Guided Audio Selection

VGAS employs pruned visual anchors to guide fine-grained audio token selection:

  • Audio-Visual Cross-Attention: Audio queries $Q = Z_a W_q$ attend to visual keys/values $K = \hat{Z}_v W_k$, $V = \hat{Z}_v W_v$. The attended representation for each audio token is given by $H_a = \mathrm{softmax}(QK^\top/\sqrt{d})\, V$.
  • Saliency Scoring: Each audio token $j$ is scored as $s_{a,j} = \sigma(\mathrm{MLP}(h_{a,j}))$, where $\sigma$ denotes the sigmoid activation and $\mathrm{MLP}$ is a multi-layer perceptron.
  • Token Selection: TopK based on $s_{a,j}$ yields the $\hat{n}_a$ retained audio tokens $\hat{Z}_a$.
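
A minimal NumPy sketch of these three steps (single-head attention, randomly initialized projections, and a one-layer linear scorer standing in for the MLP are all simplifying assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vgas(Z_a, Z_v_hat, alpha_a, d=32, rng=np.random.default_rng(0)):
    """Vision-Guided Audio Selection for one chunk (sketch).
    Z_a: (n_a, D) audio tokens; Z_v_hat: (n_v_hat, D) pruned visual anchors."""
    D = Z_a.shape[1]
    # Hypothetical learned projections, randomly initialized here
    W_q, W_k, W_v = (rng.standard_normal((D, d)) / np.sqrt(D) for _ in range(3))
    w_mlp = rng.standard_normal(d) / np.sqrt(d)   # stand-in for the MLP scorer
    # Audio queries attend to pruned visual keys/values
    Q, K, V = Z_a @ W_q, Z_v_hat @ W_k, Z_v_hat @ W_v
    H_a = softmax(Q @ K.T / np.sqrt(d)) @ V
    # Sigmoid saliency score per audio token, then TopK selection
    s_a = 1.0 / (1.0 + np.exp(-(H_a @ w_mlp)))
    keep = int(np.floor(alpha_a * Z_a.shape[0]))
    idx = np.argsort(-s_a)[:keep]
    return Z_a[idx]
```

The asymmetry is the key design choice: the pruned visual anchors act as context for scoring audio, not the other way around.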

Pseudocode for the per-chunk pipeline is as follows:

v1_mean = mean_pool(F1)
s1 = 1 - cosine(F1, v1_mean)
s2 = 1 - cosine(F2, F1)
keep = floor(alpha_v * n_p)
idx1 = TopK(s1, keep);  F1_hat = F1[idx1]
idx2 = TopK(s2, keep);  F2_hat = F2[idx2]
Z_v_hat = concat(F1_hat, F2_hat)

Q = Z_a @ W_q
K = Z_v_hat @ W_k;  V = Z_v_hat @ W_v
H_a = softmax(Q @ K.T / sqrt(d)) @ V
s_a = sigmoid(MLP(H_a))
keep_a = floor(alpha_a * n_a)
idx_a = TopK(s_a, keep_a)
Z_a_hat = Z_a[idx_a]

4. End-to-End Optimization and Training

Token selection via TopK is inherently non-differentiable. OmniSIFT adopts a straight-through estimator: in the forward pass, a binary mask retains the selected tokens; in the backward pass, the surrogate $\frac{\partial m_j}{\partial s_{a,j}} \approx 1$ permits gradient flow through the selection step. Loss signals from the LLM (cross-entropy for QA/captioning, or RL objectives) propagate into both the VGAS saliency scorer and the STVP thresholds, enabling coordinated fine-tuning. For VGAS, supervised alignment uses 107K audio-visual caption pairs from the AVoCaDO SFT dataset.
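
The straight-through trick can be written out explicitly with forward and backward as separate functions (a NumPy sketch without autograd; a framework implementation would typically register this as a custom gradient):

```python
import numpy as np

def topk_mask_forward(scores, k):
    # Hard binary mask over the top-k scores (the non-differentiable step)
    mask = np.zeros_like(scores)
    mask[np.argsort(-scores)[:k]] = 1.0
    return mask

def topk_mask_backward(grad_mask):
    # Straight-through estimator: treat d(mask)/d(scores) ≈ 1, so the
    # upstream gradient flows to the scores unchanged
    return grad_mask

scores = np.array([0.9, 0.1, 0.7, 0.3])
mask = topk_mask_forward(scores, k=2)   # → [1., 0., 1., 0.]
grad_scores = topk_mask_backward(np.array([0.5, 0.5, 0.5, 0.5]))
```

The forward pass is exact hard selection; only the backward pass is approximated, which is what lets the saliency scorers receive a training signal at all.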

5. Empirical Evaluation and Compression Impact

Benchmarked with Qwen2.5-Omni-7B and 3B backbones pre-aligned via modality projection, OmniSIFT was evaluated on:

  • VideoMME (audio-video QA)
  • DailyOmni (mixed-type QA)
  • WorldSense (real-world QA)
  • OmniVideoBench (open-ended QA)
  • video-SALMONN-2 (GPT-judge captioning)

Baselines include OmniZip (modality-symmetric attention), DyCoke (independent compression), and random pruning.

Method      WorldSense   OmniVideoBench   VideoMME (avg)   SALMONN-2 total
Full        49.7         35.6             67.6             48.1
OmniZip     48.9         35.1             66.7             54.1
DyCoke      48.6         34.4             67.9             52.7
OmniSIFT    50.0         35.6             68.3             50.5

Notably, at 35% token retention, OmniSIFT achieves lower end-to-end latency (2.86s vs 4.94s for full, 2.89s for OmniZip), matches or exceeds GPU memory efficiency (22.91GB), and reaches or exceeds full-token accuracy in core benchmarks. At even stricter compression (25% retention), OmniSIFT outperforms both baselines and the uncompressed model on 3 out of 5 tasks.

In the WorldSense QA task, OmniSIFT reduced the total FLOPs by 54.9% at 25% context retention compared to the full-token approach.
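
These savings are consistent with the usual transformer cost structure, where per-layer FLOPs split between linear projections (linear in sequence length) and attention score computation (quadratic in it). A toy cost model with illustrative constants shows why attention-dominated stages benefit superlinearly from token reduction; end-to-end savings such as the 54.9% above are smaller than this toy number because fixed-cost components (the encoders, the selector itself) do not shrink:

```python
# Toy per-layer cost model: c_lin * n + c_attn * n^2
# (c_lin, c_attn, n_full are illustrative constants, not measured values)
def relative_flops(retention, c_lin=1.0, c_attn=0.5, n_full=1000):
    n = retention * n_full
    cost = c_lin * n + c_attn * n * n
    full = c_lin * n_full + c_attn * n_full * n_full
    return cost / full

print(round(relative_flops(0.25), 3))  # far below 0.25: quadratic term dominates
```

Under this toy model, 25% retention leaves well under 25% of the per-layer cost, so the measured total reduction is bounded mainly by the non-attention components.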

6. Ablation Studies and Robustness

Ablative analyses confirm the contribution of each architectural component:

  • Compression Ratios: When the video compression ratio $\rho_v$ is varied from $0.1$ to $0.9$, OmniSIFT's accuracy remains stable (49–50%), while OmniZip degrades below 48% at high compression. Similarly, at $\rho_v = 0.8$, reducing the audio retention $\rho_a$ maintains OmniSIFT's robustness, while OmniZip deteriorates rapidly.
  • Structure Removal: Ablating spatial or temporal saliency leads to accuracy drops of 2.3% and 1.8%, respectively; replacing VGAS with audio-only self-attention degrades performance by 2.9–3.9%.
  • Selector Depth: Increasing VGAS selector depth from 1 to 3 layers yields negligible accuracy improvement and slightly higher memory use.
  • Paradigm Comparison: The modality-asymmetric (visual→audio) arrangement yields 2–4% higher accuracy under strong compression than symmetric or audio-guided variants.

7. Strengths, Limitations, and Prospects

OmniSIFT introduces only 4.85M parameters (<0.1% of Qwen2.5-Omni-7B), remains compatible with FlashAttention and optimized inference kernels, and offers substantial computational and memory advantages (∼50% reduction in FLOPs, >40% in latency) without sacrificing accuracy.

Primary limitations include reliance on a fixed 2-frame temporal sliding window, potentially omitting longer-range dynamics, and dependency on supervised fine-tuning on an audio-visual caption dataset for optimal VGAS alignment. Future work may incorporate dynamic chunk sizing, adaptive frame pairings, block-wise spectrogram pruning for audio, integration with token-merging strategies, and extension of modality-asymmetric selection to other paired modalities (e.g., using text to guide vision selection).

Collectively, OmniSIFT establishes a scalable, end-to-end trainable approach for modality-asymmetric token compression, achieving efficient inference and state-of-the-art retention–performance tradeoffs in omni-modal language modeling (Ding et al., 4 Feb 2026).
