OmniSIFT: Audio-Video Token Compression
- OmniSIFT is a modality-asymmetric token compression framework that compresses audio-video inputs using spatio-temporal saliency and cross-modal attention.
- It employs two key modules—Spatio-Temporal Video Pruning for salient visual feature selection and Vision-Guided Audio Selection to refine audio tokens.
- Empirical evaluations show that OmniSIFT reduces FLOPs and latency while maintaining or surpassing full-token model accuracy at high compression rates.
OmniSIFT is a modality-asymmetric token compression framework designed for efficient processing of audio-video inputs in omni-modal LLMs (Omni-LLMs). It addresses the computational bottleneck arising from extended multimodal token sequences by introducing an end-to-end trainable, two-stage selection pipeline that leverages the distinct statistical structures of video and audio streams to optimize context efficiency while preserving—and in some cases enhancing—downstream model performance (Ding et al., 4 Feb 2026).
1. Architectural Overview
OmniSIFT operates as an intermediary module between standard audio/video encoders and the Omni-LLM backbone. It processes input in small temporal chunks, each containing two consecutive video frames and a corresponding audio segment. For each chunk, let the visual patch token embeddings from frames 1 and 2 be $F_1, F_2 \in \mathbb{R}^{n_p \times d}$, where $n_p$ is the number of patches per frame, and let the audio token sequence be $Z_a \in \mathbb{R}^{n_a \times d}$.
The compression pipeline comprises:
- Spatio-Temporal Video Pruning (STVP): Selects salient video tokens using spatial and temporal saliency assessments. The output comprises pruned visual anchors $\hat{Z}_v$, where $|\hat{Z}_v| = 2\lfloor \alpha_v n_p \rfloor \ll 2 n_p$.
- Vision-Guided Audio Selection (VGAS): Applies lightweight cross-modal attention from audio tokens (queries) to the pruned visual anchors (keys/values) and scores audio tokens for selection. The output is the filtered set $\hat{Z}_a$, with $|\hat{Z}_a| = \lfloor \alpha_a n_a \rfloor$.
Both modules execute differentiable TopK token selection, supporting end-to-end optimization via a straight-through estimator.
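As a concrete sketch of the resulting per-chunk token budget (the patch and audio counts below are illustrative values, not figures from the paper):

```python
import math

def chunk_token_counts(n_p, n_a, alpha_v, alpha_a):
    """Tokens per chunk before and after OmniSIFT compression.

    A chunk holds two video frames (n_p patches each) plus n_a audio
    tokens; STVP keeps floor(alpha_v * n_p) patches per frame and VGAS
    keeps floor(alpha_a * n_a) audio tokens.
    """
    before = 2 * n_p + n_a
    after = 2 * math.floor(alpha_v * n_p) + math.floor(alpha_a * n_a)
    return before, after

# Hypothetical sizes: 196 patches/frame, 50 audio tokens, 35% retention.
before, after = chunk_token_counts(n_p=196, n_a=50, alpha_v=0.35, alpha_a=0.35)
print(before, after)  # 442 153
```

Because the floor is applied per frame and per modality, the realized retention can be slightly below the nominal ratio.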
2. Spatio-Temporal Video Pruning
STVP implements a two-pronged saliency measure:
- Spatial Saliency (within frame 1): For each patch embedding $F_1^{(i)}$, compute $s_1^{(i)} = 1 - \cos(F_1^{(i)}, \bar{F}_1)$, where $\bar{F}_1$ is the mean-pooled embedding across all patches in frame 1. This quantifies a region's distinctiveness relative to the global frame context.
- Temporal Saliency (across frames): For each patch location $i$, compute $s_2^{(i)} = 1 - \cos(F_2^{(i)}, F_1^{(i)})$. A high $s_2^{(i)}$ signifies strong temporal change.
- Token Selection: TopK scoring based on $s_1$ and $s_2$ retains the $\lfloor \alpha_v n_p \rfloor$ most salient regions per frame, yielding pruned sets $\hat{F}_1$ and $\hat{F}_2$. The selected video tokens are concatenated: $\hat{Z}_v = [\hat{F}_1; \hat{F}_2]$.
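The two saliency scores can be sketched in NumPy as follows; the helper names are illustrative, not taken from the paper's code:

```python
import numpy as np

def cosine_rows(a, b):
    """Row-wise cosine similarity between matrices a and b (broadcastable)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_n * b_n, axis=1)

def stvp_scores(F1, F2):
    """Spatial saliency s1 (distance to the frame-1 mean embedding) and
    temporal saliency s2 (distance between co-located patches)."""
    v1_mean = F1.mean(axis=0, keepdims=True)  # global frame-1 context
    s1 = 1.0 - cosine_rows(F1, v1_mean)
    s2 = 1.0 - cosine_rows(F2, F1)
    return s1, s2

# Toy check: a patch unchanged between frames has zero temporal saliency.
F1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
F2 = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
s1, s2 = stvp_scores(F1, F2)
print(np.round(s2, 3))  # first entry 0.0: no motion at that location
```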
3. Vision-Guided Audio Selection
VGAS employs pruned visual anchors to guide fine-grained audio token selection:
- Audio-Visual Cross-Attention: Audio queries $Q = W_q Z_a$ attend to visual keys/values $K = W_k \hat{Z}_v$, $V = W_v \hat{Z}_v$. The attended representations for the audio tokens are $H_a = \mathrm{softmax}(QK^\top / \sqrt{d})\, V$.
- Saliency Scoring: Each audio token is scored via $s_a = \sigma(\mathrm{MLP}(H_a))$, where $\sigma$ denotes the sigmoid activation and MLP is a multi-layer perceptron.
- Token Selection: TopK based on $s_a$ yields the $\lfloor \alpha_a n_a \rfloor$ retained audio tokens $\hat{Z}_a$.
Pseudocode for the per-chunk pipeline is as follows:
```
v1_mean = mean_pool(F1)
s1 = 1 - cosine(F1, v1_mean)          # spatial saliency
s2 = 1 - cosine(F2, F1)               # temporal saliency
keep = floor(alpha_v * n_p)
idx1 = TopK(s1, keep); F1_hat = F1[idx1]
idx2 = TopK(s2, keep); F2_hat = F2[idx2]
Z_v_hat = concat(F1_hat, F2_hat)      # pruned visual anchors

Q = W_q @ Z_a                         # audio queries
K = W_k @ Z_v_hat; V = W_v @ Z_v_hat  # visual keys/values
H_a = softmax(Q @ K.T / sqrt(d)) @ V  # cross-attended audio
s_a = sigmoid(MLP(H_a))               # audio saliency
keep_a = floor(alpha_a * n_a)
idx_a = TopK(s_a, keep_a)
Z_a_hat = Z_a[idx_a]                  # filtered audio tokens
```
4. End-to-End Optimization and Training
Token selection via TopK is inherently non-differentiable. OmniSIFT adopts a straight-through estimator: in the forward pass, a binary mask is generated to retain selected tokens; in the backward pass, surrogates permit gradient flow through the selection process. Loss signals from the LLM (either cross-entropy for QA/captioning or RL objectives) propagate into both the VGAS saliency scorer and STVP thresholds, enabling coordinated fine-tuning. For VGAS, supervised alignment uses 107K audio-visual caption pairs from the AVoCaDO SFT dataset.
5. Empirical Evaluation and Compression Impact
Benchmarked with Qwen2.5-Omni-7B and 3B backbones pre-aligned via modality projection, OmniSIFT was evaluated on:
- VideoMME (audio-video QA)
- DailyOmni (mixed-type QA)
- WorldSense (real-world QA)
- OmniVideoBench (open-ended QA)
- video-SALMONN-2 (GPT-judge captioning)
Baselines include OmniZip (modality-symmetric attention), DyCoke (independent compression), and random pruning.
| Method | WorldSense | OmniVideoBench | VideoMME (avg) | SALMONN-2 total |
|---|---|---|---|---|
| Full | 49.7 | 35.6 | 67.6 | 48.1 |
| OmniZip | 48.9 | 35.1 | 66.7 | 54.1 |
| DyCoke | 48.6 | 34.4 | 67.9 | 52.7 |
| OmniSIFT | 50.0 | 35.6 | 68.3 | 50.5 |
Notably, at 35% token retention, OmniSIFT achieves lower end-to-end latency (2.86 s vs. 4.94 s for the full-token model and 2.89 s for OmniZip), matches or improves GPU memory usage (22.91 GB), and reaches or exceeds full-token accuracy on core benchmarks. At even stricter compression (25% retention), OmniSIFT outperforms both baselines and the uncompressed model on 3 of 5 tasks.
In the WorldSense QA task, OmniSIFT reduced the total FLOPs by 54.9% at 25% context retention compared to the full-token approach.
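A back-of-the-envelope view of why the FLOPs saving need not equal the retention ratio: per-token LLM compute is roughly linear in context length, and only the multimodal portion of the context is compressed while text tokens are untouched. The multimodal fraction below is purely illustrative, chosen to reproduce the reported 54.9% figure, not a number from the paper:

```python
def flops_reduction(retention, multimodal_fraction):
    """Fractional FLOPs reduction under a linear-in-context cost model,
    where only the multimodal tokens are kept at `retention` and the
    rest of the context is unchanged."""
    kept = multimodal_fraction * retention + (1.0 - multimodal_fraction)
    return 1.0 - kept

# Illustrative: if ~73% of the context were audio-video tokens,
# 25% retention would give roughly the reported ~55% total reduction.
r = flops_reduction(retention=0.25, multimodal_fraction=0.732)
print(round(r, 3))  # 0.549
```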
6. Ablation Studies and Robustness
Ablative analyses confirm the contribution of each architectural component:
- Compression Ratios: When the video retention ratio $\alpha_v$ is varied, OmniSIFT's accuracy remains stable (49–50%), while OmniZip degrades below 48% at high compression. Similarly, with the video ratio held fixed, reducing the audio retention $\alpha_a$ leaves OmniSIFT robust, while OmniZip rapidly deteriorates.
- Structure Removal: Ablating spatial or temporal saliency leads to 2.3% and 1.8% accuracy drops, respectively; removing VGAS in favor of audio-only self-attention degrades performance by 2.9–3.9%.
- Selector Depth: Increasing VGAS selector depth from 1 to 3 layers shows negligible accuracy improvement and slightly increased memory use.
- Paradigm Comparison: The modality-asymmetric (visual→audio) arrangement yields 2–4% higher accuracy under strong compression than symmetric or audio-guided variants.
7. Strengths, Limitations, and Prospects
OmniSIFT introduces only 4.85M parameters (<0.1% of Qwen2.5-Omni-7B), remains compatible with FlashAttention and optimized inference kernels, and offers substantial computational and memory advantages (∼50% reduction in FLOPs, >40% in latency) without sacrificing accuracy.
Primary limitations include reliance on a fixed 2-frame temporal sliding window, potentially omitting longer-range dynamics, and dependency on supervised fine-tuning on an audio-visual caption dataset for optimal VGAS alignment. Future work may incorporate dynamic chunk sizing, adaptive frame pairings, block-wise spectrogram pruning for audio, integration with token-merging strategies, and extension of modality-asymmetric selection to other paired modalities (e.g., using text to guide vision selection).
Collectively, OmniSIFT establishes a scalable, end-to-end trainable approach for modality-asymmetric token compression, achieving efficient inference and state-of-the-art retention–performance tradeoffs in omni-modal language modeling (Ding et al., 4 Feb 2026).