
VidCom²: Plug-and-play Adaptive Video Compression

Updated 12 February 2026
  • VidCom² is a video compression framework that adaptively assigns token retention per frame based on uniqueness to reduce computational overhead.
  • It employs efficient vector operations and rank statistics to compress tokens while preserving essential visual information, achieving up to 70.8% latency reduction.
  • VidCom² integrates seamlessly with modern VideoLLMs without retraining, ensuring compatibility with optimized attention operators like FlashAttention-2.

Video Compression Commander (VidCom²) is a plug-and-play, training-free inference acceleration framework that addresses the efficiency bottlenecks of Video LLMs (VideoLLMs) arising from dense visual token streams. By adaptively compressing visual tokens based on frame-level uniqueness, VidCom² preserves essential information while achieving substantial reductions in computational cost and latency. It is explicitly designed for compatibility with a wide range of modern VideoLLM architectures and efficient attention operators, requiring no retraining or architectural modifications (Liu et al., 20 May 2025).

1. Motivation and Context

VideoLLMs, such as LLaVA-OneVision, LLaVA-Video, and Qwen2-VL, ingest sequences of T frames, where each frame produces M visual tokens. The resulting tensor has T·M tokens, and downstream self- and cross-attention operations have time and space complexity O((T·M)^2). This quadratic scaling renders inference prohibitively expensive, particularly for long video sequences or models with large M. Existing token compression approaches prior to VidCom² either:

  • Depend on [CLS] tokens (pre-LLM compression, e.g., FasterVLM, MUSTDrop, FiCoCo), rendering them incompatible with modern SigLIP-based encoders lacking a [CLS] token.
  • Require explicit inspection of LLM attention matrices (intra-LLM compression, e.g., FastV, PDrop, SparseVLM), which is incompatible with optimized attention kernels such as FlashAttention and incurs significant memory overhead.
  • Apply fixed-window, uniform compression budgets regardless of per-frame informativeness (e.g., DyCoke), missing opportunities to preserve uniquely informative frames, particularly under aggressive token reduction regimes (Liu et al., 20 May 2025).

2. Foundational Design Principles

VidCom² is guided by three core principles:

  1. Model Adaptability: The method must be plug-and-play on arbitrary VideoLLM pipelines (including SigLIP-based ViT→MLP→LLM stacks) with no retraining or architecture changes.
  2. Frame Uniqueness: The allocation of compression budgets across frames must be adaptive, favoring frames with higher relative uniqueness in representing salient content.
  3. Operator Compatibility: The method must avoid any reliance on [CLS] tokens or explicit LLM attention weights, maintaining compatibility with optimized attention implementations (e.g. FlashAttention-2), and must not increase peak memory usage (Liu et al., 20 May 2025).

3. Algorithmic Architecture

VidCom² operates in two stages—Frame Compression Adjustment and Adaptive Token Compression—requiring only vector operations and rank statistics:

3.1 Quantifying Frame Uniqueness

Let X^v = \{x_{t,m}^v \in \mathbb{R}^{D'}\} denote the T × M visual tokens generated by the video encoder.

  1. Global video representation:

\mathbf{g}_v = \frac{1}{TM} \sum_{t=1}^T \sum_{m=1}^M x_{t,m}^v,

where g_v ∈ ℝ^{D'}.

  2. Token-to-video similarity (cosine) and uniqueness:

s^{\mathrm{video}}_{t,m} = \frac{x_{t,m}^v \cdot \mathbf{g}_v}{\|x_{t,m}^v\| \, \|\mathbf{g}_v\|}, \qquad u^{\mathrm{video}}_{t,m} = -s^{\mathrm{video}}_{t,m}

Tokens with higher u^{video}_{t,m} are more unique relative to the video as a whole.

  3. Frame uniqueness score:

u_t = \frac{1}{M} \sum_{m=1}^M u^{\mathrm{video}}_{t,m}

Higher u_t indicates a more distinctive frame.
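
The frame-uniqueness computation above reduces to a few vector operations. A minimal NumPy sketch, where the (T, M, D) tensor layout, the function name, and the small norm stabilizer are illustrative assumptions rather than details from the paper:

```python
import numpy as np

def frame_uniqueness(X):
    """Frame uniqueness per Sec. 3.1.

    X: visual tokens, shape (T, M, D).
    Returns token-to-video uniqueness u_video (T, M) and
    frame scores u_t (T,).
    """
    g_v = X.mean(axis=(0, 1))                # global video representation
    num = X @ g_v                            # dot products, shape (T, M)
    denom = np.linalg.norm(X, axis=-1) * np.linalg.norm(g_v) + 1e-12
    u_video = -num / denom                   # negative cosine similarity
    u_t = u_video.mean(axis=1)               # mean token uniqueness per frame
    return u_video, u_t
```

Negating the cosine similarity means tokens far from the global mean direction score highest, matching the uniqueness convention above.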

3.2 Adaptive Compression Assignment

  1. Frame scores are mapped to compression ratios via softmax normalization (temperature τ = 0.01):

\tilde{u}_t = \frac{u_t - \max_k u_k}{\tau}, \qquad \sigma_t = \frac{\exp(\tilde{u}_t)}{\sum_{\ell=1}^T \exp(\tilde{u}_\ell) + \epsilon}

with ε = 10^{-8} for numerical stability.

  2. Per-frame retention ratio:

r_t = R \cdot \left(1 + \sigma_t - \tfrac{1}{T}\right),

where R is the global retention ratio; (1/T)·Σ_t r_t = R by construction.

  3. Algorithmic pseudocode (Frame Compression Adjustment):

    for t = 1..T:
        u_t ← (1/M) * Σ_{m=1}^{M} (−cosine(x_{t,m}, g_v))
    normalize:
        ũ_t ← (u_t − max_k u_k) / τ
        σ_t ← exp(ũ_t) / (Σ_k exp(ũ_k) + ε)
        r_t ← R · (1 + σ_t − 1/T)

3.3 Token Selection Within Frames

Given r_t, select k_t = ⌈r_t · M⌉ tokens per frame via combined video- and frame-level uniqueness:

  1. Frame-global representation:

\mathbf{g}_{f,t} = \frac{1}{M} \sum_{m=1}^M x_{t,m}^v

  2. Token-to-frame similarity and uniqueness:

s_{t,m}^{\mathrm{frame}} = \frac{x_{t,m}^v \cdot \mathbf{g}_{f,t}}{\|x_{t,m}^v\| \, \|\mathbf{g}_{f,t}\|}, \qquad u_{t,m}^{\mathrm{frame}} = -s_{t,m}^{\mathrm{frame}}

  3. Combined token score:

u_{t,m} = u_{t,m}^{\mathrm{video}} + u_{t,m}^{\mathrm{frame}}

Retain the top k_t tokens in each frame by u_{t,m} ranking (Liu et al., 20 May 2025).
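
Putting the selection rule together, a hedged NumPy sketch; the (T, M, D) layout, the function name, and returning per-frame index lists are illustrative choices, not details fixed by the paper:

```python
import numpy as np

def select_tokens(X, u_video, r_t):
    """Per-frame top-k selection by combined uniqueness (Sec. 3.3).

    X: tokens (T, M, D); u_video: token-to-video uniqueness (T, M);
    r_t: retention ratios (T,). Returns kept token indices per frame.
    """
    T, M, _ = X.shape
    g_f = X.mean(axis=1, keepdims=True)           # frame-global reps, (T, 1, D)
    num = (X * g_f).sum(axis=-1)                  # dot products, (T, M)
    denom = np.linalg.norm(X, axis=-1) * np.linalg.norm(g_f, axis=-1) + 1e-12
    u_frame = -num / denom                        # token-to-frame uniqueness
    u = u_video + u_frame                         # combined token score
    kept = []
    for t in range(T):
        k = int(np.ceil(r_t[t] * M))              # k_t = ceil(r_t * M)
        kept.append(np.sort(np.argsort(-u[t])[:k]))   # top-k, original order
    return kept
```

Sorting the kept indices back into ascending order preserves each frame's original token ordering for the LLM input.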

4. System Integration and Operator Compatibility

VidCom² is positioned between the video encoder (ViT→MLP) and the LLM. Its operations consist solely of pooling, cosine similarities, softmax normalization, and top-k selection. It never requires explicit inspection of LLM attention or [CLS] tokens, ensuring maximal compatibility with highly efficient operator kernels such as FlashAttention-2. Peak memory footprint is not increased, as all computation occurs on encoder outputs prior to LLM input (Liu et al., 20 May 2025).

Furthermore, the Frame Compression Adjustment strategy can function as a modular "plug-in" for other token selection schemes (e.g., FastV, SparseVLM), replacing the uniform per-frame token budget with the adaptive {r_t} allocation.

5. Empirical Performance

Evaluation is conducted across VideoLLMs (LLaVA-OneVision 7B, LLaVA-Video, Qwen2-VL) and video understanding benchmarks including MVBench, LongVideoBench, MLVU, and VideoMME (S/M/L) via LMMs-Eval. Key findings:

| Method  | Retention | LLaVA-OV-7B Avg | % of Original | LLM Gen Latency Reduction | Throughput |
|---------|-----------|-----------------|---------------|---------------------------|------------|
| Vanilla | 100%      | 56.9            | 100.0%        | --                        | 0.64×      |
| VidCom² | 25%       | 57.2            | 99.6%         | −70.8%                    | 1.38×      |
| DyCoke  | 30%       | 56.6            | 96.5%         | --                        | --         |
| VidCom² | 15%       | --              | 95.1%         | --                        | --         |

For Qwen2-VL on long VideoMME, VidCom² at 25% retention attains 101.2% of reference performance, exceeding DyCoke (93.6%) and SparseVLM (96.6%).

Token scoring that combines frame and video uniqueness outperforms either approach alone. Averaging token uniqueness is preferred for utu_t, and the Frame Compression Adjustment module consistently improves the robustness of other token compression schemes. Complexity overhead is low: on the MVBench set, the end-to-end scoring for all tokens adds only 2.5 s, equivalent to 1.3% of LLM generation time, while reducing overall LLM generation latency by 70.8% and total model latency by 43.0%, with no increase in peak memory (Liu et al., 20 May 2025).

6. Complexity, Scalability, and Practical Implementation

  • Algorithmic cost: O(T·M·D') for score computation (cosine similarities), O(T·M) for the softmax, and O(M log M) per frame (O(T·M·log M) total) for top-k sorting. Computational overhead is negligible relative to total generation latency.
  • Integration: Inserted without retraining into ViT→MLP→LLM pipelines; system remains fully compatible with fast attention operators.
  • Scalability: Demonstrated effective down to 15% retention while maintaining >95% original performance; empirical results confirm robustness to both short and long video scenarios.
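
As a back-of-envelope check of the quadratic saving implied by the O((T·M)^2) attention cost, the frame and token counts below are assumed example values, not figures from the paper:

```python
# Retaining a fraction R of the T*M visual tokens shrinks the O((T*M)^2)
# attention term by roughly a factor of 1/R^2. T and M are illustrative.
T, M, R = 32, 729, 0.25
tokens_before = T * M                  # tokens entering the LLM uncompressed
tokens_after = round(tokens_before * R)
quadratic_saving = (tokens_before / tokens_after) ** 2
print(tokens_before, tokens_after, quadratic_saving)  # 23328 5832 16.0
```

At 25% retention the quadratic attention term drops by roughly 16×, which is consistent with the large latency reductions reported while leaving per-token MLP cost to scale only linearly.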

7. Future Directions

Potential research avenues include scaling VidCom² to larger VideoLLM backbones (e.g., 72B parameters), adapting the framework for real-time streaming scenarios, and developing joint learned or hybrid compression systems leveraging VidCom²'s principles of adaptive frame-level compression. Additionally, integration into hardware-accelerated video processing pipelines—as exemplified by modular FPGA video codec platforms—may further reduce system-level latency and resource utilization for video understanding applications (Liu et al., 20 May 2025, Parthasarathy et al., 15 Sep 2025).
