VidCom²: Plug-and-play Adaptive Video Compression
- VidCom² is a video compression framework that adaptively assigns token retention per frame based on uniqueness to reduce computational overhead.
- It employs efficient vector operations and rank statistics to compress tokens while preserving essential visual information, achieving up to 70.8% latency reduction.
- VidCom² integrates seamlessly with modern VideoLLMs without retraining, ensuring compatibility with optimized attention operators like FlashAttention-2.
Video Compression Commander (VidCom²) is a plug-and-play, training-free inference acceleration framework that addresses the efficiency bottlenecks of Video LLMs (VideoLLMs) arising from dense visual token streams. By adaptively compressing visual tokens based on frame-level uniqueness, VidCom² preserves essential information while achieving substantial reductions in computational cost and latency. It is explicitly designed for compatibility with a wide range of modern VideoLLM architectures and efficient attention operators, requiring no retraining or architectural modifications (Liu et al., 20 May 2025).
1. Motivation and Context
VideoLLMs, such as LLaVA-OneVision, LLaVA-Video, and Qwen2-VL, ingest sequences of $T$ frames, where each frame produces $M$ visual tokens. The resulting tensor contains $N = T \cdot M$ tokens, and downstream self- and cross-attention operations have $O(N^2)$ time and space complexity. This quadratic scaling renders inference prohibitively expensive, particularly for long video sequences or models with large per-frame token counts. Existing token compression approaches prior to VidCom² either:
- Depend on [CLS] tokens (pre-LLM compression, e.g., FasterVLM, MUSTDrop, FiCoCo), rendering them incompatible with modern SigLIP-based encoders lacking a [CLS] token.
- Require explicit inspection of LLM attention matrices (intra-LLM compression, e.g., FastV, PDrop, SparseVLM), which is incompatible with optimized attention kernels such as FlashAttention and incurs significant memory overhead.
- Apply fixed-window, uniform compression budgets regardless of per-frame informativeness (e.g., DyCoke), missing opportunities to preserve uniquely informative frames, particularly under aggressive token reduction regimes (Liu et al., 20 May 2025).
2. Foundational Design Principles
VidCom² is guided by three core principles:
- Model Adaptability: The method must be plug-and-play on arbitrary VideoLLM pipelines (including SigLIP-based ViT→MLP→LLM stacks) with no retraining or architecture changes.
- Frame Uniqueness: The allocation of compression budgets across frames must be adaptive, favoring frames with higher relative uniqueness in representing salient content.
- Operator Compatibility: The method must avoid any reliance on [CLS] tokens or explicit LLM attention weights, maintaining compatibility with optimized attention implementations (e.g. FlashAttention-2), and must not increase peak memory usage (Liu et al., 20 May 2025).
3. Algorithmic Architecture
VidCom² operates in two stages—Frame Compression Adjustment and Adaptive Token Compression—requiring only vector operations and rank statistics:
3.1 Quantifying Frame Uniqueness
Let $X \in \mathbb{R}^{T \times M \times d}$ represent the visual tokens generated by the video encoder, with $x_{t,m} \in \mathbb{R}^d$ the $m$-th token of frame $t$.
- Global video representation: $g_v = \frac{1}{TM}\sum_{t=1}^{T}\sum_{m=1}^{M} x_{t,m}$, where $g_v \in \mathbb{R}^d$.
- Token-to-video similarity (cosine): $s_{t,m} = \cos(x_{t,m}, g_v)$, with token uniqueness $u_{t,m} = -s_{t,m}$. Tokens with higher $u_{t,m}$ (i.e., lower similarity to $g_v$) are more unique relative to the video as a whole.
- Frame uniqueness score: $u_t = \frac{1}{M}\sum_{m=1}^{M} u_{t,m}$. Higher $u_t$ indicates a more distinctive frame.
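The uniqueness computation above reduces to a few vectorized operations. A minimal NumPy sketch (the function name and $(T, M, d)$ tensor layout are illustrative, not taken from the paper's code):

```python
import numpy as np

def frame_uniqueness(x):
    """Frame uniqueness scores for visual tokens x of shape (T, M, d).

    Uniqueness is the negative cosine similarity to the mean ("global
    video") representation g_v, averaged over each frame's tokens.
    """
    g_v = x.mean(axis=(0, 1))                            # global video representation, (d,)
    x_n = x / np.linalg.norm(x, axis=-1, keepdims=True)  # unit-normalize each token
    g_n = g_v / np.linalg.norm(g_v)
    sim = x_n @ g_n                                      # cosine(x_{t,m}, g_v), shape (T, M)
    tok_u = -sim                                         # per-token uniqueness u_{t,m}
    return tok_u.mean(axis=1), tok_u                     # frame scores u_t, token scores
```

Only pooling, normalization, and one matrix-vector product are needed, which is why the step adds negligible overhead relative to LLM generation.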
3.2 Adaptive Compression Assignment
- Frame scores are mapped to compression weights via softmax normalization (temperature $\tau$): $\sigma_t = \dfrac{\exp\big((u_t - \max_k u_k)/\tau\big)}{\sum_{k} \exp\big((u_k - \max_k u_k)/\tau\big) + \epsilon}$, with the max-subtraction and $\epsilon > 0$ for numerical stability.
- Per-frame retention ratio: $r_t = R\,\big(1 + \sigma_t - \tfrac{1}{T}\big)$, where $R$ is the global retention ratio. $\frac{1}{T}\sum_{t} r_t = R$ by construction, so the overall token budget is preserved while more unique frames retain more tokens.
- Algorithmic pseudocode (Frame Compression Adjustment):

```
for t = 1..T:
    u_t ← (1/M) · Σ_{m=1}^{M} (−cosine(x_{t,m}, g_v))
for t = 1..T:
    ŭ_t ← (u_t − max_k u_k) / τ
    σ_t ← exp(ŭ_t) / (Σ_k exp(ŭ_k) + ε)
    r_t ← R · (1 + σ_t − 1/T)
```
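The adjustment step can be sketched as a small NumPy routine (names and default values are illustrative; `tau` and `eps` correspond to the temperature $\tau$ and stability term $\epsilon$):

```python
import numpy as np

def frame_retention_ratios(u, R=0.25, tau=1.0, eps=1e-6):
    """Map frame uniqueness scores u (shape (T,)) to per-frame retention
    ratios r_t = R * (1 + sigma_t - 1/T), so the mean ratio stays at R."""
    T = u.shape[0]
    z = (u - u.max()) / tau                      # max-subtraction for stability
    sigma = np.exp(z) / (np.exp(z).sum() + eps)  # softmax weights sigma_t
    return R * (1.0 + sigma - 1.0 / T)
```

Uniform uniqueness yields the uniform ratio $R$ for every frame; any deviation shifts budget toward the more distinctive frames while keeping the average at $R$.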
3.3 Token Selection Within Frames
Given $r_t$, select $N_t = \lceil r_t M \rceil$ tokens per frame via combined video- and frame-level uniqueness:
- Frame-global representation: $g_t = \frac{1}{M}\sum_{m=1}^{M} x_{t,m}$
- Token-to-frame similarity and uniqueness: $s^{f}_{t,m} = \cos(x_{t,m}, g_t)$, $u^{f}_{t,m} = -s^{f}_{t,m}$
- Combined token score: $\hat{u}_{t,m} = \tfrac{1}{2}\big(u_{t,m} + u^{f}_{t,m}\big)$, averaging video- and frame-level uniqueness
Retain the top $N_t$ tokens per frame by ranking $\hat{u}_{t,m}$ (Liu et al., 20 May 2025).
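The within-frame selection can be sketched as follows, assuming an equal-weight average of the two uniqueness terms and rounding of the kept-token count (both plausible readings of the scoring rule, not verified against the reference implementation):

```python
import numpy as np

def select_tokens(x_t, g_v, r_t):
    """Keep the most unique tokens of one frame x_t with shape (M, d),
    scoring each token by averaged video- and frame-level uniqueness."""
    M = x_t.shape[0]
    g_t = x_t.mean(axis=0)                                   # frame-global representation
    x_n = x_t / np.linalg.norm(x_t, axis=-1, keepdims=True)
    u_video = -(x_n @ (g_v / np.linalg.norm(g_v)))           # -cos(x, g_v)
    u_frame = -(x_n @ (g_t / np.linalg.norm(g_t)))           # -cos(x, g_t)
    score = 0.5 * (u_video + u_frame)                        # combined token score
    k = max(1, int(round(r_t * M)))                          # tokens to retain
    keep = np.argsort(score)[-k:]                            # top-k most unique
    return np.sort(keep)                                     # preserve original token order
```

Sorting the kept indices preserves the original spatial/temporal ordering of tokens handed to the LLM.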
4. System Integration and Operator Compatibility
VidCom² is positioned between the video encoder (ViT→MLP) and the LLM. Its operations consist solely of pooling, cosine similarities, softmax normalization, and top-$k$ selection. It never requires explicit inspection of LLM attention or [CLS] tokens, ensuring maximal compatibility with highly efficient operator kernels such as FlashAttention-2. Peak memory footprint is not increased, as all computation occurs on encoder outputs prior to LLM input (Liu et al., 20 May 2025).
Furthermore, the Frame Compression Adjustment strategy can function as a modular "plug-in" for other token selection schemes (e.g., FastV, SparseVLM), replacing their uniform per-frame token budget with the adaptive, uniqueness-weighted allocation.
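Used as a plug-in, Frame Compression Adjustment only needs to convert a uniform total budget into per-frame token counts that the base scheme then consumes. A hypothetical helper illustrating this reallocation (the function name is illustrative):

```python
import numpy as np

def adaptive_frame_budgets(u, total_keep, tau=1.0, eps=1e-6):
    """Redistribute a uniform budget of total_keep tokens across T frames
    according to frame uniqueness u, so unique frames keep more tokens.
    Any base token-selection scheme can consume the returned counts."""
    T = u.shape[0]
    z = (u - u.max()) / tau
    sigma = np.exp(z) / (np.exp(z).sum() + eps)
    k = (total_keep / T) * (1.0 + sigma - 1.0 / T)  # per-frame share of the budget
    return np.round(k).astype(int)                  # integer token counts
```

With uniform uniqueness this degenerates to the original uniform budget, so the plug-in never hurts the base scheme's allocation in the worst case; rounding can shift the total by at most a few tokens.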
5. Empirical Performance
Evaluation is conducted across VideoLLMs (LLaVA-OneVision 7B, LLaVA-Video, Qwen2-VL) and video understanding benchmarks including MVBench, LongVideoBench, MLVU, and VideoMME (S/M/L) via LMMs-Eval. Key findings:
| Method | Retention | LLaVA-OV-7B Avg | % of Orig | Latency Reduction | Throughput |
|---|---|---|---|---|---|
| Vanilla | 100% | 56.9 | 100.0% | -- | 0.64× |
| VidCom² | 25% | 57.2 | 99.6% | −70.8% LLM Gen | 1.38× |
| DyCoke@30% | 30% | 56.6 | 96.5% | -- | -- |
| VidCom² | 15% | -- | 95.1% | -- | -- |
For Qwen2-VL on long VideoMME, VidCom² at 25% retention attains 101.2% of reference performance, exceeding DyCoke (93.6%) and SparseVLM (96.6%).
Token scoring that combines frame- and video-level uniqueness outperforms either signal alone, with averaging the two uniqueness terms preferred for the combined score, and the Frame Compression Adjustment module consistently improves the robustness of other token compression schemes. Complexity overhead is low: on the MVBench set, end-to-end scoring for all tokens adds only 2.5 s, equivalent to 1.3% of LLM generation time, while reducing overall LLM generation latency by 70.8% and total model latency by 43.0%, with no increase in peak memory (Liu et al., 20 May 2025).
6. Complexity, Scalability, and Practical Implementation
- Algorithmic cost: $O(TMd)$ for score computation (cosine similarities), $O(T)$ for the softmax, and $O(M \log M)$ per frame for top-$k$ selection. Computational overhead is negligible relative to total generation latency.
- Integration: Inserted without retraining into ViT→MLP→LLM pipelines; system remains fully compatible with fast attention operators.
- Scalability: Demonstrated to be effective down to 15% retention while maintaining >95% of original performance; empirical results confirm robustness across both short and long video scenarios.
7. Future Directions
Potential research avenues include scaling VidCom² to larger VideoLLM backbones (e.g., 72B parameters), adapting the framework for real-time streaming scenarios, and developing joint learned or hybrid compression systems leveraging VidCom²'s principles of adaptive frame-level compression. Additionally, integration into hardware-accelerated video processing pipelines—as exemplified by modular FPGA video codec platforms—may further reduce system-level latency and resource utilization for video understanding applications (Liu et al., 20 May 2025, Parthasarathy et al., 15 Sep 2025).