CoPE-VideoLM: Efficient Video Language Modeling
- The paper introduces CoPE-VideoLM, which utilizes motion vectors and residuals from compressed videos to reduce computational cost and token usage.
- It employs a Δ-Encoder with modality-alignment pre-training to convert codec primitives into compact tokens, ensuring seamless integration with large language models.
- Benchmark results demonstrate up to an 86% reduction in time-to-first-token and a 93% reduction in token usage, paving the way for scalable and efficient video understanding.
CoPE-VideoLM is a video language modeling framework that leverages video codec primitives—specifically, motion vectors and residuals—rather than relying solely on full-frame image encodings. This approach exploits the inherent redundancy and sparsity captured by codecs in H.264/MPEG video streams, addressing key computational inefficiencies and temporal limitations in standard VideoLM pipelines. CoPE-VideoLM introduces a Δ-Encoder with modality-alignment pre-training, significantly reducing both token usage and inference latency while maintaining or surpassing standard VideoLM accuracy across a spectrum of video understanding benchmarks (Sarkar et al., 13 Feb 2026).
1. Motivation and Limitations of Keyframe-Based VideoLMs
Contemporary VideoLMs typically mitigate the substantial cost of encoding every video frame by utilizing a small selection of keyframes (such as 64 uniformly spaced images) to adhere to a fixed token budget. However, this heuristic is subject to key drawbacks:
- Sparse Temporal Coverage: Sampling keyframes at coarse intervals risks omitting both macro-level events (long-range actions) and micro-level dynamics (fine-grained motions).
- Redundant Computation: Adjacent keyframes carry largely redundant information, yet each one still undergoes expensive feature extraction through a full vision encoder.
- Latency Constraints: The requirement to encode every selected RGB frame inflates the time-to-first-token—the delay between input and the emission of the first language token—detrimentally affecting interactivity in real-time or robotics applications.
These limitations highlight the need for a framework that both preserves essential video characteristics and operates efficiently with respect to computational and memory demands (Sarkar et al., 13 Feb 2026).
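The sparse-coverage problem above is easy to quantify. The sketch below assumes an illustrative 30-minute video at 30 fps and the 64-frame budget mentioned earlier; the exact numbers are for illustration only.

```python
import numpy as np

# Uniformly sample 64 keyframe indices from a 30-minute, 30 fps video,
# as a typical keyframe-based VideoLM would.
fps, minutes, budget = 30, 30, 64
total_frames = fps * 60 * minutes            # 54,000 frames
indices = np.linspace(0, total_frames - 1, budget, dtype=int)

# Temporal gap between consecutive keyframes, in seconds.
gap_s = (indices[1] - indices[0]) / fps
print(f"stride = {indices[1] - indices[0]} frames, gap = {gap_s:.1f} s")
```

With these assumed parameters, consecutive keyframes are nearly half a minute apart, so any action shorter than that can fall entirely between samples.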
2. Codec Primitives: Motion Vectors and Residuals
CoPE-VideoLM exploits the native data structure of compressed videos. In H.264/MPEG formats, each P-frame (predictive inter-frame) is stored as:
- Motion Vector Field $M_t$: a block-wise field of 2D pixel displacements relative to a reference frame, serving as coarse optical flow.
- Residual Tensor $R_t$: pixel-level corrections applied after motion compensation.
The codec reconstructs pixel $p$ of frame $t$ via $F_t(p) = F_{t-1}\bigl(p + M_t(p)\bigr) + R_t(p)$. P-frames consequently retain only the incremental information required for reconstruction, yielding substantially smaller representations than I-frames (Sarkar et al., 13 Feb 2026).
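The block-wise reconstruction described above can be sketched in a few lines. This is a simplified model assuming 16×16 macroblocks and integer-pel motion; real H.264 decoding also involves sub-pel interpolation and multiple reference frames.

```python
import numpy as np

def reconstruct_p_frame(ref, mv, residual, block=16):
    """Simplified P-frame reconstruction from codec primitives.

    ref:      (H, W) reference frame
    mv:       (H//block, W//block, 2) integer block displacements (dy, dx)
    residual: (H, W) pixel-level correction applied after motion compensation
    """
    H, W = ref.shape
    out = np.empty_like(ref)
    for by in range(H // block):
        for bx in range(W // block):
            dy, dx = mv[by, bx]
            # Clamp the motion-compensated source block to the frame bounds.
            y0 = np.clip(by * block + dy, 0, H - block)
            x0 = np.clip(bx * block + dx, 0, W - block)
            out[by*block:(by+1)*block, bx*block:(bx+1)*block] = \
                ref[y0:y0+block, x0:x0+block]
    return out + residual
```

With zero motion vectors the function reduces to `ref + residual`, which makes the incremental nature of P-frames explicit: a static scene costs almost nothing to store.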
3. Δ-Encoder Architecture and Modality Alignment
CoPE-VideoLM introduces the Δ-Encoder, a dual-branch lightweight transformer architecture for processing codec primitives:
- Motion Branch: embeds the motion-vector field $M_t$ and compresses it with a transformer using learned queries into a small set of motion tokens.
- Residual Branch: processes $R_t$ with a truncated ResNet-18 to obtain feature maps; similarly, a transformer with learned queries compresses this output into a small set of residual tokens.
Each P-frame thus yields only a handful of Δ-tokens (motion plus residual), far fewer than the vision tokens produced for a single I-frame.
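The learned-query compression used in both branches can be illustrated with plain cross-attention. The dimensions below (196 input patches, 8 output tokens, width 64) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Sketch: compress N patch features into k tokens via cross-attention with
# a fixed set of learned queries (sizes here are placeholders).
rng = np.random.default_rng(0)
d, N, k = 64, 196, 8                   # feature dim, input patches, output tokens
features = rng.normal(size=(N, d))     # e.g. motion or residual patch features
queries = rng.normal(size=(k, d))      # learned query vectors (trainable)

attn = softmax(queries @ features.T / np.sqrt(d))   # (k, N) attention weights
tokens = attn @ features                            # (k, d) compressed Δ-tokens
print(tokens.shape)
```

The key point is that the output token count is fixed by the number of queries, not by the input resolution, which is what lets each P-frame contribute a constant, small token budget.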
To ensure that Δ-tokens are compatible with the image-token embeddings the LLM expects, a modality-alignment pre-training stage aligns the Δ-Encoder outputs $\hat{z}$ to the frozen vision encoder outputs $z$ via a patch-wise MSE loss, $\mathcal{L}_{\text{align}} = \frac{1}{P}\sum_{i=1}^{P} \lVert \hat{z}_i - z_i \rVert_2^2$. Here, $\hat{z}$ is produced by warping the previous frame's vision features with the motion tokens and updating them with the residual tokens through dedicated transformers. This ensures seamless token fusion during end-to-end instruction tuning (Sarkar et al., 13 Feb 2026).
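The alignment objective can be sketched as follows. The warp and residual-update transformers are replaced here by additive stand-ins, and all shapes are assumed; only the patch-wise MSE structure mirrors the paper's description.

```python
import numpy as np

def patchwise_mse(z_hat, z):
    # Mean squared error over all patch tokens: (num_patches, dim) grids.
    return np.mean((z_hat - z) ** 2)

rng = np.random.default_rng(1)
P, d = 196, 64                                    # assumed token grid and width
z_prev = rng.normal(size=(P, d))                  # vision tokens, previous frame
motion_update = 0.10 * rng.normal(size=(P, d))    # stand-in for warp transformer
residual_update = 0.05 * rng.normal(size=(P, d))  # stand-in for residual transformer

z_hat = z_prev + motion_update + residual_update  # Δ-path prediction
z = z_hat + 0.01 * rng.normal(size=(P, d))        # frozen vision-encoder target

loss = patchwise_mse(z_hat, z)
print(float(loss))
```

Minimizing this loss during pre-training is what lets Δ-tokens be dropped into the LLM's input stream in place of full vision tokens.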
4. Inference Pipeline and Efficiency Metrics
At inference, the pipeline interleaves full-RGB I-frame representations (via a frozen vision encoder) with P-frame Δ-tokens (via the Δ-Encoder) in temporal order. The downstream LLM (e.g., Qwen2) consumes a unified, mixed token stream without architectural modification.
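The interleaving described above amounts to a simple per-GOP layout of the token stream. The per-frame token counts below are illustrative placeholders, not the paper's figures.

```python
# Sketch of the interleaved token stream: per GOP, one I-frame's vision
# tokens followed by each P-frame's much smaller Δ-token set.
TOKENS_PER_I_FRAME = 196   # frozen vision encoder output (assumed)
TOKENS_PER_P_FRAME = 16    # Δ-Encoder output (assumed)

def build_stream(num_gops, p_frames_per_gop=7):
    stream = []
    for g in range(num_gops):
        stream.append(("I", g, TOKENS_PER_I_FRAME))
        for _ in range(p_frames_per_gop):
            stream.append(("P", g, TOKENS_PER_P_FRAME))
    return stream

stream = build_stream(num_gops=2)            # 2 GOPs = 16 frames covered
total = sum(n for _, _, n in stream)
dense = 16 * TOKENS_PER_I_FRAME              # same 16 frames, densely encoded
print(f"interleaved: {total} tokens vs dense: {dense} tokens")
```

Because the LLM only sees a flat sequence of embeddings, no architectural change is needed; the savings come entirely from how few tokens each P-frame contributes.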
Efficiency is quantified along two primary axes:
- Time-to-First-Token (TTFT): wall-clock time (in seconds) from input to the first generated output token.
- Token Usage: the proportion of visual tokens consumed relative to a dense baseline (100%).
On an RTX 4090, the configuration of 1 I-frame with 7 P-frames per group of pictures (GOP) yields:
| Metric | CoPE-VideoLM | 64-Frame Baseline |
|---|---|---|
| TTFT | 0.33 s | 2.39 s |
| Token usage | ≈ 5% | 100% |
| Latency (64 tokens) | 1.66 s | 3.78 s |
These figures correspond to an 86% reduction in time-to-first-token and a 93% reduction in token usage relative to conventional keyframe-based VideoLMs (Sarkar et al., 13 Feb 2026).
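The reported reductions follow directly from the measured values in the table:

```python
# TTFT drops from 2.39 s (64-frame baseline) to 0.33 s on an RTX 4090.
ttft_base, ttft_cope = 2.39, 0.33
ttft_reduction = 1 - ttft_cope / ttft_base
print(f"TTFT reduction: {ttft_reduction:.0%}")        # ~86%

# End-to-end latency for generating 64 output tokens.
latency_base, latency_cope = 3.78, 1.66
print(f"64-token speedup: {latency_base / latency_cope:.2f}x")
```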
5. Benchmark Performance and Robustness Across Tasks
Despite aggressive token compression, CoPE-VideoLM sustains or surpasses baseline VideoLM performance across 14 diverse benchmarks in four principal categories:
- General Video QA: PerceptionTest (70.5% vs. 67.9%), NextQA (81.8% vs. 83.2%), ActivityNet-QA (58.8% vs. 56.5%), VideoMME (61.7% vs. 63.3%).
- Temporal Reasoning: TempCompass (68.4% vs. 66.6%), TOMATO (28.3% vs. 24.9%), CVRR-ES (49.1% vs. 43.6%), MVBench (61.6% vs. 58.6%).
- Long-Form Understanding & Instruction Following: Video-TT (44.3% vs. 41.8%), Video-MMMU (37.9% vs. 36.1%), LVBench (46.4% vs. 44.2%), LongVideoBench (56.9% vs. 58.2%).
- Spatial Scene Understanding: on ScanQA and SQA3D, CoPE-VideoLM matches or exceeds specialized 3D VLMs after fine-tuning, despite using only video data.
This outcome suggests that Δ-tokens distilled from codec primitives retain sufficient semantic and motion detail to close the performance gap with full vision token streams, even at less than 10% of the usual token count (Sarkar et al., 13 Feb 2026).
6. Paradigm Shift and Broader Implications
CoPE-VideoLM illustrates a paradigm shift in video-language modeling by harnessing the native sparsity of compressed video. By directly utilizing motion vectors and residuals, it achieves a superior tradeoff between efficiency and accuracy. The framework’s codec-aware pre-training yields a robust adapter module capable of integrating with any LLM-based VideoLM. Applications in user-facing tasks and robotics benefit markedly from the reduced latency and token footprint, supporting hours-long video analysis and enabling rapid, low-latency responses without accuracy degradation. This establishes a new open-source standard for scalable and efficient VideoLMs (Sarkar et al., 13 Feb 2026).