CoPE-VideoLM: Efficient Video Language Modeling

Updated 17 February 2026
  • The paper introduces CoPE-VideoLM, which utilizes motion vectors and residuals from compressed videos to reduce computational cost and token usage.
  • It employs a Δ-Encoder with modality-alignment pre-training to convert codec primitives into compact tokens, ensuring seamless integration with large language models.
  • Benchmark results demonstrate up to an 86% reduction in time-to-first-token and a 93% reduction in token usage, paving the way for scalable and efficient video understanding.

CoPE-VideoLM is a video language modeling framework that leverages video codec primitives—specifically, motion vectors and residuals—rather than relying solely on full-frame image encodings. This approach exploits the inherent redundancy and sparsity captured by codecs in H.264/MPEG video streams, addressing key computational inefficiencies and temporal limitations in standard VideoLM pipelines. CoPE-VideoLM introduces a Δ-Encoder with modality-alignment pre-training, significantly reducing both token usage and inference latency while maintaining or surpassing standard VideoLM accuracy across a spectrum of video understanding benchmarks (Sarkar et al., 13 Feb 2026).

1. Motivation and Limitations of Keyframe-Based VideoLMs

Contemporary VideoLMs typically mitigate the substantial cost of encoding every video frame by utilizing a small selection of keyframes (such as 64 uniformly spaced images) to adhere to a fixed token budget. However, this heuristic is subject to key drawbacks:

  • Sparse Temporal Coverage: Sampling keyframes at coarse intervals risks omitting both macro-level events (long-range actions) and micro-level dynamics (fine-grained motions).
  • Redundant Computation: Adjacent keyframes encode significant redundant information; each, nevertheless, undergoes expensive feature extraction through a full vision encoder.
  • Latency Constraints: The requirement to encode every selected RGB frame inflates the time-to-first-token—the delay between input and the emission of the first language token—detrimentally affecting interactivity in real-time or robotics applications.

These limitations highlight the need for a framework that both preserves essential video characteristics and operates efficiently with respect to computational and memory demands (Sarkar et al., 13 Feb 2026).

2. Codec Primitives: Motion Vectors and Residuals

CoPE-VideoLM exploits the native data structure of compressed videos. In H.264/MPEG formats, each P-frame (predictive inter-frame) is stored as:

  • Motion Vector Field $\tau^{(t)} \in \mathbb{Z}^{H\times W\times 2}$, encoding block-wise 2D pixel displacements relative to a reference frame (serving as coarse optical flow).
  • Residual Tensor $\delta^{(t)} \in \mathbb{R}^{H\times W\times C}$, capturing pixel-level corrections following motion compensation.

The codec reconstructs pixel $i$ in frame $t$ via

$$\hat I^{(t)}_i = \hat I^{(t-1)}_{i - \tau^{(t)}_i} + \delta^{(t)}_i \quad \forall\, i.$$

P-frames consequently retain only the incremental information required for reconstruction, yielding substantially smaller representations than I-frames (Sarkar et al., 13 Feb 2026).
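The reconstruction rule can be sketched in a few lines. The numpy version below is illustrative only, applying per-pixel (rather than block-wise) integer displacements with boundary clamping; it is not the actual codec implementation:

```python
import numpy as np

def reconstruct_p_frame(prev_frame, motion, residual):
    """Apply the codec rule: I_hat[t][i] = I_hat[t-1][i - tau[i]] + delta[i].

    prev_frame: (H, W) reconstructed previous frame
    motion:     (H, W, 2) integer displacements (dy, dx) per pixel
    residual:   (H, W) correction applied after motion compensation
    """
    H, W = prev_frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Source coordinates i - tau_i, clamped to stay inside the frame.
    src_y = np.clip(ys - motion[..., 0], 0, H - 1)
    src_x = np.clip(xs - motion[..., 1], 0, W - 1)
    return prev_frame[src_y, src_x] + residual
```

With zero motion the result is simply the previous frame plus the residual, which makes the incremental nature of P-frames explicit.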

3. Δ-Encoder Architecture and Modality Alignment

CoPE-VideoLM introduces the Δ-Encoder, a dual-branch lightweight transformer architecture for processing codec primitives:

  • Motion Branch:
    • Splits $\tau^{(t)}$ into a grid of $16 \times 16$ blocks, flattening each to a 512-dimensional vector.
    • Applies a small MLP and a 4-layer PreNorm transformer (9 heads, $K_\tau$ learnable query tokens), compressing the blockwise features into $K_\tau$ motion tokens.
  • Residual Branch:
    • Processes δ(t)\delta^{(t)} using a truncated ResNet-18, yielding feature maps.
    • Similarly, a transformer with $K_\delta$ learned queries compresses the output to $K_\delta$ residual tokens.

Each P-frame results in $N = K_\tau + K_\delta$ Δ-tokens (in practice, $N = 8$ with $K_\tau = K_\delta = 4$), compared to $M = 210$ vision tokens for a single I-frame.
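The core compression step in both branches, learnable queries cross-attending over block features, can be illustrated with a single attention operation. The sketch below is a simplified single-head numpy version (no MLP, PreNorm, or multi-layer stacking); the block count of 196 is an assumed value for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_to_k_tokens(block_feats, queries):
    """Cross-attend K learnable queries over blockwise features.

    block_feats: (num_blocks, d) features, e.g. flattened motion blocks
    queries:     (K, d) learnable query tokens
    returns:     (K, d) compressed tokens (single head, single layer)
    """
    d = queries.shape[-1]
    attn = softmax(queries @ block_feats.T / np.sqrt(d))  # (K, num_blocks)
    return attn @ block_feats
```

Each output token is a convex combination of block features, so $K_\tau = 4$ queries reduce hundreds of blocks to 4 motion tokens regardless of input resolution.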

To ensure that Δ-tokens are compatible with the image token embeddings the LLM expects, a modality-alignment pre-training stage aligns the Δ-Encoder outputs to the frozen vision encoder outputs via a patch-wise MSE loss:

$$\mathcal{L}_{\rm align} = \frac{1}{M} \sum_{i=1}^M \left\| X_I^{(t)}(i) - \hat X_P^{(t)}(i) \right\|_2^2$$

Here, $\hat X_P^{(t)}$ is produced by warping $X_I^{(t-1)}$ with motion tokens and updating it with residual tokens through dedicated transformers. This ensures seamless token fusion during end-to-end instruction tuning (Sarkar et al., 13 Feb 2026).
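The alignment objective itself is a simple patch-wise MSE. A minimal sketch, assuming the $M$ patch tokens are available as arrays (the warping transformers that produce the prediction are out of scope here):

```python
import numpy as np

def alignment_loss(x_i, x_p_hat):
    """Patch-wise MSE between frozen vision-encoder tokens X_I and the
    warped-and-updated prediction X_hat_P.

    x_i, x_p_hat: (M, d) arrays of M patch tokens of dimension d.
    Returns the mean over patches of the squared L2 distance.
    """
    return np.mean(np.sum((x_i - x_p_hat) ** 2, axis=-1))
```

Minimizing this drives the Δ-Encoder's predicted patch tokens toward the embeddings a full vision encoder would have produced for the same frame.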

4. Inference Pipeline and Efficiency Metrics

At inference, the pipeline interleaves full-RGB I-frame representations (via a frozen vision encoder) with P-frame Δ-tokens (via the Δ-Encoder) in temporal order. The downstream LLM (e.g., Qwen2) consumes a unified, mixed token stream without architectural modification.
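The interleaving amounts to a per-GOP token-budget layout. The helper below is an illustrative sketch, not the paper's code, using the stated per-frame budgets ($M = 210$ vision tokens per I-frame, $N = 8$ Δ-tokens per P-frame) and a hypothetical GOP layout of one I-frame followed by seven P-frames:

```python
def build_token_stream(num_gops, gop_size=8, M=210, N=8):
    """Interleave I-frame and P-frame token groups in temporal order.

    Each GOP contributes one I-frame (M vision tokens) followed by
    gop_size - 1 P-frames (N delta-tokens each). Returns the per-frame
    (kind, token_count) stream and the total visual-token budget.
    """
    stream = []
    for _ in range(num_gops):
        stream.append(("I", M))                         # frozen vision encoder
        stream.extend(("P", N) for _ in range(gop_size - 1))  # delta-Encoder
    total = sum(count for _, count in stream)
    return stream, total
```

Because the stream is just a temporally ordered sequence of embeddings, the downstream LLM consumes it exactly as it would a dense vision-token sequence, with no architectural change.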

Efficiency is quantified along two primary axes:

  • Time-to-First-Token $T_{\rm first}$: wall-clock time (in seconds) to generate the first output token.
  • Token Usage $U_{\rm tokens}$: relative proportion of visual tokens utilized compared to a dense baseline ($U_{\rm tokens} = \frac{T_{\rm ours}}{T_{\rm base}} \times 100\%$).
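Both metrics are straightforward ratios; the helpers below are illustrative:

```python
def ttft_reduction(t_ours, t_base):
    """Percentage reduction in time-to-first-token vs. a baseline."""
    return (1 - t_ours / t_base) * 100

def token_usage(n_ours, n_base):
    """U_tokens: visual tokens used relative to a dense baseline, in %."""
    return n_ours / n_base * 100
```

Plugging in the reported timings, $(1 - 0.33 / 2.39) \times 100 \approx 86\%$, consistent with the stated time-to-first-token reduction.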

On an RTX 4090, the configuration of 1 I-frame with 7 P-frames per group of pictures (GOP) yields:

| Metric | CoPE-VideoLM | 64-Frame Baseline |
|---|---|---|
| $T_{\rm first}$ | 0.33 s | 2.39 s |
| Token usage | ≈ 5% | 100% |
| Latency (64 tokens) | 1.66 s | 3.78 s |

These figures correspond to an 86% reduction in time-to-first-token and a 93% reduction in token usage relative to conventional keyframe-based VideoLMs (Sarkar et al., 13 Feb 2026).

5. Benchmark Performance and Robustness Across Tasks

Despite aggressive token compression, CoPE-VideoLM sustains or surpasses baseline VideoLM performance across 14 diverse benchmarks in four principal categories:

  • General Video QA: PerceptionTest (70.5% vs. 67.9%), NextQA (81.8% vs. 83.2%), ActivityNet-QA (58.8% vs. 56.5%), VideoMME (61.7% vs. 63.3%).
  • Temporal Reasoning: TempCompass (68.4% vs. 66.6%), TOMATO (28.3% vs. 24.9%), CVRR-ES (49.1% vs. 43.6%), MVBench (61.6% vs. 58.6%).
  • Long-Form Understanding & Instruction Following: Video-TT (44.3% vs. 41.8%), Video-MMMU (37.9% vs. 36.1%), LVBench (46.4% vs. 44.2%), LongVideoBench (56.9% vs. 58.2%).
  • Spatial Scene Understanding: ScanQA and SQA3D, matching or exceeding specialized 3D VLMs (following fine-tuning) using only video data.

This outcome suggests that Δ-tokens distilled from codec primitives retain sufficient semantic and motion detail to close the performance gap with full vision token streams, even at less than 10% of the usual token count (Sarkar et al., 13 Feb 2026).

6. Paradigm Shift and Broader Implications

CoPE-VideoLM illustrates a paradigm shift in video-language modeling by harnessing the native sparsity of compressed video. By directly utilizing motion vectors and residuals, it achieves a superior tradeoff between efficiency and accuracy. The framework’s codec-aware pre-training yields a robust adapter module capable of integrating with any LLM-based VideoLM. Applications in user-facing tasks and robotics benefit markedly from the reduced latency and token footprint, supporting hours-long video analysis and enabling rapid, low-latency responses without accuracy degradation. This establishes a new open-source standard for scalable and efficient VideoLMs (Sarkar et al., 13 Feb 2026).
