CoPE-VideoLM: Efficient Video Language Modeling
- The paper introduces CoPE-VideoLM, which utilizes motion vectors and residuals from compressed videos to reduce computational cost and token usage.
- It employs a Δ-Encoder with modality-alignment pre-training to convert codec primitives into compact tokens, ensuring seamless integration with large language models.
- Benchmark results demonstrate up to an 86% reduction in time-to-first-token and a 93% reduction in token usage, paving the way for scalable and efficient video understanding.
CoPE-VideoLM is a video language modeling framework that leverages video codec primitives—specifically, motion vectors and residuals—rather than relying solely on full-frame image encodings. This approach exploits the inherent redundancy and sparsity captured by codecs in H.264/MPEG video streams, addressing key computational inefficiencies and temporal limitations in standard VideoLM pipelines. CoPE-VideoLM introduces a Δ-Encoder with modality-alignment pre-training, significantly reducing both token usage and inference latency while maintaining or surpassing standard VideoLM accuracy across a spectrum of video understanding benchmarks (Sarkar et al., 13 Feb 2026).
1. Motivation and Limitations of Keyframe-Based VideoLMs
Contemporary VideoLMs typically mitigate the substantial cost of encoding every video frame by utilizing a small selection of keyframes (such as 64 uniformly spaced images) to adhere to a fixed token budget. However, this heuristic is subject to key drawbacks:
- Sparse Temporal Coverage: Sampling keyframes at coarse intervals risks omitting both macro-level events (long-range actions) and micro-level dynamics (fine-grained motions).
- Redundant Computation: Adjacent keyframes carry largely redundant information, yet each one still undergoes expensive feature extraction through a full vision encoder.
- Latency Constraints: The requirement to encode every selected RGB frame inflates the time-to-first-token—the delay between input and the emission of the first language token—detrimentally affecting interactivity in real-time or robotics applications.
These limitations highlight the need for a framework that both preserves essential video characteristics and operates efficiently with respect to computational and memory demands (Sarkar et al., 13 Feb 2026).
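The sparse-coverage problem above is easy to quantify. The sketch below assumes an illustrative 30-minute video at 30 fps and the 64-frame budget mentioned earlier; the exact numbers are for illustration only.

```python
import numpy as np

# Uniformly sample 64 keyframe indices from a 30-minute, 30 fps video,
# as a typical keyframe-based VideoLM would.
fps, minutes, budget = 30, 30, 64
total_frames = fps * 60 * minutes            # 54,000 frames
indices = np.linspace(0, total_frames - 1, budget, dtype=int)

# Temporal gap between consecutive keyframes, in seconds.
gap_s = (indices[1] - indices[0]) / fps
print(f"stride = {indices[1] - indices[0]} frames, gap = {gap_s:.1f} s")
```

With these assumed parameters, consecutive keyframes are nearly half a minute apart, so any action shorter than that can fall entirely between samples.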
2. Codec Primitives: Motion Vectors and Residuals
CoPE-VideoLM exploits the native data structure of compressed videos. In H.264/MPEG formats, each P-frame (predictive inter-frame) is stored as:
- Motion Vector Field $M_t$: a block-wise field of 2D pixel displacements relative to a reference frame, serving as coarse optical flow.
- Residual Tensor $R_t$: pixel-level corrections applied after motion compensation.
The codec reconstructs pixel $p$ of frame $t$ via $F_t(p) = F_{t-1}\bigl(p + M_t(p)\bigr) + R_t(p)$. P-frames consequently retain only the incremental information required for reconstruction, yielding substantially smaller representations than I-frames (Sarkar et al., 13 Feb 2026).
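The block-wise reconstruction described above can be sketched in a few lines. This is a simplified model assuming 16×16 macroblocks and integer-pel motion; real H.264 decoding also involves sub-pel interpolation and multiple reference frames.

```python
import numpy as np

def reconstruct_p_frame(ref, mv, residual, block=16):
    """Simplified P-frame reconstruction from codec primitives.

    ref:      (H, W) reference frame
    mv:       (H//block, W//block, 2) integer block displacements (dy, dx)
    residual: (H, W) pixel-level correction applied after motion compensation
    """
    H, W = ref.shape
    out = np.empty_like(ref)
    for by in range(H // block):
        for bx in range(W // block):
            dy, dx = mv[by, bx]
            # Clamp the motion-compensated source block to the frame bounds.
            y0 = np.clip(by * block + dy, 0, H - block)
            x0 = np.clip(bx * block + dx, 0, W - block)
            out[by*block:(by+1)*block, bx*block:(bx+1)*block] = \
                ref[y0:y0+block, x0:x0+block]
    return out + residual
```

With zero motion vectors the function reduces to `ref + residual`, which makes the incremental nature of P-frames explicit: a static scene costs almost nothing to store.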
3. Δ-Encoder Architecture and Modality Alignment
CoPE-VideoLM introduces the Δ-Encoder, a dual-branch lightweight transformer architecture for processing codec primitives:
- Motion Branch: embeds the motion-vector field $M_t$ and compresses it with a transformer using learned queries into a small set of motion tokens.
- Residual Branch: processes $R_t$ with a truncated ResNet-18 to obtain feature maps; similarly, a transformer with learned queries compresses this output into a small set of residual tokens.
Each P-frame thus yields only a handful of Δ-tokens (motion plus residual), far fewer than the vision tokens produced for a single I-frame.
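The learned-query compression used in both branches can be illustrated with plain cross-attention. The dimensions below (196 input patches, 8 output tokens, width 64) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Sketch: compress N patch features into k tokens via cross-attention with
# a fixed set of learned queries (sizes here are placeholders).
rng = np.random.default_rng(0)
d, N, k = 64, 196, 8                   # feature dim, input patches, output tokens
features = rng.normal(size=(N, d))     # e.g. motion or residual patch features
queries = rng.normal(size=(k, d))      # learned query vectors (trainable)

attn = softmax(queries @ features.T / np.sqrt(d))   # (k, N) attention weights
tokens = attn @ features                            # (k, d) compressed Δ-tokens
print(tokens.shape)
```

The key point is that the output token count is fixed by the number of queries, not by the input resolution, which is what lets each P-frame contribute a constant, small token budget.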
To ensure that Δ-tokens are compatible with the image-token embeddings the LLM expects, a modality-alignment pre-training stage aligns the Δ-Encoder outputs $\hat{z}$ to the frozen vision encoder outputs $z$ via a patch-wise MSE loss, $\mathcal{L}_{\text{align}} = \frac{1}{P}\sum_{i=1}^{P} \lVert \hat{z}_i - z_i \rVert_2^2$. Here, $\hat{z}$ is produced by warping the previous frame's vision features with the motion tokens and updating them with the residual tokens through dedicated transformers. This ensures seamless token fusion during end-to-end instruction tuning (Sarkar et al., 13 Feb 2026).
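The alignment objective can be sketched as follows. The warp and residual-update transformers are replaced here by additive stand-ins, and all shapes are assumed; only the patch-wise MSE structure mirrors the paper's description.

```python
import numpy as np

def patchwise_mse(z_hat, z):
    # Mean squared error over all patch tokens: (num_patches, dim) grids.
    return np.mean((z_hat - z) ** 2)

rng = np.random.default_rng(1)
P, d = 196, 64                                    # assumed token grid and width
z_prev = rng.normal(size=(P, d))                  # vision tokens, previous frame
motion_update = 0.10 * rng.normal(size=(P, d))    # stand-in for warp transformer
residual_update = 0.05 * rng.normal(size=(P, d))  # stand-in for residual transformer

z_hat = z_prev + motion_update + residual_update  # Δ-path prediction
z = z_hat + 0.01 * rng.normal(size=(P, d))        # frozen vision-encoder target

loss = patchwise_mse(z_hat, z)
print(float(loss))
```

Minimizing this loss during pre-training is what lets Δ-tokens be dropped into the LLM's input stream in place of full vision tokens.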
4. Inference Pipeline and Efficiency Metrics
At inference, the pipeline interleaves full-RGB I-frame representations (via a frozen vision encoder) with P-frame Δ-tokens (via the Δ-Encoder) in temporal order. The downstream LLM (e.g., Qwen2) consumes a unified, mixed token stream without architectural modification.
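The interleaving described above amounts to a simple per-GOP layout of the token stream. The per-frame token counts below are illustrative placeholders, not the paper's figures.

```python
# Sketch of the interleaved token stream: per GOP, one I-frame's vision
# tokens followed by each P-frame's much smaller Δ-token set.
TOKENS_PER_I_FRAME = 196   # frozen vision encoder output (assumed)
TOKENS_PER_P_FRAME = 16    # Δ-Encoder output (assumed)

def build_stream(num_gops, p_frames_per_gop=7):
    stream = []
    for g in range(num_gops):
        stream.append(("I", g, TOKENS_PER_I_FRAME))
        for _ in range(p_frames_per_gop):
            stream.append(("P", g, TOKENS_PER_P_FRAME))
    return stream

stream = build_stream(num_gops=2)            # 2 GOPs = 16 frames covered
total = sum(n for _, _, n in stream)
dense = 16 * TOKENS_PER_I_FRAME              # same 16 frames, densely encoded
print(f"interleaved: {total} tokens vs dense: {dense} tokens")
```

Because the LLM only sees a flat sequence of embeddings, no architectural change is needed; the savings come entirely from how few tokens each P-frame contributes.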
Efficiency is quantified along two primary axes:
- Time-to-First-Token (TTFT): wall-clock time (in seconds) from input to the first generated output token.
- Token Usage: the proportion of visual tokens consumed relative to a dense baseline (100%).
On an RTX 4090, the configuration of 1 I-frame with 7 P-frames per group of pictures (GOP) yields:
| Metric | CoPE-VideoLM | 64-Frame Baseline |
|---|---|---|
| TTFT | 0.33 s | 2.39 s |
| Token usage | ≈ 5% | 100% |
| Latency (64 tokens) | 1.66 s | 3.78 s |
These figures correspond to an 86% reduction in time-to-first-token and a 93% reduction in token usage relative to conventional keyframe-based VideoLMs (Sarkar et al., 13 Feb 2026).
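The reported reductions follow directly from the measured values in the table:

```python
# TTFT drops from 2.39 s (64-frame baseline) to 0.33 s on an RTX 4090.
ttft_base, ttft_cope = 2.39, 0.33
ttft_reduction = 1 - ttft_cope / ttft_base
print(f"TTFT reduction: {ttft_reduction:.0%}")        # ~86%

# End-to-end latency for generating 64 output tokens.
latency_base, latency_cope = 3.78, 1.66
print(f"64-token speedup: {latency_base / latency_cope:.2f}x")
```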
5. Benchmark Performance and Robustness Across Tasks
Despite aggressive token compression, CoPE-VideoLM sustains or surpasses baseline VideoLM performance across 14 diverse benchmarks in four principal categories:
- General Video QA: PerceptionTest (70.5% vs. 67.9%), NextQA (81.8% vs. 83.2%), ActivityNet-QA (58.8% vs. 56.5%), VideoMME (61.7% vs. 63.3%).
- Temporal Reasoning: TempCompass (68.4% vs. 66.6%), TOMATO (28.3% vs. 24.9%), CVRR-ES (49.1% vs. 43.6%), MVBench (61.6% vs. 58.6%).
- Long-Form Understanding & Instruction Following: Video-TT (44.3% vs. 41.8%), Video-MMMU (37.9% vs. 36.1%), LVBench (46.4% vs. 44.2%), LongVideoBench (56.9% vs. 58.2%).
- Spatial Scene Understanding: on ScanQA and SQA3D, CoPE-VideoLM matches or exceeds specialized 3D VLMs after fine-tuning, despite using only video data.
This outcome suggests that Δ-tokens distilled from codec primitives retain sufficient semantic and motion detail to close the performance gap with full vision token streams, even at less than 10% of the usual token count (Sarkar et al., 13 Feb 2026).
6. Paradigm Shift and Broader Implications
CoPE-VideoLM illustrates a paradigm shift in video-language modeling by harnessing the native sparsity of compressed video. By directly utilizing motion vectors and residuals, it achieves a superior tradeoff between efficiency and accuracy. The framework’s codec-aware pre-training yields a robust adapter module capable of integrating with any LLM-based VideoLM. Applications in user-facing tasks and robotics benefit markedly from the reduced latency and token footprint, supporting hours-long video analysis and enabling rapid, low-latency responses without accuracy degradation. This establishes a new open-source standard for scalable and efficient VideoLMs (Sarkar et al., 13 Feb 2026).