Papers
Topics
Authors
Recent
Search
2000 character limit reached

Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models

Published 21 Mar 2025 in cs.CV, cs.AI, cs.CL, and cs.LG | (2503.16980v3)

Abstract: Token-based video representation has emerged as a promising approach for enabling LLMs to interpret video content. However, existing token reduction, such as token pruning and token merging, often disrupt essential spatial-temporal positional embeddings, failing to adequately balance computational efficiency with fewer tokens. Consequently, these methods result in lengthy token sequences, limiting their applicability in scenarios requiring extreme token compression, such as video LLMs. In this paper, we introduce the novel task of extreme short token reduction, aiming to represent extensive video sequences with a minimal number of tokens. To address this challenge, we propose Token Dynamics, a new video representation framework that dynamically reduces token count while preserving spatial-temporal coherence. Specifically, we disentangle video representations by separating visual embeddings from grid-level motion information, structuring them into: 1. a concise token hash table, created by clustering tokens that describe object-level content; 2. a token indices key map, capturing detailed spatial-temporal motion patterns across grids; 3. a token hash function, which vector-quantizes the token hash table to reconstruct the token sequence from the key map. Furthermore, we introduce a cross-dynamics attention mechanism that integrates motion features into the token base without increasing token length, thereby maintaining compactness and spatial-temporal integrity. The experiments demonstrate a reduction of token count to merely 0.07% of the original tokens, with only a minor performance drop of 1.13%. Additionally, we propose two novel subtasks within extreme token reduction (fixed-length and adaptive-length compression). Our method offers significantly lower theoretical complexity, fewer tokens, and enhanced throughput, thus providing an efficient solution for video LLMs.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (2)

Collections

Sign up for free to add this paper to one or more collections.