
Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning (early version)

Published 16 Sep 2025 in cs.CV | (2509.13161v1)

Abstract: Despite the rapid progress of video LLMs, comprehensive video reasoning is still limited by the spatio-temporal incompleteness inherent in any single video, which leads to hallucinations and inaccuracies. A promising solution is to augment reasoning with multiple related videos. However, video tokens are numerous and contain redundant information, so directly feeding the related videos into an LLM to enhance its responses could be counterproductive. To address this challenge, we propose a multi-video collaborative framework for video LLMs. For efficient and flexible video representation, we establish a Video Structuring Module that represents a video's knowledge as a spatio-temporal graph. Based on this structured representation, we design a Graph Fusion Module that fuses the structured knowledge and valuable information from related videos into augmented graph node tokens. Finally, we construct a multi-video structured prompt that integrates the graph, visual, and textual tokens as the input to the LLM. Extensive experiments substantiate the effectiveness of our framework, showing its potential as a promising avenue for advancing video LLMs.
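
To make the idea of a spatio-temporal graph and cross-video fusion more concrete, the following is a minimal illustrative sketch and not the paper's implementation. The abstract does not specify how nodes, edges, or fusion are defined, so everything here is an assumption: nodes are labeled per-frame detections with a feature vector, spatial edges link nodes in the same frame, temporal edges link same-label nodes in adjacent frames, and fusion is a plain weighted average with same-label nodes from related videos. The names `Node`, `SpatioTemporalGraph`, `build_graph`, and `fuse` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    frame: int            # frame index in which the entity was observed
    label: str            # hypothetical entity label, e.g. "dog"
    feature: list[float]  # hypothetical feature vector for the entity


@dataclass
class SpatioTemporalGraph:
    nodes: list[Node] = field(default_factory=list)
    spatial_edges: list[tuple[int, int]] = field(default_factory=list)
    temporal_edges: list[tuple[int, int]] = field(default_factory=list)


def build_graph(detections):
    """Build a graph from (frame, label, feature) tuples (assumed input format)."""
    graph = SpatioTemporalGraph()
    for i, (frame, label, feature) in enumerate(detections):
        graph.nodes.append(Node(i, frame, label, list(feature)))
    for a in graph.nodes:
        for b in graph.nodes:
            if a.node_id >= b.node_id:
                continue
            if a.frame == b.frame:
                # Assumption: entities sharing a frame are spatially related.
                graph.spatial_edges.append((a.node_id, b.node_id))
            elif a.label == b.label and abs(a.frame - b.frame) == 1:
                # Assumption: same label in adjacent frames is the same entity.
                graph.temporal_edges.append((a.node_id, b.node_id))
    return graph


def fuse(target, related, weight=0.5):
    """Blend each target node feature with the mean feature of same-label
    nodes from related graphs. A stand-in for learned fusion, not the
    paper's Graph Fusion Module."""
    fused = []
    for node in target.nodes:
        matches = [n.feature for g in related for n in g.nodes
                   if n.label == node.label]
        if not matches:
            fused.append(list(node.feature))  # nothing to borrow; keep as-is
            continue
        dim = len(node.feature)
        mean = [sum(m[d] for m in matches) / len(matches) for d in range(dim)]
        fused.append([(1 - weight) * node.feature[d] + weight * mean[d]
                      for d in range(dim)])
    return fused
```

In this sketch the fused list has one vector per target node, which mirrors the abstract's notion of augmented graph node tokens only loosely: a real system would use learned encoders and a learned fusion step rather than label matching and averaging.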
