Deep Video Discovery (DVD)
- Deep Video Discovery (DVD) is an agentic framework that segments and indexes long-form videos into uniform clips for effective query answering.
- It leverages a multi-granular database and three search-centric tools—GlobalBrowse, ClipSearch, and FrameInspect—to analyze content at global, clip, and frame levels.
- Empirical evaluations on benchmark tasks demonstrate state-of-the-art accuracy, underscoring the benefits of iterative planning and dynamic tool use.
Deep Video Discovery (DVD) is an agentic framework for long-form video understanding, designed to address the computational and inferential challenges in answering queries over temporally extended, information-dense videos. The DVD agent operates by segmenting a given video into uniform clips, constructing multi-granular databases from these segments, and leveraging a set of parameterized, search-centric tools orchestrated by an LLM. This paradigm shifts away from static, manually designed agent workflows, emphasizing autonomous reasoning, dynamic planning, and iterative tool use to achieve state-of-the-art comprehension on benchmark tasks involving hour-long content (Zhang et al., 23 May 2025).
1. Formal Problem Setup
Given a long input video $V$, DVD partitions it into $N$ non-overlapping clips $\{c_i\}_{i=1}^{N}$, where

$$N = \left\lceil |V| / \tau \right\rceil$$

and $\tau$ is a fixed, empirically chosen clip duration (in seconds). Each clip $c_i$ is decoded to frames $F_i$ (at 2 fps), captioned ($c_i \mapsto s_i$), and embedded ($s_i \mapsto e_i$). A structured database is constructed as

$$\mathcal{D} = \big(\mathcal{R},\ \{(s_i, e_i, F_i)\}_{i=1}^{N}\big),$$

where $\mathcal{R}$ is a progressive subject registry aggregating high-level entities and attributes across the video.
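The uniform-partitioning step can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function name, the record layout, and the example clip duration are assumptions; only the non-overlapping clips and the 2 fps sampling rate come from the text.

```python
import math

def partition_clips(video_duration_s: float, clip_len_s: float, fps: float = 2.0):
    """Split a video into non-overlapping clips of `clip_len_s` seconds and
    list the frame timestamps sampled at `fps` frames per second per clip.
    Illustrative sketch of DVD's segmentation stage (2 fps per the paper)."""
    n_clips = math.ceil(video_duration_s / clip_len_s)  # N = ceil(|V| / tau)
    clips = []
    for i in range(n_clips):
        start = i * clip_len_s
        end = min((i + 1) * clip_len_s, video_duration_s)
        # Frame timestamps within [start, end) at the given sampling rate.
        step = 1.0 / fps
        frames, t = [], start
        while t < end:
            frames.append(round(t, 3))
            t += step
        clips.append({"index": i, "start": start, "end": end, "frames": frames})
    return clips
```

Each returned record would then be captioned and embedded to populate the clip-level tier of the database.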
The agent faces a user query $Q$ and must output a correct answer $\hat{y}$ through a decision-making process defined over a discrete action space

$$\mathcal{A} = \{\text{GlobalBrowse},\ \text{ClipSearch},\ \text{FrameInspect},\ \text{Answer}\}.$$

At each step $t$, the LLM-powered agent observes its accumulated history $H_{t-1}$, selects an action $a_t \in \mathcal{A}$ with parameters $p_t$, receives an observation $o_t$, and updates its history to $H_t$. The planning objective is to select a policy (realized by in-context prompt engineering of the LLM) that maximizes the expected probability of a correct final answer.
2. Multi-Granular Video Database Construction
DVD builds a multi-tiered representation for scalable retrieval and inspection:
- Global (registry level): The subject registry $\mathcal{R}$ logs salient entities and their attributes (name, appearance, temporal span), appended incrementally by running a vision–LLM (VLM) over each key frame with carry-over memory.
- Clip (semantic level): A corpus of captions $\{s_i\}$ and their dense embeddings $\{e_i\}$ supports semantic search.
- Frame (pixel level): Raw video frames $\{F_i\}$, indexed by clip, enable direct reference and pixel-level VQA.
This database supports three modalities of access: (a) global context browsing, (b) subsecond semantic retrieval, and (c) frame-level question answering.
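The three-tier layout can be sketched as plain data structures. This is a hypothetical schema for illustration: the class names, field names, and the `frames_in_range` helper are assumptions; only the tiers themselves (registry, captions plus embeddings, raw frames indexed by clip) come from the text.

```python
from dataclasses import dataclass, field

@dataclass
class SubjectEntry:
    """Global tier: one row of the progressive subject registry."""
    name: str
    appearance: str
    spans: list  # (start_s, end_s) intervals where the subject appears

@dataclass
class ClipRecord:
    """Clip tier: caption text plus its dense embedding; frame tier via paths."""
    index: int
    caption: str
    embedding: list                                   # vector from a frozen encoder
    frame_paths: list = field(default_factory=list)   # pixel-level access

@dataclass
class VideoDatabase:
    registry: list  # list[SubjectEntry]
    clips: list     # list[ClipRecord]

    def frames_in_range(self, start_s: float, end_s: float, clip_len_s: float):
        """Frame-tier access: paths of all frames whose clip overlaps [start_s, end_s]."""
        lo, hi = int(start_s // clip_len_s), int(end_s // clip_len_s)
        return [p for c in self.clips if lo <= c.index <= hi for p in c.frame_paths]
```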
3. Agentic Search Tools
The agent is provisioned with three principal tools, each with a parameterized interface for interacting with the database $\mathcal{D}$:
- GlobalBrowse($Q$): Returns a subject-centric summary (from the registry $\mathcal{R}$) and an event-centric summary (VLM-based, driven by $Q$) for coarse, high-level orientation.
- ClipSearch($q, k$): The agent synthesizes a semantic query $q$ and retrieves the top-$k$ temporally localized clips, computing cosine similarity between the embedding of $q$ and each caption embedding $e_i$.
- FrameInspect($[t_s, t_e], q'$): Samples up to 50 frames from the temporal range $[t_s, t_e]$ and performs VQA-based, pixel-grounded inference on the sub-query $q'$.
The "Answer" action consolidates all gathered evidence and prompts the LLM for the final response.
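The retrieval step behind ClipSearch reduces to cosine similarity over the caption embeddings. A minimal sketch, assuming the embeddings are already computed (the function name and argument shapes are illustrative):

```python
import numpy as np

def clip_search(query_emb, clip_embs, k=3):
    """Top-k clip retrieval by cosine similarity, in the spirit of ClipSearch.
    query_emb: (d,) embedding of the agent's synthesized query.
    clip_embs: (N, d) matrix of caption embeddings, one row per clip.
    Returns clip indices (descending similarity) and their scores."""
    q = query_emb / np.linalg.norm(query_emb)
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity against every clip
    top = np.argsort(-sims)[:k]       # indices of the k most similar clips
    return top.tolist(), sims[top].tolist()
```

The returned indices localize the query temporally, since each clip covers a known time span.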
4. Agent Architecture and Iterative Planning Loop
The core agentic logic operates via an observe–reason–act loop, resembling the ReAct paradigm. The agent interleaves chain-of-thought reflection, tool selection, evidence acquisition, and iterative state augmentation:
```
Input: question Q, max steps T, LLM M, toolset 𝒯
Initialize H_0 = {Q, A}
For t = 1 ... T:
    R_t = M.reason(H_{t-1})             # chain-of-thought reflection
    (a_t, p_t) = M.call(R_t, H_{t-1})   # select action and parameters
    If a_t = Answer: break
    o_t = a_t(p_t)                      # execute tool, observe result
    H_t = H_{t-1} ∪ {(R_t, a_t, o_t)}
If no Answer yet: ŷ = M.answer(H_T)
```
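The observe–reason–act loop can be made concrete in a few lines of Python. This is a sketch, not the paper's implementation: `model.reason`, `model.call`, and `model.answer` stand in for prompted LLM invocations, and `tools` maps action names to callables.

```python
def run_agent(question, model, tools, max_steps=8):
    """Iterative planning loop: reason over history, pick a tool, gather
    evidence, and stop when the model selects the Answer action."""
    history = [("question", question)]
    for _ in range(max_steps):
        thought = model.reason(history)                # chain-of-thought step
        action, params = model.call(thought, history)  # tool selection
        if action == "Answer":
            break
        observation = tools[action](params)            # evidence acquisition
        history.append((thought, action, observation)) # state augmentation
    return model.answer(history)                       # consolidate and respond
```

Any object exposing `reason`/`call`/`answer` can drive the loop, which makes the control flow easy to test with a scripted stand-in for the LLM.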
5. Decision-Process Framing
The agent's operation is modeled as an episodic Markov decision process where:
- State: $s_t = H_{t-1}$, the accumulated interaction history.
- Action: $a_t \in \mathcal{A} = \{\text{GlobalBrowse}, \text{ClipSearch}, \text{FrameInspect}, \text{Answer}\}$.
- Transition: the state evolves as $H_t = H_{t-1} \cup \{(R_t, a_t, o_t)\}$.
- Reward: terminal and binary, $r = 1$ if $\hat{y} = y^{*}$; $r = 0$ otherwise.
No explicit reinforcement learning is performed; policy learning is implicit, subsumed into the LLM's zero-shot, in-prompt reasoning.
6. Implementation: Model Choices and Execution
All DVD subcomponents utilize existing large-scale pretrained models:
- Captioning: GPT-4.1; memory-augmented at the database construction stage.
- Embeddings: Fixed, frozen language encoder projecting captions into a fixed-dimensional vector space.
- FrameInspect and VQA: OpenAI o3 (or o4-mini), invoked for few-shot visual QA.
- Core agent reasoning: OpenAI o3 via Azure API.
No fine-tuning or new loss functions are employed; the agent's flexibility and learning capabilities emerge from in-context prompt engineering and dynamic tool invocation.
7. Empirical Performance and Ablation Analysis
On the LVBench benchmark (1,549 multiple-choice questions across 103 hours of video), DVD attains 71.9% overall accuracy, exceeding MR Video (60.8%) by 11.1 points and OpenAI o3 (57.1%) by 14.8. Incorporating auxiliary transcripts (WhisperX) raises accuracy to 74.1%. DVD establishes state-of-the-art across all evaluated benchmark categories, including LongVideoBench (70.5% overall; +3.5 over prior SOTA), Video-MME (66.8%; +5.0), and EgoSchema (76.6%; +3.0).
Ablation studies reveal the contribution of each component (with transcripts, base score 74.1%):
| Configuration | Score (%) | Delta |
|---|---|---|
| No GlobalBrowse | 70.0 | –4.1 |
| No ClipSearch | 57.7 | –16.4 |
| No FrameInspect | 62.3 | –11.8 |
These results confirm critical synergy among all tools, with ClipSearch and FrameInspect especially essential for top-level accuracy.
8. Behavioral and Computational Analysis
Systematic behavioral tracing shows five agent interaction patterns (e.g., "Simple Action", "Iterative Search", "ClipSearch Trap"), with analysis indicating that longer agentic chains generally improve answer accuracy. However, chains that are excessively long often signal underlying model uncertainty, while short, overconfident chains (as seen in GPT-4o, average chain length 4.6) correlate with reduced accuracy. While DVD's iterative, tool-driven planning loop is central to performance, it introduces notable computational overhead.
9. Limitations and Prospects
DVD reframes long-form video question answering as an agentic search over a multi-granular database, orchestrated by a plan-as-you-go LLM. While achieving state-of-the-art benchmarks, the current design incurs nontrivial compute due to repeated tool use and database access. Future research may focus on optimizing data indexing and developing more sample- and compute-efficient tool invocation policies, potentially leveraging learning-based strategies to minimize overhead while retaining high accuracy (Zhang et al., 23 May 2025).