Deep Video Discovery (DVD)
- Deep Video Discovery (DVD) is an agentic framework that segments and indexes long-form videos into uniform clips for effective query answering.
- It leverages a multi-granular database and three search-centric tools—GlobalBrowse, ClipSearch, and FrameInspect—to analyze content at global, clip, and frame levels.
- Empirical evaluations on benchmark tasks demonstrate state-of-the-art accuracy, underscoring the benefits of iterative planning and dynamic tool use.
Deep Video Discovery (DVD) is an agentic framework for long-form video understanding, designed to address the computational and inferential challenges in answering queries over temporally extended, information-dense videos. The DVD agent operates by segmenting a given video into uniform clips, constructing multi-granular databases from these segments, and leveraging a set of parameterized, search-centric tools orchestrated by an LLM. This paradigm shifts away from static, manually designed agent workflows, emphasizing autonomous reasoning, dynamic planning, and iterative tool use to achieve state-of-the-art comprehension on benchmark tasks involving hour-long content (Zhang et al., 23 May 2025).
1. Formal Problem Setup
Given a long input video $V$, DVD partitions it into $N$ non-overlapping clips $\{c_i\}_{i=1}^{N}$, where

$$N = \left\lceil |V| / \tau \right\rceil$$

and $\tau$ is a fixed, empirically chosen clip duration (in seconds). Each clip $c_i$ is decoded to frames $F_i$ (at 2 fps), captioned ($c_i \mapsto s_i$), and embedded ($s_i \mapsto e_i$). A structured database is constructed as

$$\mathcal{D} = \big(\mathcal{R},\ \{(s_i, e_i, F_i)\}_{i=1}^{N}\big),$$

where $\mathcal{R}$ is a progressive subject registry aggregating high-level entities and attributes across the video.
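The uniform-partitioning step can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function name, the record layout, and the example clip duration are assumptions; only the non-overlapping clips and the 2 fps sampling rate come from the text.

```python
import math

def partition_clips(video_duration_s: float, clip_len_s: float, fps: float = 2.0):
    """Split a video into non-overlapping clips of `clip_len_s` seconds and
    list the frame timestamps sampled at `fps` frames per second per clip.
    Illustrative sketch of DVD's segmentation stage (2 fps per the paper)."""
    n_clips = math.ceil(video_duration_s / clip_len_s)  # N = ceil(|V| / tau)
    clips = []
    for i in range(n_clips):
        start = i * clip_len_s
        end = min((i + 1) * clip_len_s, video_duration_s)
        # Frame timestamps within [start, end) at the given sampling rate.
        step = 1.0 / fps
        frames, t = [], start
        while t < end:
            frames.append(round(t, 3))
            t += step
        clips.append({"index": i, "start": start, "end": end, "frames": frames})
    return clips
```

Each returned record would then be captioned and embedded to populate the clip-level tier of the database.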
The agent faces a user query $Q$ and must output a correct answer $\hat{y}$ through a decision-making process defined over a discrete action space

$$\mathcal{A} = \{\text{GlobalBrowse},\ \text{ClipSearch},\ \text{FrameInspect},\ \text{Answer}\}.$$

At each step $t$, the LLM-powered agent observes its accumulated history $H_{t-1}$, selects an action $a_t \in \mathcal{A}$ with parameters $p_t$, receives an observation $o_t$, and updates its history to $H_t$. The planning objective is to select a policy (realized by in-context prompt engineering of the LLM) that maximizes the expected probability of a correct final answer.
2. Multi-Granular Video Database Construction
DVD builds a multi-tiered representation for scalable retrieval and inspection:
- Global (registry level): The subject registry $\mathcal{R}$ logs salient entities and their attributes (name, appearance, temporal span), appended incrementally by running a vision–LLM (VLM) over each key frame with carry-over memory.
- Clip (semantic level): A corpus of captions $\{s_i\}$ and their dense embeddings $\{e_i\}$ supports semantic search.
- Frame (pixel level): Raw video frames $\{F_i\}$, indexed by clip, enable direct reference and pixel-level VQA.
This database supports three modalities of access: (a) global context browsing, (b) subsecond semantic retrieval, and (c) frame-level question answering.
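The three-tier layout can be sketched as plain data structures. This is a hypothetical schema for illustration: the class names, field names, and the `frames_in_range` helper are assumptions; only the tiers themselves (registry, captions plus embeddings, raw frames indexed by clip) come from the text.

```python
from dataclasses import dataclass, field

@dataclass
class SubjectEntry:
    """Global tier: one row of the progressive subject registry."""
    name: str
    appearance: str
    spans: list  # (start_s, end_s) intervals where the subject appears

@dataclass
class ClipRecord:
    """Clip tier: caption text plus its dense embedding; frame tier via paths."""
    index: int
    caption: str
    embedding: list                                   # vector from a frozen encoder
    frame_paths: list = field(default_factory=list)   # pixel-level access

@dataclass
class VideoDatabase:
    registry: list  # list[SubjectEntry]
    clips: list     # list[ClipRecord]

    def frames_in_range(self, start_s: float, end_s: float, clip_len_s: float):
        """Frame-tier access: paths of all frames whose clip overlaps [start_s, end_s]."""
        lo, hi = int(start_s // clip_len_s), int(end_s // clip_len_s)
        return [p for c in self.clips if lo <= c.index <= hi for p in c.frame_paths]
```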
3. Agentic Search Tools
The agent is provisioned with three principal tools, each with a parameterized interface for interacting with the database $\mathcal{D}$:
- GlobalBrowse($Q$): Returns a subject-centric summary (from the registry $\mathcal{R}$) and an event-centric summary (VLM-based, driven by $Q$) for coarse, high-level orientation.
- ClipSearch($q, k$): The agent synthesizes a semantic query $q$ and retrieves the top-$k$ temporally localized clips, computing cosine similarity between the embedding of $q$ and each caption embedding $e_i$.
- FrameInspect($[t_s, t_e], q'$): Samples up to 50 frames from the temporal range $[t_s, t_e]$ and performs VQA-based, pixel-grounded inference on the sub-query $q'$.
The "Answer" action consolidates all gathered evidence and prompts the LLM for the final response.
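The retrieval step behind ClipSearch reduces to cosine similarity over the caption embeddings. A minimal sketch, assuming the embeddings are already computed (the function name and argument shapes are illustrative):

```python
import numpy as np

def clip_search(query_emb, clip_embs, k=3):
    """Top-k clip retrieval by cosine similarity, in the spirit of ClipSearch.
    query_emb: (d,) embedding of the agent's synthesized query.
    clip_embs: (N, d) matrix of caption embeddings, one row per clip.
    Returns clip indices (descending similarity) and their scores."""
    q = query_emb / np.linalg.norm(query_emb)
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity against every clip
    top = np.argsort(-sims)[:k]       # indices of the k most similar clips
    return top.tolist(), sims[top].tolist()
```

The returned indices localize the query temporally, since each clip covers a known time span.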
4. Agent Architecture and Iterative Planning Loop
The core agentic logic operates via an observe–reason–act loop, resembling the ReAct paradigm. The agent interleaves chain-of-thought reflection, tool selection, evidence acquisition, and iterative state augmentation:
```
Input: question Q, max steps T, LLM M, toolset 𝒯
Initialize H_0 = {Q, A}
For t = 1 ... T:
    R_t = M.reason(H_{t-1})             # chain-of-thought reflection
    (a_t, p_t) = M.call(R_t, H_{t-1})   # select action and parameters
    If a_t = Answer: break
    o_t = a_t(p_t)                      # execute tool, observe result
    H_t = H_{t-1} ∪ {(R_t, a_t, o_t)}
If no Answer yet: ŷ = M.answer(H_T)
```
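The observe–reason–act loop can be made concrete in a few lines of Python. This is a sketch, not the paper's implementation: `model.reason`, `model.call`, and `model.answer` stand in for prompted LLM invocations, and `tools` maps action names to callables.

```python
def run_agent(question, model, tools, max_steps=8):
    """Iterative planning loop: reason over history, pick a tool, gather
    evidence, and stop when the model selects the Answer action."""
    history = [("question", question)]
    for _ in range(max_steps):
        thought = model.reason(history)                # chain-of-thought step
        action, params = model.call(thought, history)  # tool selection
        if action == "Answer":
            break
        observation = tools[action](params)            # evidence acquisition
        history.append((thought, action, observation)) # state augmentation
    return model.answer(history)                       # consolidate and respond
```

Any object exposing `reason`/`call`/`answer` can drive the loop, which makes the control flow easy to test with a scripted stand-in for the LLM.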
5. Decision-Process Framing
The agent's operation is modeled as an episodic Markov decision process where:
- State: $s_t = H_{t-1}$, the accumulated interaction history.
- Action: $a_t \in \mathcal{A} = \{\text{GlobalBrowse}, \text{ClipSearch}, \text{FrameInspect}, \text{Answer}\}$.
- Transition: the state evolves as $H_t = H_{t-1} \cup \{(R_t, a_t, o_t)\}$.
- Reward: terminal and binary, $r = 1$ if $\hat{y} = y^{*}$; $r = 0$ otherwise.
No explicit reinforcement learning is performed; policy learning is implicit, subsumed into the LLM's zero-shot, in-prompt reasoning.
6. Implementation: Model Choices and Execution
All DVD subcomponents utilize existing large-scale pretrained models:
- Captioning: GPT-4.1; memory-augmented at the database construction stage.
- Embeddings: Fixed, frozen language encoder projecting captions into a fixed-dimensional vector space.
- FrameInspect and VQA: OpenAI o3 (or o4-mini), invoked for few-shot visual QA.
- Core agent reasoning: OpenAI o3 via Azure API.
No fine-tuning or new loss functions are employed; the agent's flexibility and learning capabilities emerge from in-context prompt engineering and dynamic tool invocation.
7. Empirical Performance and Ablation Analysis
On the LVBench benchmark (1,549 multiple-choice questions across 103 hours of video), DVD attains 71.9% overall accuracy, exceeding MR Video (60.8%) by 11.1 points and OpenAI o3 (57.1%) by 14.8. Incorporating auxiliary transcripts (WhisperX) raises accuracy to 74.1%. DVD establishes state-of-the-art across all evaluated benchmark categories, including LongVideoBench (70.5% overall; +3.5 over prior SOTA), Video-MME (66.8%; +5.0), and EgoSchema (76.6%; +3.0).
Ablation studies reveal the contribution of each component (with transcripts, base score 74.1%):
| Configuration | Score (%) | Delta |
|---|---|---|
| No GlobalBrowse | 70.0 | –4.1 |
| No ClipSearch | 57.7 | –16.4 |
| No FrameInspect | 62.3 | –11.8 |
These results confirm critical synergy among all tools, with ClipSearch and FrameInspect especially essential for top-level accuracy.
8. Behavioral and Computational Analysis
Systematic behavioral tracing shows five agent interaction patterns (e.g., "Simple Action", "Iterative Search", "ClipSearch Trap"), with analysis indicating that longer agentic chains generally improve answer accuracy. However, chains that are excessively long often signal underlying model uncertainty, while short, overconfident chains (as seen in GPT-4o, average chain length 4.6) correlate with reduced accuracy. While DVD's iterative, tool-driven planning loop is central to performance, it introduces notable computational overhead.
9. Limitations and Prospects
DVD reframes long-form video question answering as an agentic search over a multi-granular database, orchestrated by a plan-as-you-go LLM. While achieving state-of-the-art benchmarks, the current design incurs nontrivial compute due to repeated tool use and database access. Future research may focus on optimizing data indexing and developing more sample- and compute-efficient tool invocation policies, potentially leveraging learning-based strategies to minimize overhead while retaining high accuracy (Zhang et al., 23 May 2025).