VideoFOV: Geometric Retrieval in Video Editing
- The VideoFOV metric is a geometric scoring function defined over the union of directions visible along a camera trajectory, used to enforce cross-consistency in multi-turn video editing.
- It computes overlap and containment between video-level fields of view, providing a balanced similarity measure for selecting informative prior edits.
- Empirical results in Memory-V2V show that VideoFOV-based retrieval significantly reduces pose drift and improves multi-view consistency during iterative video synthesis.
The VideoFOV metric, introduced in the context of multi-turn video editing and particularly for evaluating cross-consistency in video novel view synthesis, quantifies the geometric relevance of prior camera trajectories for memory-augmented video-to-video diffusion models. This metric plays a crucial role in retrieval strategies for explicit memory caches, facilitating the selection of past video edit states that are most informative and consistent with a target camera trajectory. Its formulation evaluates the overlap and containment between fields of view induced by camera trajectories across the entire video, serving as a foundational similarity function for cache retrieval within frameworks such as Memory-V2V (Lee et al., 22 Jan 2026).
1. Motivation and Context
In multi-turn video editing scenarios, iterative user refinements necessitate strict cross-consistency across editing steps. Conventional video-to-video diffusion editors conditioned solely on the current input often exhibit 3D inconsistency, temporal drift, and loss of geometric or color coherence with successive edits. Memory-V2V addresses this by incorporating an explicit visual memory in the form of a cache populated with the latent representations of previously generated videos. Retrieval from this cache uses geometric information about camera movement to support consistent editing. Here, the VideoFOV metric emerges as the geometric scoring function for camera trajectory-based retrieval, enabling the model to condition on the most relevant prior edits (Lee et al., 22 Jan 2026).
2. Video-Level Field of View (VideoFOV) Construction
Given a camera trajectory $\{P_t\}_{t=1}^{T}$ for a video of $T$ frames, the “video-level field of view” is defined as the union of directions on the unit sphere that are visible in at least one frame of the trajectory:

$$\mathrm{FOV}(V) \;=\; \bigcup_{t=1}^{T} \left\{ d \in S^2 \,:\, d \text{ projects into the image at frame } t \text{ under pose } P_t \right\}.$$

In practice, this set is approximated using uniformly sampled directions on the viewing sphere, each marked “in-view” if it projects into the frame at time $t$ given the pose matrix $P_t$. The union covers all spatial directions seen throughout the clip, yielding a precise geometric measure of “what has been seen” by a given camera trajectory (Lee et al., 22 Jan 2026).
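This construction can be sketched numerically as follows. The function name `video_fov`, the world-to-camera pose convention, and the fixed sample count are illustrative assumptions, not details from the paper; the idea is simply to sample directions on the sphere and mark those that land inside any frame.

```python
import numpy as np

def video_fov(poses, K, width, height, n_dirs=4096, seed=0):
    """Approximate a video-level field of view: the set of unit-sphere
    directions visible in at least one frame of a camera trajectory.

    poses : iterable of 4x4 world-to-camera pose matrices, one per frame
            (assumed convention for this sketch)
    K     : 3x3 pinhole intrinsic matrix
    Returns (directions, seen) where `seen` is a boolean mask over the
    sampled directions (True = in-view for some frame).
    """
    rng = np.random.default_rng(seed)
    # Uniformly sample directions on the unit sphere via normalized Gaussians.
    d = rng.normal(size=(n_dirs, 3))
    d /= np.linalg.norm(d, axis=1, keepdims=True)

    seen = np.zeros(n_dirs, dtype=bool)
    for P in poses:
        R = P[:3, :3]                  # rotation part of the pose
        cam_dirs = d @ R.T             # directions in camera coordinates
        in_front = cam_dirs[:, 2] > 0  # keep only directions ahead of the camera
        proj = cam_dirs @ K.T          # apply the intrinsics
        with np.errstate(divide="ignore", invalid="ignore"):
            u = proj[:, 0] / proj[:, 2]
            v = proj[:, 1] / proj[:, 2]
        in_frame = (0 <= u) & (u < width) & (0 <= v) & (v < height)
        seen |= in_front & in_frame
    return d, seen
```

For a single identity pose with a 90° field of view, roughly one sixth of the sphere (the frustum's solid angle over $4\pi$) ends up marked as seen, which is a quick sanity check on the sampling.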
3. Similarity Metrics for Camera Trajectory Retrieval
To score the relevance between a cached video with trajectory $\tau_c$ and a target trajectory $\tau_t$, VideoFOV leverages the set overlap and containment between their respective video-level fields of view, $F_c = \mathrm{FOV}(\tau_c)$ and $F_t = \mathrm{FOV}(\tau_t)$:
- The overlap score (intersection over union): $\mathrm{overlap}(F_c, F_t) = \dfrac{|F_c \cap F_t|}{|F_c \cup F_t|}$
- The contain score (fraction of the target's FOV covered by the candidate): $\mathrm{contain}(F_c, F_t) = \dfrac{|F_c \cap F_t|}{|F_t|}$
- The combined VideoFOV similarity: $\mathrm{sim}(F_c, F_t) = \mathrm{overlap}(F_c, F_t) + \mathrm{contain}(F_c, F_t)$
The combination balances symmetric overlap and asymmetric coverage (containment), reflecting both mutual visibility and the degree to which the candidate covers the target's FOV. The top-$k$ cached videos by this similarity are retrieved for conditioning the current editing step (Lee et al., 22 Jan 2026).
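With FOVs represented as boolean masks over a shared sample of sphere directions, the scoring and top-$k$ retrieval reduce to set arithmetic. This is a minimal sketch assuming an IoU-style overlap and a target-coverage containment term combined by summation; the exact weighting in the paper may differ.

```python
import numpy as np

def fov_similarity(fov_c, fov_t, eps=1e-8):
    """Score a cached trajectory's FOV mask against a target's.

    fov_c, fov_t : boolean masks over the SAME sampled sphere directions,
    True where a direction is visible along the trajectory.
    """
    inter = np.count_nonzero(fov_c & fov_t)
    union = np.count_nonzero(fov_c | fov_t)
    overlap = inter / (union + eps)                     # symmetric IoU of the two FOVs
    contain = inter / (np.count_nonzero(fov_t) + eps)   # fraction of the target covered
    return overlap + contain

def retrieve_top_k(cache_fovs, target_fov, k=2):
    """Rank cached FOV masks by similarity to the target; return top-k indices."""
    scores = [fov_similarity(f, target_fov) for f in cache_fovs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

A candidate whose FOV exactly matches the target scores the maximum of 2 (perfect overlap plus full containment), while a disjoint candidate scores 0, so the ranking prefers caches that both resemble and cover the target viewpoint set.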
4. Functional Role within Memory-Augmented Editing
Within Memory-V2V, the VideoFOV metric is utilized during the retrieval phase of the memory cache pipeline for novel-view video synthesis tasks. After storing latent codes and metadata (including camera trajectories) of past edits in an external cache, retrieval is cast as a ranking problem over the cache using the VideoFOV similarity. This geometric grounding substantially enhances cross-consistency for pose-driven video synthesis, as the model can directly utilize relevant previously seen viewpoints, mitigating common failure modes such as geometric drift and appearance hallucination (Lee et al., 22 Jan 2026).
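The cache pipeline described above (store latents with trajectory metadata, then rank-and-retrieve by VideoFOV similarity) can be outlined as a small data structure. The class and method names here are hypothetical scaffolding, not the paper's API; the scoring inline assumes the IoU-plus-coverage form discussed earlier.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class CacheEntry:
    """One stored edit: a latent code plus the FOV mask of its trajectory."""
    latent: np.ndarray
    fov: np.ndarray  # boolean visibility mask over shared sampled directions

@dataclass
class MemoryCache:
    entries: list = field(default_factory=list)

    def store(self, latent, fov):
        self.entries.append(CacheEntry(latent, fov))

    def retrieve(self, target_fov, k=2):
        """Rank entries by VideoFOV-style similarity (IoU + target coverage)
        and return the k best latents for conditioning the next edit."""
        def score(e):
            inter = np.count_nonzero(e.fov & target_fov)
            union = np.count_nonzero(e.fov | target_fov) or 1
            tgt = np.count_nonzero(target_fov) or 1
            return inter / union + inter / tgt
        ranked = sorted(self.entries, key=score, reverse=True)
        return [e.latent for e in ranked[:k]]
```

Retrieval is thus a pure ranking problem over cached metadata: no decoding of stored latents is needed to decide which past edits to condition on.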
5. Empirical Impact and Usage
Quantitative evaluation in the Memory-V2V framework demonstrates that the VideoFOV-based retrieval, underpinning memory conditioning, leads to measurable improvements in multi-view consistency. For example, Memory-V2V achieves the lowest average multi-view inconsistency (0.1357 vs. 0.1485 for the best baseline) and reduced camera pose errors. Experimental baselines employ direct comparisons using VideoFOV for retrieval versus single-turn or autoregressive approaches, illustrating substantial consistency gains in iterative novel-view synthesis (Lee et al., 22 Jan 2026).
| Use Case | Retrieval Signal | Effect |
|---|---|---|
| Novel-view synthesis | VideoFOV metric | Improves cross-iteration consistency, reduces pose drift |
| Long-video editing | DINOv2 descriptor | Does not use VideoFOV directly; relies on semantic matching |
6. Limitations and Prospects
The VideoFOV metric, as formulated, assumes continuous scenes and well-defined camera trajectories. Its effectiveness may degrade for videos with abrupt shot changes or non-standard camera metadata. A plausible implication is that future work could improve geometric retrieval by integrating learnable FOV scorers or by combining geometric and visual descriptors at retrieval time. Potential broader applications of the VideoFOV framework include general video retrieval, temporal correspondence, and memory distillation for long-form video editing, especially as dataset complexity and diversity increase (Lee et al., 22 Jan 2026).
7. Related Memory and Retrieval Metrics
While VideoFOV is central to Memory-V2V’s geometric retrieval for multi-view consistency, other memory paradigms such as those in Shortcut-V2V exploit temporal redundancy reduction but do not utilize explicit geometric metrics for retrieval. Shortcut-V2V, for instance, approximates frame features from previous frames for compression, focusing on efficiency rather than explicit cross-view or cross-edit consistency modeled by fields of view (Chung et al., 2023). This distinction highlights the unique role of VideoFOV in geometrically principled memory retrieval and conditioning for iterative, viewpoint-driven video editing tasks.