Tight integration of geometry and semantics for video understanding

Establish methods that fully leverage geometric constraints (such as depth, camera pose, and correspondences) to stabilize high-level semantic reasoning and that, conversely, use semantic cues (such as object identity and interactions) to guide geometry estimation in dynamic and ambiguous video scenes. The goal is tighter integration between geometry and semantics in video understanding models.

Background

Unified video models increasingly require both physically grounded geometry and high-level semantics. The survey by An et al. argues for world-model-like systems that both perceive and predict, and notes that generation adds further constraints: outputs must be physically plausible and faithful to instructions.

Although joint representations are emerging, the survey notes that robustly using geometry to support semantics, and semantics to improve geometry, remains unsolved, especially in dynamic or ambiguous conditions.
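
To make the geometry-to-semantics direction concrete, the following is a minimal sketch, not a method from the survey. It assumes a static scene, a pinhole camera, known depth for frame A, and a known rigid motion from camera A to camera B. It reprojects each pixel of frame A into frame B and measures how often the per-pixel semantic labels agree. All function and variable names (backproject, transform, project, label_consistency) are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only. Assumptions (not from the survey): static scene,
# pinhole intrinsics (fx, fy, cx, cy), per-pixel depth for frame A, and a
# rigid motion (rotation, translation) mapping camera-A to camera-B coordinates.

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with known depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def transform(point, rotation, translation):
    """Apply a rigid motion: 3x3 rotation (nested lists), then translation."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point to pixel coordinates.

    Returns None when the point is not in front of the camera.
    """
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def label_consistency(labels_a, labels_b, depth_a, intrinsics, rotation, translation):
    """Fraction of frame-A pixels whose label matches the label found at the
    reprojected (nearest-pixel) location in frame B.

    Pixels that land outside frame B or behind the camera are skipped.
    """
    fx, fy, cx, cy = intrinsics
    height, width = len(labels_a), len(labels_a[0])
    agree = total = 0
    for v in range(height):
        for u in range(width):
            p = backproject(u, v, depth_a[v][u], fx, fy, cx, cy)
            q = project(transform(p, rotation, translation), fx, fy, cx, cy)
            if q is None:
                continue
            ub, vb = round(q[0]), round(q[1])
            if 0 <= ub < width and 0 <= vb < height:
                total += 1
                agree += labels_a[v][u] == labels_b[vb][ub]
    return agree / total if total else 0.0


# Toy example: a 2x3 label map at constant depth; the camera motion shifts
# every pixel one column to the right in frame B.
intrinsics = (2.0, 2.0, 0.0, 0.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift = (1.0, 0.0, 0.0)
depth_a = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
labels_a = [[0, 1, 1], [0, 1, 1]]
labels_b = [[9, 0, 1], [9, 0, 1]]
score = label_consistency(labels_a, labels_b, depth_a, intrinsics, identity, shift)
```

A low score under these assumptions could mean either a semantic error or a violation of the static-scene assumption, for example a moving object. That ambiguity is where the converse direction comes in: semantic cues such as object identity could indicate which pixels should be excluded from, or modeled separately in, the geometric check.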

References

While recent models demonstrate promising joint representations, fully leveraging geometric constraints to stabilize semantic reasoning, and conversely using semantic cues to guide geometry in dynamic and ambiguous scenes, remains an open problem.

Video Understanding: From Geometry and Semantics to Unified Models (2603.17840 - An et al., 18 Mar 2026), first outlook point, Section 6 (Conclusion and Outlook)