Tight integration of geometry and semantics for video understanding
Establish methods that couple geometry and semantics in both directions. In one direction, geometric constraints such as depth, camera pose, and correspondences should stabilize high-level semantic reasoning. In the other, semantic cues such as object identity and interactions should guide geometry estimation in dynamic and ambiguous video scenes. The goal is tighter integration between geometry and semantics in video understanding models.
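As a purely illustrative sketch of the geometry-to-semantics direction, the snippet below averages per-pixel class probabilities across two frames wherever a correspondence is trusted. It does not come from the cited survey and does not address the open problem itself. The function name `fuse_semantics_with_correspondences`, the integer `(row, col)` correspondence format, the validity mask, and the equal-weight average are all assumptions made for this example.

```python
import numpy as np

def fuse_semantics_with_correspondences(probs_a, probs_b, corr_ab, valid):
    """Hypothetical helper: smooth frame-A semantics using matched pixels in frame B.

    probs_a, probs_b: (H, W, C) per-pixel class probabilities.
    corr_ab: (H, W, 2) integer (row, col) in frame B matched to each pixel of frame A.
    valid: (H, W) boolean mask, True where the correspondence is trusted.
    """
    h, w, _ = probs_b.shape
    # Clip indices so untrusted correspondences cannot index out of bounds.
    rows = np.clip(corr_ab[..., 0], 0, h - 1)
    cols = np.clip(corr_ab[..., 1], 0, w - 1)
    warped_b = probs_b[rows, cols]  # frame-B probabilities aligned to frame A
    # Assumed fusion rule: equal-weight average where valid, frame A unchanged elsewhere.
    return np.where(valid[..., None], 0.5 * (probs_a + warped_b), probs_a)

# Toy 1x2 image with two classes; only the first pixel has a trusted match.
probs_a = np.array([[[0.6, 0.4], [0.5, 0.5]]])
probs_b = np.array([[[0.1, 0.9], [0.9, 0.1]]])
corr_ab = np.array([[[0, 1], [0, 0]]])
valid = np.array([[True, False]])
fused = fuse_semantics_with_correspondences(probs_a, probs_b, corr_ab, valid)
```

In practice the correspondences would come from estimated depth and camera pose or from optical flow, and the validity mask is where dynamic objects and occlusions make the coupling difficult.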
References
While recent models demonstrate promising joint representations, fully leveraging geometric constraints to stabilize semantic reasoning, and conversely using semantic cues to guide geometry in dynamic and ambiguous scenes, remains an open problem.
— Video Understanding: From Geometry and Semantics to Unified Models
(2603.17840 - An et al., 18 Mar 2026) in First outlook point, Section 6 (Conclusion and Outlook)