Extend Perceptio from static images to video with temporally consistent depth tokens and object tracking
Extend the Perceptio vision–language model from static images to video: generate temporally consistent sequences of VQ‑VAE depth tokens, track objects across frames, and address the optimization challenges that temporal consistency and tracking introduce.
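To make "temporally consistent" concrete, the sketch below computes one possible proxy: the fraction of depth-token positions whose codebook index changes between consecutive frames. This is an illustrative assumption, not a metric defined in the Perceptio paper. The function name `token_flip_rate` and the layout of each frame as a grid of integer codebook indices are hypothetical. The measure is crude, since real camera or object motion also changes tokens, so it is most meaningful on static or motion-compensated clips.

```python
from typing import Sequence

# Hypothetical layout: one frame is an H x W grid of VQ codebook indices.
TokenGrid = Sequence[Sequence[int]]


def token_flip_rate(frames: Sequence[TokenGrid]) -> float:
    """Fraction of token positions whose index changes between consecutive frames.

    Returns 0.0 for clips with fewer than two frames. Assumes every frame
    has the same grid shape (an assumption of this sketch).
    """
    changed = 0
    total = 0
    # Compare each frame with its successor, position by position.
    for prev, curr in zip(frames, frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                changed += int(a != b)
                total += 1
    return changed / total if total else 0.0


# A static clip: the same token grid repeated over three frames.
static_clip = [[[3, 7], [1, 4]]] * 3

# A flickering clip: one token changes in the middle frame and then reverts.
flicker_clip = [
    [[3, 7], [1, 4]],
    [[3, 9], [1, 4]],
    [[3, 7], [1, 4]],
]

static_rate = token_flip_rate(static_clip)
flicker_rate = token_flip_rate(flicker_clip)
```

A lower rate on a static scene indicates less token flicker. Whether such a quantity would be suitable as a training signal, given that the token indices are discrete, is part of the open optimization question.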
References
Second, our training and evaluation are limited to static images; extending to video, where temporally consistent depth tokens and object tracking introduce new optimization challenges, remains open.
— Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation
(2603.18795 - Li et al., 19 Mar 2026) in Discussion and Conclusion — Limitations paragraph