Extend Perceptio from static images to video with temporally consistent depth tokens and object tracking

Develop an extension of the Perceptio vision–language model to process video by generating temporally consistent sequences of VQ‑VAE depth tokens and performing object tracking across frames, and address the associated optimization challenges introduced by temporal consistency and tracking.

Background

Perceptio is introduced as a perception‑enhanced large vision–language model that generates in‑sequence 2D semantic‑segmentation tokens and discretized 3D depth tokens before emitting text, enabling explicit spatial reasoning. All training and evaluation in the paper are conducted on static images.

The authors identify a key limitation regarding temporal settings: extending the approach to video requires temporally consistent depth token generation and object tracking, both of which introduce new optimization challenges. They explicitly state that this extension remains open, flagging it as a direction for future research.
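The paper does not specify how temporal consistency of depth tokens would be enforced. One plausible ingredient, sketched below purely as an illustration (not from the paper), is a smoothness penalty on the per-frame depth-token distributions: a symmetric KL divergence between consecutive frames' predicted distributions over the VQ‑VAE codebook. The function name, tensor shapes, and loss form are all assumptions for this sketch.

```python
import numpy as np

def temporal_consistency_loss(logits, eps=1e-9):
    """Hypothetical temporal smoothness penalty (not from the Perceptio paper).

    logits: array of shape (T, N, V) — T frames, N depth-token positions,
    V entries in the VQ-VAE codebook. Returns the symmetric KL divergence
    between consecutive frames' token distributions, averaged over frames
    and positions; 0 when predictions are identical across frames.
    """
    # Numerically stable softmax over the codebook dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # Pair each frame t with frame t+1.
    p_t, p_next = p[:-1], p[1:]
    kl_fwd = np.sum(p_t * (np.log(p_t + eps) - np.log(p_next + eps)), axis=-1)
    kl_bwd = np.sum(p_next * (np.log(p_next + eps) - np.log(p_t + eps)), axis=-1)
    return float(np.mean(kl_fwd + kl_bwd))
```

In practice such a term would be added to the model's token prediction loss with a weighting coefficient; this is exactly the kind of design choice whose optimization behavior the authors leave open.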

References

Second, our training and evaluation are limited to static images; extending to video, where temporally consistent depth tokens and object tracking introduce new optimization challenges, remains open.

Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation  (2603.18795 - Li et al., 19 Mar 2026) in Discussion and Conclusion — Limitations paragraph