Modality-agnostic robust tracking under missing or degraded modalities

Develop a unified visual object tracking algorithm that maintains high accuracy and temporal consistency across RGB-only, RGB-plus-auxiliary (e.g., depth, thermal, event), and auxiliary-only inputs, and that remains resilient to missing or degraded modalities caused by sensor failure, occlusion, or bandwidth constraints.

Background

The paper reviews the progression from RGB-only tracking to multimodal and unified tracking frameworks that handle RGB-X inputs. It highlights practical issues such as sensor failures and temporary modality dropouts and introduces benchmarks that simulate missing-modality scenarios.

Despite these advances, it remains difficult to achieve robustness across varying modality availability, that is, to degrade gracefully when certain sensors are absent. The authors emphasize that real-world deployments demand trackers that function reliably under RGB-only, RGB-X, and X-only conditions.
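One common ingredient of such graceful degradation is to fuse only the modalities that are actually present, renormalizing their contribution weights, and to simulate dropouts during training. The sketch below is a minimal, hypothetical illustration of that idea; the function names, the fixed per-modality weights, and the dropout scheme are assumptions for illustration, not the method of the cited paper.

```python
import random

def fuse_features(features, weights):
    """Fuse per-modality feature vectors into one vector.

    Weights are renormalized over the modalities actually present
    (value not None), so a missing sensor does not shrink the output.
    Hypothetical sketch, not the paper's fusion module.
    """
    present = [m for m in weights if features.get(m) is not None]
    if not present:
        raise ValueError("at least one modality must be available")
    total = sum(weights[m] for m in present)
    dim = len(features[present[0]])
    fused = [0.0] * dim
    for m in present:
        w = weights[m] / total  # renormalized weight for this modality
        for i, v in enumerate(features[m]):
            fused[i] += w * v
    return fused

def modality_dropout(features, p=0.3, rng=random):
    """Training-time augmentation: randomly drop modalities.

    Guarantees at least one modality survives, mimicking sensor
    failure or bandwidth-limited dropouts at train time.
    """
    kept = {m: f for m, f in features.items() if rng.random() > p}
    if not kept:  # never drop everything
        m = rng.choice(list(features))
        kept = {m: features[m]}
    out = dict.fromkeys(features, None)
    out.update(kept)
    return out
```

With weights `{"rgb": 0.6, "depth": 0.2, "thermal": 0.2}`, feeding an input whose RGB feature is `None` makes the fusion fall back to an equal 0.5/0.5 blend of depth and thermal, which is the graceful-degradation behavior the problem statement asks for.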

Consequently, designing a tracker that is accurate, modality-agnostic, and resilient to missing or degraded inputs remains a critical open problem in visual object tracking.

References

Video Understanding: From Geometry and Semantics to Unified Models (2603.17840 - An et al., 18 Mar 2026), Summary paragraph, Section 3.2 (Video object tracking)