
Explicit Visual Prompts for Visual Object Tracking

Published 6 Jan 2024 in cs.CV (arXiv:2401.03142v1)

Abstract: Effectively exploiting spatio-temporal information is crucial for capturing target appearance changes in visual tracking. However, most deep learning-based trackers focus on designing a complicated appearance model or template-updating strategy, while neglecting the context between consecutive frames, and thus face the when-and-how-to-update dilemma. To address these issues, we propose a novel explicit visual prompts framework for visual tracking, dubbed EVPTrack. Specifically, we utilize spatio-temporal tokens to propagate information between consecutive frames without explicitly updating templates. As a result, we can not only alleviate the when-to-update challenge but also avoid the hyper-parameters associated with updating strategies. We then use the spatio-temporal tokens to generate explicit visual prompts that facilitate inference on the current frame. The prompts are fed into a transformer encoder together with the image tokens without additional processing, which improves the model's efficiency by sidestepping how-to-update. In addition, we incorporate multi-scale information as explicit visual prompts, providing multi-scale template features that enhance EVPTrack's ability to handle target scale changes. Extensive experimental results on six benchmarks (LaSOT, LaSOT_ext, GOT-10k, UAV123, TrackingNet, and TNL2K) validate that EVPTrack achieves competitive performance at real-time speed by effectively exploiting both spatio-temporal and multi-scale information. Code and models are available at https://github.com/GXNU-ZhongLab/EVPTrack.
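The data flow described in the abstract can be sketched in a few lines: spatio-temporal tokens act as explicit visual prompts, are concatenated with the image tokens, pass jointly through an encoder, and the updated tokens are carried to the next frame, so no explicit template update ever occurs. The sketch below is a minimal illustration under stated assumptions; `toy_encoder`, `track_frame`, and the token representation are hypothetical stand-ins, not the paper's implementation (which uses learned embeddings and a real transformer encoder).

```python
# Hedged sketch of the prompt-token data flow; all names here are
# illustrative assumptions, not EVPTrack's actual code.

def toy_encoder(tokens):
    """Stand-in for a transformer encoder layer: each token is pulled
    toward the mean of all tokens, so every token mixes with every other."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]

def track_frame(image_tokens, st_tokens):
    """One tracking step: spatio-temporal tokens serve directly as explicit
    visual prompts, are concatenated with the image tokens, and are jointly
    encoded; the updated prompt tokens become the state for the next frame."""
    prompts = [t[:] for t in st_tokens]      # prompts from tokens (identity here)
    joint = prompts + image_tokens           # no extra processing before encoding
    encoded = toy_encoder(joint)
    new_st = encoded[:len(prompts)]          # propagated spatio-temporal state
    frame_feats = encoded[len(prompts):]     # features used for box prediction
    return frame_feats, new_st

# Thread the state through a short "video": there is no separate template
# update step, hence no when/how-to-update decision.
st = [[0.0, 0.0]]                            # initial spatio-temporal token
for frame in ([[1.0, 2.0], [3.0, 4.0]], [[2.0, 1.0], [4.0, 3.0]]):
    feats, st = track_frame(frame, st)
```

The key design point mirrored here is that the prompts enter the encoder as ordinary tokens, so temporal context is injected without any dedicated update module.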
