
SPAMming Labels: Efficient Annotations for the Trackers of Tomorrow

Published 17 Apr 2024 in cs.CV | (2404.11426v3)

Abstract: Increasing the annotation efficiency of trajectory annotations from videos has the potential to enable the next generation of data-hungry tracking algorithms to thrive on large-scale datasets. Despite the importance of this task, there are currently very few works exploring how to efficiently label tracking datasets comprehensively. In this work, we introduce SPAM, a video label engine that provides high-quality labels with minimal human intervention. SPAM is built around two key insights: i) most tracking scenarios can be easily resolved. To take advantage of this, we utilize a pre-trained model to generate high-quality pseudo-labels, reserving human involvement for a smaller subset of more difficult instances; ii) handling the spatiotemporal dependencies of track annotations across time can be elegantly and efficiently formulated through graphs. Therefore, we use a unified graph formulation to address the annotation of both detections and identity association for tracks across time. Based on these insights, SPAM produces high-quality annotations with a fraction of ground truth labeling cost. We demonstrate that trackers trained on SPAM labels achieve comparable performance to those trained on human annotations while requiring only $3-20\%$ of the human labeling effort. Hence, SPAM paves the way towards highly efficient labeling of large-scale tracking datasets. We release all models and code.

Summary

  • The paper introduces SPAM, a labeling engine that combines synthetic pre-training, pseudo-labeling, and active learning to cut human annotation effort to a reported 3-20% of full manual labeling.
  • It uses a unified graph formulation to model spatiotemporal dependencies, and trackers trained on the resulting labels perform comparably to those trained on human annotations.
  • It evaluates the method on MOT17, MOT20, and DanceTrack, which span dense crowds and complex motion patterns.

Enhancing Annotation Efficiency for Multi-Object Tracking Datasets with SPAM: A Synthetic Pre-training and Active Learning Model

Introduction

The paper introduces SPAM, an engine for labeling tracking datasets, which are a crucial ingredient for developing accurate multi-object tracking (MOT) systems. Efficient labeling matters because annotating large-scale video datasets by hand is resource-intensive. SPAM combines synthetic pre-training (S), pseudo-labeling (P), and active learning (A) within a hierarchical graph-based model (M). According to the paper, this combination yields high-quality labels with minimal human intervention, requiring only 3-20% of the usual human labeling effort.

Graph-based Labeling Model

Theoretical Foundation

The SPAM engine is built around a unified graph formulation that covers the annotation of both detections and identity associations across frames. It exploits the observation that most tracking scenarios are easy to resolve: simple cases are labeled automatically, and human input is reserved for the smaller set of difficult ones.

  1. Graph Formulation Insight: The spatiotemporal dependencies inherent in video data are modeled as a graph. The nodes represent detections and the edges represent candidate identity associations between them, which gives a structured way to handle the temporal linkage between detected objects across frames.
  2. Dual Aspect Utilization: A model pre-trained on synthetic data generates pseudo-labels, and active learning directs human annotators to the more challenging cases, which keeps label quality high while minimizing effort.
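
The graph formulation in the list above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Detection` class, the `build_graph` function, the `max_gap` parameter, and the choice of IoU and frame gap as edge features are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Detection:
    """Hypothetical node type: one bounding box in one frame."""
    frame: int
    box: tuple  # (x1, y1, x2, y2)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_graph(detections, max_gap=2):
    """Return candidate edges (i, j, features) between detections that lie
    in different frames at most `max_gap` frames apart. Each edge is a
    hypothesis that nodes i and j share an identity."""
    edges = []
    for i, j in combinations(range(len(detections)), 2):
        di, dj = detections[i], detections[j]
        gap = abs(di.frame - dj.frame)
        if 0 < gap <= max_gap:  # same-frame detections cannot be linked
            edges.append((i, j, {"gap": gap, "iou": iou(di.box, dj.box)}))
    return edges


# Two objects observed in two consecutive frames (made-up boxes).
dets = [
    Detection(0, (0, 0, 10, 10)),
    Detection(0, (50, 50, 60, 60)),
    Detection(1, (1, 0, 11, 10)),
    Detection(1, (51, 50, 61, 60)),
]
edges = build_graph(dets)
for i, j, feats in edges:
    print(i, j, feats)
```

In a learned setting, a graph neural network would classify each edge as "same identity" or "different identity" from such features; here the features are only computed, not classified.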

Key Components and Functionality

The SPAM engine consists of:

  • Synthetic Pre-training: Model pre-trained on a synthetic dataset to handle the majority of straightforward tracking scenarios, eliminating the initial need for large-scale hand-annotated datasets.
  • Pseudo-labeling and Active Learning: Further reduction in human intervention by using model-inferred predictions to label new data and refining this process through active learning, focusing human efforts only on uncertain or complex cases.
  • Graph-based Model: Utilizes a hierarchical graph neural network (GNN) architecture to manage the complex relationships between objects over time, enhancing the understanding and predictions of the model regarding object movement and interaction in videos.
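
The interplay of pseudo-labeling and active learning described above can be sketched as follows. This is a simplified illustration under stated assumptions: the summary does not say which uncertainty measure SPAM uses, so binary entropy is swapped in here as a common choice, and `split_for_annotation`, `budget`, and the edge probabilities are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) prediction; 0 when fully confident."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def split_for_annotation(edge_probs, budget):
    """Send the `budget` most uncertain edges to a human annotator and
    pseudo-label the rest by thresholding the model probability at 0.5.

    edge_probs maps an edge (pair of node ids) to the model's predicted
    probability that both nodes share an identity.
    """
    ranked = sorted(
        edge_probs, key=lambda e: binary_entropy(edge_probs[e]), reverse=True
    )
    to_human = ranked[:budget]
    pseudo_labels = {e: edge_probs[e] >= 0.5 for e in ranked[budget:]}
    return to_human, pseudo_labels


# Made-up model outputs for four candidate edges.
probs = {("a", "b"): 0.98, ("a", "c"): 0.52, ("b", "d"): 0.03, ("c", "d"): 0.40}
to_human, pseudo = split_for_annotation(probs, budget=1)
print("ask a human about:", to_human)
print("pseudo-labels:", pseudo)
```

The `budget` parameter plays the role of the human labeling effort: the smaller it is relative to the number of edges, the larger the share of labels that come from the model alone.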

Evaluation and Results

Dataset and Metrics

The efficacy of SPAM was evaluated on three challenging tracking datasets: MOT17, MOT20, and DanceTrack. The metrics are HOTA, which balances detection and association accuracy; MOTA, which is dominated by detection errors; and IDF1, which measures identity preservation.
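
For readers unfamiliar with these metrics, the sketch below writes out their standard definitions from error counts. The function names and all input numbers are illustrative assumptions, not results from the paper. The full HOTA score additionally averages the per-threshold value over a range of localization thresholds, which is omitted here.

```python
import math


def mota(fn, fp, idsw, num_gt):
    """MOTA = 1 - (misses + false positives + identity switches) / GT boxes."""
    return 1.0 - (fn + fp + idsw) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN), an F1 score over identity matches."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def hota_at_threshold(det_a, ass_a):
    """HOTA at one localization threshold: geometric mean of detection
    accuracy (DetA) and association accuracy (AssA)."""
    return math.sqrt(det_a * ass_a)


# Made-up counts, for illustration only.
print(round(mota(fn=100, fp=50, idsw=10, num_gt=1000), 4))
print(round(idf1(idtp=800, idfp=100, idfn=200), 4))
print(round(hota_at_threshold(det_a=0.64, ass_a=0.81), 4))
```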

Main Findings

  1. Reduced Labeling Effort: Producing SPAM labels required only 3-20% of the human labeling effort needed for full manual annotation.
  2. Competitive Performance: The performance of trackers using SPAM's labels was comparable to those trained with fully human-annotated datasets.
  3. Efficiency Across Datasets: The model demonstrated robustness across various datasets, handling diverse challenges from dense crowds to complex motion patterns in dance sequences.

Future Implications and Research Directions

The introduction of SPAM marks a significant stride towards more sustainable and scalable approaches for annotating multi-object tracking datasets. Looking ahead:

  • Scalability: Further research could explore the scalability of SPAM to other forms of video data beyond pedestrian tracking, such as vehicle or animal movement.
  • Integration with Other AI Technologies: The potential integration of SPAM with other AI-driven technologies, like reinforcement learning or advanced scene comprehension, could push the boundaries of autonomous video analysis systems.
  • Enhanced Active Learning Strategies: Future iterations could develop more sophisticated active learning algorithms that further optimize the balance between automation and necessary human intervention.

Conclusion

SPAM combines synthetic pre-training, pseudo-labeling, and active learning within a graph-based framework. It substantially reduces the human input needed to label tracking data while keeping label quality high enough that trackers trained on its labels perform comparably to those trained on human annotations. This makes it a promising basis for annotating large-scale tracking datasets and for further work on automated video annotation.
