MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors

Published 17 Nov 2022 in cs.CV | (2211.09791v2)

Abstract: In this paper, we propose MOTRv2, a simple yet effective pipeline to bootstrap end-to-end multi-object tracking with a pretrained object detector. Existing end-to-end methods, MOTR and TrackFormer, are inferior to their tracking-by-detection counterparts mainly due to their poor detection performance. We aim to improve MOTR by elegantly incorporating an extra object detector. We first adopt the anchor formulation of queries and then use an extra object detector to generate proposals as anchors, providing a detection prior to MOTR. The simple modification greatly eases the conflict between joint learning detection and association tasks in MOTR. MOTRv2 keeps the query propagation feature and scales well on large-scale benchmarks. MOTRv2 ranks 1st place (73.4% HOTA on DanceTrack) in the 1st Multiple People Tracking in Group Dance Challenge. Moreover, MOTRv2 reaches state-of-the-art performance on the BDD100K dataset. We hope this simple and effective pipeline can provide some new insights to the end-to-end MOT community. Code is available at https://github.com/megvii-research/MOTRv2.

Citations (100)

Summary

The paper introduces a novel pipeline that integrates pretrained YOLOX proposals with a Deformable DETR framework to enhance detection in multi-object tracking.
It employs proposal query generation and proposal propagation to effectively combine track and detection cues for robust object tracking.
The approach delivers state-of-the-art results, achieving 73.4% HOTA on DanceTrack and top performance on the BDD100K dataset.

Overview of MOTRv2: Improving Multi-Object Tracking Performance

The paper presents MOTRv2, a pipeline designed to improve end-to-end multi-object tracking (MOT) by leveraging a pretrained object detector. The work addresses the main weakness of previous end-to-end methods such as MOTR and TrackFormer: their detection performance lags behind that of tracking-by-detection approaches. Adding an external object detector supplies the detection quality these methods lack.

Methodology

MOTRv2 builds on MOTR by using YOLOX-generated proposals as anchors in its Deformable DETR architecture. This integration involves two main components: proposal query generation and proposal propagation.

Proposal Query Generation: In this stage, YOLOX proposals, including their location and confidence scores, are utilized to initialize proposal queries. This process replaces the learnable detect queries in the original MOTR, providing specific detection cues for newborn or missed objects.
Proposal Propagation: This involves the concatenation of track queries from the previous frame with the current frame's proposal queries. MOTRv2 uses anchor-based modeling to lessen conflicts between detection and association tasks, resulting in a simplified optimization process.
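The two components above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Query` class, the function names, the normalized (cx, cy, w, h) box format, and the `score_thresh` cutoff are all hypothetical, and the real model carries learned embeddings alongside each anchor.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: normalized (cx, cy, w, h)


@dataclass
class Query:
    """Hypothetical container for one anchor-based query."""
    anchor: Box                     # box used as the query's anchor
    score: float                    # detector confidence or track score
    track_id: Optional[int] = None  # None marks a proposal (not yet a track)


def make_proposal_queries(
    detections: Sequence[Tuple[float, float, float, float, float]],
    score_thresh: float = 0.05,     # assumed cutoff, for illustration only
) -> List[Query]:
    """Proposal query generation: one query per detector proposal.

    Each detection is (cx, cy, w, h, score). The box becomes the anchor and
    the confidence is kept so a model could condition on it.
    """
    return [
        Query(anchor=(d[0], d[1], d[2], d[3]), score=d[4])
        for d in detections
        if d[4] >= score_thresh
    ]


def propagate(
    track_queries: Sequence[Query],
    detections: Sequence[Tuple[float, float, float, float, float]],
) -> List[Query]:
    """Proposal propagation: previous-frame track queries are concatenated
    with the current frame's proposal queries to form the decoder input."""
    return list(track_queries) + make_proposal_queries(detections)


if __name__ == "__main__":
    tracks = [Query(anchor=(0.5, 0.5, 0.1, 0.2), score=0.9, track_id=7)]
    dets = [(0.2, 0.3, 0.1, 0.1, 0.8), (0.7, 0.7, 0.05, 0.05, 0.01)]
    for q in propagate(tracks, dets):
        print(q)
```

In this sketch, a query with a `track_id` continues an existing trajectory, while a query without one can only start a new track or recover a missed object, which mirrors the role the summary assigns to proposal queries.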

Strong Numerical Results

The empirical evaluation shows clear gains in tracking accuracy. MOTRv2 ranked first in the 1st Multiple People Tracking in Group Dance Challenge with 73.4% HOTA on DanceTrack, and it reaches state-of-the-art performance on the BDD100K dataset. Using YOLOX proposals raises both detection and association accuracy, which supports the proposed modifications.
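For context, HOTA is built from these same two quantities, detection accuracy (DetA) and association accuracy (AssA). The definition below comes from the HOTA metric itself rather than from this summary; α is the localization threshold used when matching predictions to ground truth.

```latex
\mathrm{HOTA}_\alpha = \sqrt{\mathrm{DetA}_\alpha \cdot \mathrm{AssA}_\alpha},
\qquad
\mathrm{HOTA} = \frac{1}{19} \sum_{\alpha \in \{0.05,\, 0.10,\, \ldots,\, 0.95\}} \mathrm{HOTA}_\alpha
```

Because the score is a geometric mean, a tracker that is weak in either detection or association is penalized, so an improvement in both terms translates directly into a higher HOTA.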

Implications

On the theoretical side, the results suggest that detection and association are easier to optimize within one framework when they are partly decoupled, with a pretrained detector supplying the detection prior. Practically, MOTRv2 offers a robust baseline for future research on end-to-end MOT systems and suggests a path to closing the performance gap such systems have historically shown.

Speculative Future Developments

Future research directions could involve refining anchor propagation techniques and exploring different detectors for enhanced object localization. There is also potential to reduce computational overhead by optimizing the transformer-based processing in MOTR.

MOTRv2 advances end-to-end multi-object tracking by combining it with a conventional pretrained object detector, offering a promising direction for further research and for applications in complex MOT scenarios.
