- The paper presents an adaptive multi-tracklet tracking framework that robustly handles noisy detections from unseen object categories.
- It employs density-based clustering and hybrid segmentation to improve detection association and suppress identity switches.
- Experimental results on the GMOT-40 benchmark show state-of-the-art or near state-of-the-art performance under normal, one-shot, and zero-shot detection protocols.
Multi-Tracklet Tracking for Generic Targets with Adaptive Detection Clustering
Problem Statement and Motivation
Generic Multiple Object Tracking (GMOT) extends standard MOT to unconstrained visual domains, where object categories may be unseen or not predefined and detector performance degrades because little prior knowledge is available. This scenario presents substantial challenges: high false positive and false negative rates, together with degraded motion and appearance cues, increase fragmentation and identity switches and reduce tracking robustness. Existing solutions that aim to improve detector generalization with large-scale, multimodal datasets are costly and not easily transferable. This work therefore focuses on robust data association under severe detection uncertainty, leveraging tracklet-based aggregation and multi-hypothesis tracking (MHT) principles to break the trade-off between accuracy and efficiency.
Figure 1: Detection failures are far more frequent for unseen categories, resulting in lower-confidence and missing proposals compared to categories seen during detector training.
Technical Contributions
The proposed Multi-Tracklet Tracking (MTT) framework incorporates several key innovations:
- Adaptive Tracklet Generation: Detection sets are adaptively clustered into tracklets using a hybrid strategy that segments sequences at points indicated by abrupt changes in detection counts and confidence, as inferred from the detector output. This provides resilience to occlusions and scene transitions by localizing potential ambiguities within short segments.
- Density-Based Clustering: Inside these segments, detections are flexibly partitioned using DBSCAN, exploiting both spatial locality and, optionally, feature similarity. This clustering reduces the computational complexity of the subsequent association step by confining candidate matches to locally dense regions.
Figure 2: Layered graph model representing inter-frame detection correlation for formulating multi-dimensional assignment and tracklet association.
Figure 4: Example segmentation of a sequence into variable-length subsequences based on detector-reported abrupt variations.
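The segmentation idea can be illustrated with a minimal sketch. It assumes that a cut is placed wherever the per-frame detection count or mean confidence jumps relative to the previous frame; the function name, the thresholds `count_jump` and `conf_jump`, and the `max_len` cap are hypothetical choices for illustration, not values from the paper.

```python
from typing import List, Sequence


def segment_by_abrupt_change(
    counts: Sequence[int],
    mean_conf: Sequence[float],
    count_jump: float = 0.3,   # assumed relative-change threshold
    conf_jump: float = 0.15,   # assumed absolute-change threshold
    max_len: int = 10,         # assumed cap on segment length
) -> List[range]:
    """Split frames [0, n) into variable-length segments.

    A new segment starts at frame t when the detection count or the mean
    confidence changes abruptly versus frame t-1, or when the current
    segment has already reached max_len frames.
    """
    n = len(counts)
    if n == 0:
        return []
    starts = [0]
    for t in range(1, n):
        prev = max(counts[t - 1], 1)  # avoid division by zero
        rel_count_change = abs(counts[t] - counts[t - 1]) / prev
        conf_change = abs(mean_conf[t] - mean_conf[t - 1])
        too_long = t - starts[-1] >= max_len
        if rel_count_change > count_jump or conf_change > conf_jump or too_long:
            starts.append(t)
    bounds = starts + [n]
    return [range(a, b) for a, b in zip(bounds[:-1], bounds[1:])]


# Toy detector output: the count halves at frame 3, confidence drops at frame 5.
counts = [10, 10, 10, 5, 5, 5, 5]
mean_conf = [0.8, 0.8, 0.8, 0.8, 0.8, 0.5, 0.5]
segments = segment_by_abrupt_change(counts, mean_conf)
print([list(s) for s in segments])  # -> [[0, 1, 2], [3, 4], [5, 6]]
```

Cutting at these points keeps each ambiguous event (for example, an occlusion that removes half of the detections) at a segment boundary rather than inside a tracklet.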
Figure 6: Within-partition clustering using density-based algorithms to group detection proposals by proximity.
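To make the clustering step concrete, the following is a small pure-Python DBSCAN over detection box centers. It is a sketch under stated assumptions: only spatial proximity is used (the optional feature similarity is omitted), and `eps` and `min_pts` are illustrative values. A real system would more likely rely on a library implementation such as scikit-learn's `DBSCAN`.

```python
from math import hypot
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i: int) -> List[int]:
        # All points within eps of point i (the point itself is included).
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding
                queue.extend(nbrs)
    return labels


# Box centers from one segment: two dense groups and one isolated detection.
centers = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (50, 50)]
print(dbscan(centers, eps=1.5, min_pts=3))  # -> [0, 0, 0, 1, 1, 1, -1]
```

Association candidates can then be restricted to detections that share a cluster label, which is what shrinks the assignment problem.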
- Tracklet-Based MHT: Tracklets (multi-frame detection aggregates) become the primitives ("leaf nodes") of track trees in the MHT paradigm. The framework tracks global hypotheses using both motion and appearance cues with tracklet-level scoring.
- Sophisticated Scoring: Each tracklet carries a temporally aggregated feature, and the likelihood of its track-tree branch is estimated recursively as a weighted combination of log-likelihood scores: motion (Kalman filter residuals), feature similarity, and detection confidence.
- Tracklet Graph Optimization: The final trajectory hypotheses are globally optimized by converting the tree association problem into a Maximum-Weight Independent Set (MWIS) problem over an undirected graph, ensuring optimal non-overlapping selection of trajectory branches.
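The recursive, weighted scoring described in the list above might look roughly like the sketch below. The exact likelihood terms are not given in this summary, so every functional form here is an assumption: an isotropic 2-D Gaussian on the Kalman residual, a cosine similarity mapped to (0, 1], and the raw detection confidence. The weights `w_motion`, `w_app`, and `w_conf` are hypothetical parameters.

```python
import math
from typing import Tuple


def motion_loglik(residual: Tuple[float, float], variance: float) -> float:
    """Log-density of a 2-D residual under an isotropic Gaussian (assumed model)."""
    dx, dy = residual
    return -0.5 * (dx * dx + dy * dy) / variance - math.log(2.0 * math.pi * variance)


def appearance_loglik(cos_sim: float, eps: float = 1e-6) -> float:
    """Map cosine similarity in [-1, 1] to (0, 1] and take the log (assumed form)."""
    p = min(max((cos_sim + 1.0) / 2.0, eps), 1.0)
    return math.log(p)


def confidence_loglik(conf: float, eps: float = 1e-6) -> float:
    """Log of the detector confidence, clipped away from zero."""
    return math.log(min(max(conf, eps), 1.0))


def branch_score(parent_score: float, residual: Tuple[float, float], variance: float,
                 cos_sim: float, conf: float,
                 w_motion: float = 1.0, w_app: float = 1.0, w_conf: float = 1.0) -> float:
    """Score of a branch = parent's score + weighted increment for the new tracklet."""
    return (parent_score
            + w_motion * motion_loglik(residual, variance)
            + w_app * appearance_loglik(cos_sim)
            + w_conf * confidence_loglik(conf))


# A well-matched extension should outscore a poorly matched one.
good = branch_score(0.0, (1.0, 0.0), 4.0, cos_sim=0.9, conf=0.8)
bad = branch_score(0.0, (6.0, 0.0), 4.0, cos_sim=0.2, conf=0.3)
print(good > bad)  # -> True
```

Because the score is additive in log space, extending a branch by one tracklet only requires the parent's score plus one increment, which is what makes the recursion cheap.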
Figure 7: Overall workflow for the MTT system, integrating adaptive segmentation, clustering, and tracklet-based hypothesis management.
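The final MWIS selection in this workflow can be illustrated on a toy example. The sketch assumes that each candidate branch is a node weighted by its score and that two branches conflict when they share a tracklet. The exhaustive search is for illustration only: MWIS is NP-hard, so practical solvers use pruning or approximation. Branch names, tracklet ids, and weights are made up.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple


def conflict_edges(branches: Dict[str, Set[str]]) -> Set[FrozenSet[str]]:
    """Two branches conflict if they share at least one tracklet id."""
    return {frozenset((a, b))
            for a, b in combinations(branches, 2)
            if branches[a] & branches[b]}


def mwis_bruteforce(weights: Dict[str, float],
                    edges: Set[FrozenSet[str]]) -> Tuple[List[str], float]:
    """Exhaustively find the maximum-weight independent set (small graphs only)."""
    nodes = sorted(weights)
    best_set: List[str] = []
    best_w = 0.0
    for mask in range(1 << len(nodes)):
        chosen = [nodes[i] for i in range(len(nodes)) if (mask >> i) & 1]
        # Skip subsets containing any conflicting pair.
        if any(frozenset(pair) in edges for pair in combinations(chosen, 2)):
            continue
        w = sum(weights[v] for v in chosen)
        if w > best_w:
            best_set, best_w = chosen, w
    return best_set, best_w


# Hypothetical branches (sets of tracklet ids) and their scores.
branches = {"A": {"t1", "t2"}, "B": {"t2", "t3"}, "C": {"t4"}, "D": {"t3", "t4"}}
weights = {"A": 3.0, "B": 4.0, "C": 1.5, "D": 2.0}
print(mwis_bruteforce(weights, conflict_edges(branches)))  # -> (['B', 'C'], 5.5)
```

Here A conflicts with B, B with D, and C with D, so the best compatible selection is B together with C.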
Experimental Results
Experiments on GMOT-40, a public benchmark for GMOT covering multiple categories and conditions, provide comprehensive evaluation:
- Tracklet Generation Trade-offs: Fixed and adaptive segment lengths were compared. Increasing the segment size improves tracklet continuity and feature strength but also enlarges the optimization problem, yielding diminishing returns (as measured by pass rates and computation time). Adaptive strategies maintain high pass rates and efficiency while still extracting discriminative features.
Figure 3: Tracking runtime as a function of the chosen window size in the variable-length windowed tracklet generation.
- Tracklet Feature Discriminability: t-SNE analysis shows that tracklet-aggregated features are more separable in embedding space than frame-level detection features. Similarity histograms confirm this: tracklet aggregation suppresses the inter-class confusion that is common in detection-level representations.
Figure 5: t-SNE embedding of both frame-wise detection features and tracklet features; tracklets (triangles) show improved identity separability.
Figure 11: Histograms comparing pair-similarity distributions—tracklets enhance discrimination compared to per-detection features.
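A toy example can show why aggregated features may separate identities better than single-frame features. It assumes aggregation is a mean of L2-normalized per-frame features followed by re-normalization, which is a common choice and not necessarily the paper's; the feature vectors are invented for illustration.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def aggregate(features: Sequence[Sequence[float]]) -> List[float]:
    """Mean of L2-normalized per-frame features, re-normalized (assumed scheme)."""
    unit = [l2_normalize(f) for f in features]
    dim = len(unit[0])
    mean = [sum(f[k] for f in unit) / len(unit) for k in range(dim)]
    return l2_normalize(mean)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Two identities whose per-frame features are noisy in opposite directions.
identity_a = [[1.0, 0.6], [1.0, -0.6]]
identity_b = [[0.6, 1.0], [-0.6, 1.0]]

frame_level = cosine(identity_a[0], identity_b[0])                 # single frames look alike
tracklet_level = cosine(aggregate(identity_a), aggregate(identity_b))
print(frame_level > 0.8, abs(tracklet_level) < 1e-9)  # -> True True
```

Averaging cancels the frame-specific noise, so the two identities that looked similar in a single frame become nearly orthogonal at the tracklet level.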
- Benchmark Protocols: MTT is assessed under three detection protocols (normal, one-shot, and zero-shot):
  - In normal detection (with supervised detectors such as DETR-Siamese), MTT achieves top-2 MOTA and IDF1, outperforming or closely matching ByteTrack, BoT-SORT, and TbQ.
  - In one-shot detection (with template-based detector GlobalTrack), MTT registers a notable increase in both MOTA and IDF1 over other methods, including ByteTrack and BoT-SORT.
  - In the zero-shot protocol (with vision-language detectors like GLIP), MTT leads in MOTA, total errors, and identity switches, demonstrating superior robustness to noisy or under-trained detector output.
  - Under perfect (GT) detection, methods such as standard MHT maintain an edge, affirming that segmentation impacts information completeness, but the adaptive window mitigates most losses.
Quantitative tables in the original text summarize these results; MTT is consistently among the top performers across all protocols.
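For readers unfamiliar with the two headline metrics, the sketch below applies their standard definitions: MOTA = 1 - (FN + FP + IDSW) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). The counts in the usage lines are invented for illustration and are not results from the paper.

```python
def mota(fn: int, fp: int, idsw: int, num_gt: int) -> float:
    """Multiple Object Tracking Accuracy over a whole sequence."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (fn + fp + idsw) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1: harmonic mean of identity precision and identity recall."""
    denom = 2 * idtp + idfp + idfn
    if denom == 0:
        raise ValueError("no identity matches or errors to score")
    return 2 * idtp / denom


# Hypothetical error counts for one sequence with 1000 ground-truth boxes.
print(round(mota(fn=100, fp=50, idsw=10, num_gt=1000), 3))  # -> 0.84
print(round(idf1(idtp=600, idfp=200, idfn=400), 3))         # -> 0.667
```

MOTA can become negative when the total number of errors exceeds the number of ground-truth boxes, which is one reason noisy detectors depress it so strongly.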
Limitations and Open Issues
Despite its strong comparative performance, MTT, like all current solutions, faces major hurdles in generic tracking:
- Fundamental Reliance on Detection: Tracklet-based association cannot fully offset the high rate of false negatives and false positives when the detector is not reliable. MOTA and IDF1 remain moderate and future improvement is contingent on more robust general-purpose object detectors.
- Dynamic Object Modeling Constraints: Highly dynamic targets, such as biological swarms, defeat simple motion models. Reliance on a fixed motion process limits performance in these regimes.
- Pruning and Hypothesis Explosion: While tracklet aggregation reduces the combinatorial burden, computational load remains nontrivial. The pruning process could benefit from more selective tree construction, adaptive pruning depth, and early elimination of redundant branches for highly uncertain targets.
Broader Implications and Future Directions
Practically, MTT provides a template for efficient multi-target tracking in regimes characterized by detector uncertainty, absence of category priors, and dynamically varying scene composition—common in robotics, biological studies, and open-world video analysis. The explicit balancing of tracklet segment size, adaptive partitioning, and density-based local association is generalizable and may serve as a foundation for group-level or parallelized tracking systems.
Theoretical implications include progress toward scalable handling of NP-hard assignment problems and an improved understanding of how appearance and motion cues interact at different aggregation levels.
Probable next steps include:
- Integration of online feature refinement to improve re-identification capacity during tracking.
- Exploration of multi-modal priors that remain adaptable and low-cost.
- Development of more advanced pruning methods that adaptively control computational complexity.
Conclusion
The MTT framework unifies adaptive tracklet generation, density-based proposal clustering, and an MHT-based global association strategy to address the principal weaknesses of GMOT under uncertain detections. Benchmarking on GMOT-40 indicates state-of-the-art or near state-of-the-art accuracy and efficiency across multiple protocols. While advances in general-purpose detectors remain essential, adaptive tracklet-based tracking, as formalized here, is likely to play a key role in deploying practical, robust trackers for real-world, open-category video analysis.