Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking

Published 19 Jul 2016 in cs.CV | (1607.05781v1)

Abstract: In this paper, we develop a new approach of spatially supervised recurrent convolutional neural networks for visual object tracking. Our recurrent convolutional network exploits the history of locations as well as the distinctive visual features learned by the deep neural networks. Inspired by recent bounding box regression methods for object detection, we study the regression capability of Long Short-Term Memory (LSTM) in the temporal domain, and propose to concatenate high-level visual features produced by convolutional networks with region information. In contrast to existing deep learning based trackers that use binary classification for region candidates, we use regression for direct prediction of the tracking locations both at the convolutional layer and at the recurrent unit. Our extensive experimental results and performance comparison with state-of-the-art tracking methods on challenging benchmark video tracking datasets show that our tracker is more accurate and robust while maintaining low computational cost. For most test video sequences, our method achieves the best tracking performance, often outperforming the second best by a large margin.

Citations (240)

Summary

The paper presents ROLO, a novel recurrent convolutional approach that integrates YOLO and LSTM for enhanced tracking accuracy.
It regresses object locations directly, offers a heatmap-based alternative for spatial supervision, and relies on the LSTM's temporal processing to handle occlusions and motion blur.
The system can be trained end to end at low computational cost and, on most test sequences, outperforms existing state-of-the-art trackers.

Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking

The paper proposes an approach to visual object tracking built on a recurrent convolutional neural network, named ROLO (Recurrent YOLO). The framework applies spatial supervision in both the spatial and temporal domains, using Long Short-Term Memory (LSTM) units to improve tracking robustness and accuracy. Unlike traditional tracking methods such as the Kalman filter, the model is "doubly deep": it considers both the history of object locations and the robust visual features of past frames.

The proposed system uses YOLO (You Only Look Once) to collect rich visual features and infer preliminary locations, both of which are then processed by an LSTM to produce the final bounding box predictions. The central contribution of this research is the combination of these components into a single tracker that remains suitable for real-time use under challenging conditions such as occlusions and motion blur.
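The data flow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `lstm_step` and `track`, the toy dimensions (8 visual features, 6 hidden units), the gate ordering, and the random weights are all hypothetical. It only shows the idea of concatenating per-frame visual features with a preliminary box and regressing a box from the LSTM state at every frame.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W, b):
    """One standard LSTM step. W has 4*H rows of length len(x)+H.
    Gate order (an arbitrary choice for this sketch): input, forget, output, candidate."""
    H = len(h)
    z = x + h  # list concatenation: current input followed by previous hidden state
    pre = [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]
    i = [sigmoid(p) for p in pre[0:H]]
    f = [sigmoid(p) for p in pre[H:2 * H]]
    o = [sigmoid(p) for p in pre[2 * H:3 * H]]
    g = [math.tanh(p) for p in pre[3 * H:4 * H]]
    c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
    h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
    return h_new, c_new

def track(frames, W, b, W_out, b_out, hidden):
    """frames: list of (features, box) pairs, as a detector might emit per frame.
    Returns one regressed 4-value box per frame."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    boxes = []
    for feats, box in frames:
        x = list(feats) + list(box)  # visual features concatenated with region information
        h, c = lstm_step(x, h, c, W, b)
        # Linear read-out of the hidden state as a box (x, y, w, h).
        boxes.append([sum(w * v for w, v in zip(row, h)) + bo
                      for row, bo in zip(W_out, b_out)])
    return boxes

# Toy usage with random (untrained) weights; real feature vectors would be far larger.
random.seed(0)
FEAT, HID = 8, 6
IN = FEAT + 4
W = [[random.uniform(-0.1, 0.1) for _ in range(IN + HID)] for _ in range(4 * HID)]
b = [0.0] * (4 * HID)
W_out = [[random.uniform(-0.1, 0.1) for _ in range(HID)] for _ in range(4)]
b_out = [0.0] * 4
frames = [([random.random() for _ in range(FEAT)], [0.5, 0.5, 0.2, 0.3]) for _ in range(3)]
boxes = track(frames, W, b, W_out, b_out, HID)
print(len(boxes), len(boxes[0]))
```

Because the hidden and cell states are carried from frame to frame, each prediction depends on earlier features and locations as well as the current ones, which is what distinguishes this setup from running a detector independently on every frame.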

Key Contributions and Methodology

Integration of Spatial and Temporal Domains: The ROLO architecture extends the functionality of convolutional neural networks (ConvNets) by incorporating LSTM units, enabling efficient processing of spatiotemporal data. This integration is critical for accurately predicting object locations across sequential video frames.
Efficient End-to-End Training: The modular design of the network allows it to be trained end to end with gradient-based learning methods. The resulting tracker keeps computational cost low, which makes it feasible for real-time tracking applications.
Heatmap-based Spatial Supervision: In addition to using direct coordinate regression, the research introduces an alternative heatmap method, providing a more interpretable representation of tracking predictions. This method also facilitates robust performance during occlusion by leveraging spatial distributions rather than discrete coordinates.
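To make the heatmap idea concrete, the sketch below converts a normalised bounding box into a grid of cell values and back again. It is an illustration only: the 32-by-32 grid, the binary fill, the `(cx, cy, w, h)` box convention, and the function names `box_to_heatmap` and `heatmap_to_box` are assumptions chosen for this example, not details confirmed by the summary.

```python
import math

def box_to_heatmap(box, grid=32):
    """box = (cx, cy, w, h), all normalised to [0, 1].
    Returns a grid x grid list of lists with 1.0 in cells the box touches."""
    cx, cy, w, h = box
    x0 = min(grid - 1, max(0, int((cx - w / 2) * grid)))
    y0 = min(grid - 1, max(0, int((cy - h / 2) * grid)))
    # Cover at least one cell so that very small boxes do not vanish.
    x1 = min(grid, max(x0 + 1, math.ceil((cx + w / 2) * grid)))
    y1 = min(grid, max(y0 + 1, math.ceil((cy + h / 2) * grid)))
    return [[1.0 if x0 <= c < x1 and y0 <= r < y1 else 0.0
             for c in range(grid)]
            for r in range(grid)]

def heatmap_to_box(heatmap, threshold=0.5):
    """Recover the tightest box around all cells at or above the threshold.
    Returns None when no cell is active."""
    grid = len(heatmap)
    cells = [(r, c) for r, row in enumerate(heatmap)
             for c, v in enumerate(row) if v >= threshold]
    if not cells:
        return None
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    x0, x1 = min(cols) / grid, (max(cols) + 1) / grid
    y0, y1 = min(rows) / grid, (max(rows) + 1) / grid
    return ((x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0)

heat = box_to_heatmap((0.5, 0.5, 0.5, 0.25))
flat = [v for row in heat for v in row]  # flattened vector a network could regress
print(len(flat), sum(flat))              # 1024 cells, 128 of them active
print(heatmap_to_box(heat))              # (0.5, 0.5, 0.5, 0.25)
```

A regression target of this form spreads the location over many cells, so a prediction can be inspected visually and a partially wrong output still carries usable spatial information, which is the interpretability benefit the text refers to.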

Experimental Results

The authors conducted extensive empirical evaluations, benchmarking ROLO against existing state-of-the-art tracking methods on challenging video sequences. According to the abstract, ROLO achieves the best tracking performance on most test sequences, often outperforming the second-best tracker by a large margin, while keeping computational cost low enough for real-time applications.
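The summary does not state which metric underlies "tracking accuracy". A common choice in tracking benchmarks is the intersection over union (IoU) between predicted and ground-truth boxes, together with the fraction of frames whose IoU exceeds a threshold. The sketch below assumes that convention; the `(x, y, w, h)` top-left box format, the 0.5 threshold, and the function names are illustrative assumptions rather than the paper's evaluation code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h),
    where (x, y) is the top-left corner."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(preds, gts, threshold=0.5):
    """Fraction of frames whose predicted box overlaps the ground truth
    with IoU at or above the threshold."""
    if not gts:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) >= threshold)
    return hits / len(gts)

gts = [(0, 0, 10, 10)] * 3
preds = [(0, 0, 10, 10),   # perfect overlap, IoU = 1
         (5, 0, 10, 10),   # half overlap, IoU = 50 / 150
         (20, 20, 5, 5)]   # no overlap, IoU = 0
print([round(iou(p, g), 3) for p, g in zip(preds, gts)])  # [1.0, 0.333, 0.0]
print(round(success_rate(preds, gts), 3))                 # 0.333
```

Sweeping the threshold from 0 to 1 and plotting the success rate at each value gives the kind of success curve often used to compare trackers.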

Implications and Future Work

The integration of LSTM with ConvNets represents a significant step forward in object tracking, offering enhanced interpretability and robustness to complex video dynamics. The successful application of spatio-temporal regression in this setting opens avenues for further research into multi-target tracking and online learning adaptations.

Looking ahead, the authors suggest potential enhancements such as using stacked LSTMs to optimize the cost functions and refining online learning techniques to adapt to unseen dynamics in real-time environments. Extending ROLO to multi-target scenarios through data association techniques would be a natural progression of this work.

Overall, this paper contributes valuable insights into the capabilities of recurrent deep learning models for video object tracking, setting the stage for future innovations in the field of computer vision.