- The paper introduces Spatiotemporal Sampling Networks (STSN), which use deformable convolutions for adaptive spatiotemporal feature sampling.
- The model is trained end-to-end for video object detection, eliminating the need for a separate optical flow network.
- Empirical results on ImageNet VID show improved mAP over flow-based baselines alongside a simpler overall design.
Insights into Spatiotemporal Sampling Networks for Video Object Detection
The paper presents an approach to video object detection built on Spatiotemporal Sampling Networks (STSN). The method uses deformable convolutions to make detection more robust across video frames by exploiting the spatial and temporal information inherently available in video. This addresses challenges such as motion blur, occlusion, and unusual object poses, which typically degrade performance when image-based detectors are applied to video frame by frame.
Key Contributions
- Deformable Convolutions for Temporal Sampling: STSN uses deformable convolutions to dynamically sample features from adjacent (support) frames conditioned on the current (reference) frame. This lets the network adaptively gather information about objects across the varied poses and viewpoints naturally present in video sequences.
- End-to-End Training Without Optical Flow: Unlike previous methods that relied on a separately trained optical flow network to align features across frames, STSN is trained end-to-end with video object detection as its sole optimization objective. This obviates the need for complex flow network designs and for large flow-annotated datasets, which are cumbersome to acquire and integrate.
- Performance and Design Simplification: Empirical evaluations on the ImageNet VID dataset indicate that STSN outperforms the previous state of the art in mean Average Precision (mAP). Importantly, this performance is attained with a simpler architecture and training process, requiring neither external temporal post-processing nor additional flow data.
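The core sampling idea behind the first contribution can be illustrated in isolation: given a support frame's feature map and a per-location offset field, features are gathered by bilinear interpolation at the offset positions. Below is a minimal NumPy sketch under stated assumptions; in the actual STSN the offsets are predicted by deformable convolution layers over concatenated reference/support features, whereas here they are simply passed in, and all function names are illustrative, not from the paper.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate feat[H, W, C] at fractional location (y, x)."""
    H, W, _ = feat.shape
    y, x = np.clip(y, 0, H - 1), np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0]
            + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0]
            + wy * wx * feat[y1, x1])

def deform_sample(support_feat, offsets):
    """For each location (i, j), sample support_feat at (i + dy, j + dx).

    offsets[i, j] = (dy, dx). In STSN these offsets would be predicted by
    deformable convolutions; here they are given inputs for illustration.
    """
    H, W, C = support_feat.shape
    out = np.empty_like(support_feat)
    for i in range(H):
        for j in range(W):
            dy, dx = offsets[i, j]
            out[i, j] = bilinear_sample(support_feat, i + dy, j + dx)
    return out
```

With zero offsets this reduces to the identity, and fractional offsets blend neighboring features, which is what lets gradients flow back through the sampling locations during end-to-end training.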
Numerical Results
The paper reports consistent gains in detection accuracy. STSN achieves an mAP of 78.9, edging out the FGFA method, which requires optical flow. Further, when coupled with a simple temporal post-processing step such as Seq-NMS, STSN reaches an mAP of 80.4, indicating that it composes well with existing techniques.
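Seq-NMS, the post-processing step mentioned above, links overlapping detections across consecutive frames into chains and rescores them to stabilize per-frame confidences. The following is a simplified sketch, not the full algorithm: it recovers only the single highest-scoring chain via dynamic programming and assigns each linked box the chain's mean score. The function name, the `link_iou` threshold, and the mean-rescoring choice are assumptions for illustration.

```python
import numpy as np

def iou(a, b):
    """IoU between two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def seq_nms_rescore(frames, link_iou=0.5):
    """frames: list of (boxes[N, 4], scores[N]) per frame.

    Finds the highest-scoring chain of temporally overlapping boxes and
    assigns each box in the chain the chain's mean score (one chain only;
    full Seq-NMS repeats this with suppression until no chains remain).
    """
    # best[t][i]: max accumulated score of a chain ending at box i of frame t
    best = [scores.copy() for _, scores in frames]
    prev = [np.full(len(s), -1) for _, s in frames]
    for t in range(1, len(frames)):
        boxes_t, scores_t = frames[t]
        boxes_p = frames[t - 1][0]
        for i, b in enumerate(boxes_t):
            for j, p in enumerate(boxes_p):
                if iou(b, p) >= link_iou and best[t - 1][j] + scores_t[i] > best[t][i]:
                    best[t][i] = best[t - 1][j] + scores_t[i]
                    prev[t][i] = j
    # Backtrack the globally best chain.
    t_end = max(range(len(frames)), key=lambda t: best[t].max())
    chain = [(t_end, int(np.argmax(best[t_end])))]
    while True:
        t, i = chain[-1]
        j = int(prev[t][i])
        if j < 0:
            break
        chain.append((t - 1, j))
    chain.reverse()
    mean = float(np.mean([frames[t][1][i] for t, i in chain]))
    new_scores = [s.copy() for _, s in frames]
    for t, i in chain:
        new_scores[t][i] = mean
    return chain, new_scores
```

The dynamic program rewards boxes that persist with high overlap across frames, which is why such post-processing can lift a strong per-frame detector's mAP further.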
Practical and Theoretical Implications
From a practical perspective, STSN offers a more streamlined approach to video object detection that integrates into existing pipelines without the complexity of a flow network. Theoretically, the work shows that deformable convolutions embedded in temporal modules can effectively exploit the richer information present in video. It also suggests that future models may capture spatiotemporal structure directly, without relying on hand-crafted representations such as optical flow.
Future Developments
The methodology opens pathways for future research on refining spatiotemporal sampling mechanisms. Researchers may leverage more advanced architectures or loss functions tailored to specific video object detection tasks. There is also substantial potential in extending these concepts to jointly improve tracking and detection, or in adapting their successes back to static-image domains.
In conclusion, by extending deformable architectures to model space and time jointly, the paper provides a compelling solution for video object detection. STSN not only simplifies the design but also delivers robust performance, marking a step forward in intelligent video analysis.