- The paper introduces Recurrent Rolling Convolution (RRC) to enhance bounding box precision and achieve high IoU in single-stage object detectors.
- It employs recurrent feature aggregation that iteratively refines multi-scale feature maps, enabling end-to-end optimization of both classification and localization.
- Experimental results on the KITTI dataset demonstrate significant mAP improvements over SSD, especially in challenging scenarios with small, occluded, or overlapping objects.
An Analysis of Recurrent Rolling Convolution for Single Stage Object Detection
This paper, "Accurate Single Stage Detector Using Recurrent Rolling Convolution," presents a significant advancement in the field of object detection using Convolutional Neural Networks (CNNs). The authors address the limitations faced by single-stage detection networks in achieving high precision at elevated Intersection over Union (IoU) thresholds, which traditional two-stage methods typically handle more effectively. They introduce an innovative architecture known as Recurrent Rolling Convolution (RRC) to bridge this gap.
Context and Problem Statement
Among conventional approaches, two-stage methods such as the R-CNN family have achieved substantial accuracy gains by integrating region proposal networks. These models, however, involve more complex training pipelines and higher computational cost at deployment. Single-stage models such as SSD and YOLO offer simpler training and faster inference, but they often fall short in accuracy evaluations at higher IoU thresholds. The authors attribute this shortfall primarily to imprecise bounding box regression in complex scenes, especially those containing small, occluded, or overlapping objects.
Methodology
The proposed solution, RRC, facilitates a more integrated and context-aware feature aggregation process within a single-stage architecture. The authors leverage the concept of recurrent computations to iteratively refine feature maps, thus enabling contextual information to be selectively incorporated into object classifiers and bounding box regressors. This approach capitalizes on multi-scale features and refines detection by recurrently rolling across different scales in the feature maps.
Key elements of the RRC architecture include:
- Recurrent Feature Aggregation: The architecture cyclically utilizes feature maps from different network depths, allowing for a collective refinement process "deep in context."
- End-to-End Training: The RRC is designed to be trained end-to-end, providing a unified framework that optimizes both classification and localization tasks.
- Robust Performance Across IoU Thresholds: By improving the precision of bounding boxes, RRC exhibits robustness in high IoU scenarios.
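The rolling aggregation described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: each scale of a feature pyramid repeatedly absorbs context from its finer neighbor (pooled down) and its coarser neighbor (upsampled), with the mixing weights shared across iterations to make the computation recurrent. The per-channel mixing matrices stand in for the learned transfer convolutions of the actual architecture.

```python
import numpy as np

def pool2x(x):
    """Downsample a (C, H, W) map by 2x average pooling (H, W assumed even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    """Upsample a (C, H, W) map by 2x nearest-neighbor repetition."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def rrc_step(maps, w_self, w_down, w_up):
    """One rolling iteration: every scale aggregates context from its neighbors.

    The weight matrices are shared across iterations (the recurrent part).
    This is a hypothetical simplification, not the paper's exact operators.
    """
    new_maps = []
    for i, f in enumerate(maps):
        agg = np.einsum('oc,chw->ohw', w_self, f)
        if i > 0:                     # context rolled in from the finer scale
            agg += np.einsum('oc,chw->ohw', w_down, pool2x(maps[i - 1]))
        if i < len(maps) - 1:         # context rolled in from the coarser scale
            agg += np.einsum('oc,chw->ohw', w_up, upsample2x(maps[i + 1]))
        new_maps.append(np.maximum(agg, 0.0))   # ReLU nonlinearity
    return new_maps

# Three scales of a toy feature pyramid: 8 channels at 32x32, 16x16, 8x8.
rng = np.random.default_rng(0)
maps = [rng.standard_normal((8, s, s)) for s in (32, 16, 8)]
w_self, w_down, w_up = (0.1 * rng.standard_normal((8, 8)) for _ in range(3))
for _ in range(4):                    # four rolling iterations with shared weights
    maps = rrc_step(maps, w_self, w_down, w_up)
```

Because the weights are reused at every iteration, each additional rolling step deepens the effective context of every scale without adding parameters, which is the core appeal of the recurrent formulation.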
Experimental Results
The RRC architecture was validated on the challenging KITTI dataset. It demonstrated superior performance over existing single-stage detection methods, surpassing previously published results on the KITTI leaderboard in the car, pedestrian, and cyclist categories. Notably, it achieved the top rank for car detection in the hard setting, which underscores its utility for high-precision applications.
Experimental data highlighted:
- Consistent reduction in validation loss across recurrent iterations, pointing to the efficacy of context-aware refinements.
- Significant mAP outperformance compared to SSD, particularly when benchmarking against high IoU thresholds (e.g., from 0.7 to 0.8).
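To make the threshold sensitivity concrete, the standard IoU formula (not code from the paper) shows how even a small localization error can drop a detection below the stricter thresholds used in the evaluation:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A 10x10 ground-truth box versus predictions shifted by 1 and 2 pixels:
iou((0, 0, 10, 10), (1, 0, 11, 10))   # 90/110 ~ 0.818 -> passes IoU 0.8
iou((0, 0, 10, 10), (2, 0, 12, 10))   # 80/120 ~ 0.667 -> fails IoU 0.7
```

A two-pixel shift on a ten-pixel box already fails the 0.7 threshold, which illustrates why tighter bounding box regression, rather than better classification, is the bottleneck RRC targets.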
Implications and Future Directions
RRC represents a meaningful step toward combining the computational efficiency of single-stage detectors with the high accuracy typically attributed to two-stage methods. From a theoretical perspective, it showcases the potential of recurrent architectures in visual recognition tasks where iteration and context accumulation can rectify initial localization inaccuracies.
Looking ahead, the exploration of memory-augmented recurrent architectures could further bolster accuracy, particularly in settings requiring sustained temporal reasoning, as suggested by the authors. Additionally, extending RRC to three-dimensional object detection offers an enticing opportunity to broaden its real-world applicability, particularly in autonomous driving and robotics.
In conclusion, the proposed RRC approach addresses a critical bottleneck in single-stage detector performance and offers a framework with the potential for broad application within and beyond object detection tasks. The research contributes a compelling case for revisiting the design of single-stage models to leverage recurrent processing advantages.