- The paper introduces an end-to-end trainable system that couples CNN features with an LSTM decoder to directly predict bounding boxes for people.
- It employs a novel loss function with Hungarian matching to accurately localize overlapping individuals without the need for non-maximum suppression.
- The model achieves significant improvements, with an average precision of 0.78 and a marked reduction in counting errors in crowded environments.
End-to-end People Detection in Crowded Scenes
This paper introduces an end-to-end model for detecting people in crowded scenes, addressing limitations inherent in existing object detection architectures. Current methods typically slide a window over the image or classify a set of region proposals independently, then rely on post-processing such as non-maximum suppression to prune duplicate detections. The proposed model eschews these steps by harnessing a Long Short-Term Memory (LSTM) layer to generate a coherent, structured sequence of detection predictions directly from input images.
Model Architecture
The architecture couples GoogLeNet convolutional features, which are fine-tuned jointly with the rest of the system, with an LSTM that generates detections as a sequence. This design converts image representations directly into bounding box predictions while avoiding redundant detections: the model predicts bounding boxes and their confidences one at a time, decoding the image content without post-processing.
The LSTM functions as a decoder that emits output bounding boxes one step at a time, and the entire system is trained end-to-end via back-propagation. The recurrence lets the model predict a variable number of objects, which is particularly useful for detection tasks involving overlapping or occluded objects.
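The decode loop described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: a single tanh layer with random weights stands in for the trained LSTM, the feature and hidden sizes are placeholders far smaller than GoogLeNet's, and the stopping threshold of 0.5 is an assumed value. The structural point it shows is how a recurrent decoder can emit a variable-length list of (box, confidence) pairs and halt on low confidence, removing the need for non-maximum suppression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: stand-ins for the (much larger) GoogLeNet feature
# dimension and LSTM hidden dimension used in the paper.
FEAT_DIM, HIDDEN_DIM = 16, 8

# Stand-in recurrent cell: a single tanh layer with random (untrained)
# weights, in place of a trained LSTM.
W_h = rng.standard_normal((HIDDEN_DIM, HIDDEN_DIM)) * 0.1
W_x = rng.standard_normal((HIDDEN_DIM, FEAT_DIM)) * 0.1
W_box = rng.standard_normal((4, HIDDEN_DIM)) * 0.1   # (x, y, w, h) head
w_conf = rng.standard_normal(HIDDEN_DIM) * 0.1       # confidence head

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode(features, max_steps=10, stop_threshold=0.5):
    """Emit (box, confidence) pairs until confidence drops below threshold."""
    h = np.zeros(HIDDEN_DIM)
    detections = []
    for _ in range(max_steps):
        h = np.tanh(W_h @ h + W_x @ features)  # recurrent state update
        box = W_box @ h                        # predicted box for this step
        conf = sigmoid(w_conf @ h)             # confidence in [0, 1]
        if conf < stop_threshold:
            break  # variable-length output: the sequence simply ends
        detections.append((box, conf))
    return detections
```

With trained weights, each step would attend to a not-yet-explained person in the image; here the loop only demonstrates the control flow of sequential, post-processing-free prediction.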
Key Contributions
The primary contributions include:
- End-to-End Trainable System: A unified framework that refines all model components during training, enhancing performance over independent prediction systems.
- Novel Loss Function: A tailored loss that jointly penalizes localization error and confidence error, training the decoder to emit accurate, well-ranked detections.
- Avoidance of Post-Processing: The integrated system negates the need for non-maximum suppression by managing sequential dependency directly within prediction generation.
- Sequence Generation with LSTM: Successful application of LSTM chains for decoding image content into variable-length outputs.
Methodological Insights
The architecture leverages deep convolutional networks to capture high-level image features, which are then decoded through an LSTM mechanism. This blend of CNN encoding with RNN-based decoding results in coherent variable-length prediction sequences.
A distinctive aspect of training is the use of the Hungarian algorithm to establish a one-to-one matching between predictions and ground-truth instances, so that each person is claimed by exactly one prediction. The loss then penalizes localization error on matched predictions and penalizes confident duplicates, improving both accuracy and precision.
Experimental Results
The model was evaluated on a dataset of crowded scenes containing over 91,000 labeled person instances. The system outperformed baseline models, including an OverFeat detector augmented with the same GoogLeNet representation, achieving an average precision (AP) of 0.78 and a notable reduction in counting errors relative to the OverFeat variants.
Future Implications
The results indicate the model's potential applicability in more complex computer vision tasks involving structured outputs, such as multi-person tracking and articulated pose estimation. Future work might explore the application of this approach in diverse visual domains and its integration with other deep learning advancements.
Overall, this paper presents a significant advancement in the detection of occluded and densely packed object instances, offering insights and methodologies that can guide future research in automated visual detection systems.