A Discriminatively Learned CNN Embedding for Person Re-identification

Published 17 Nov 2016 in cs.CV | (1611.05666v2)

Abstract: We revisit two popular convolutional neural networks (CNN) in person re-identification (re-ID), i.e., verification and classification models. The two models have their respective advantages and limitations due to different loss functions. In this paper, we shed light on how to combine the two models to learn more discriminative pedestrian descriptors. Specifically, we propose a new siamese network that simultaneously computes identification loss and verification loss. Given a pair of training images, the network predicts the identities of the two images and whether they belong to the same identity. Our network learns a discriminative embedding and a similarity measurement at the same time, thus making full use of the annotations. Albeit simple, the learned embedding improves the state-of-the-art performance on two public person re-ID benchmarks. Further, we show our architecture can also be applied in image retrieval.

Citations (859)

Summary

The paper presents a siamese CNN that integrates identification and verification losses to learn discriminative embeddings for person re-ID.
The approach achieves state-of-the-art performance with a rank-1 accuracy of 79.51% and mAP of 59.87% on the Market1501 benchmark.
The method also demonstrates scalability by maintaining robust performance even with a 500k distractor set and applicability to generic image retrieval tasks.

The paper "A Discriminatively Learned CNN Embedding for Person Re-identification" by Zhedong Zheng, Liang Zheng, and Yi Yang proposes a convolutional neural network (CNN) architecture for person re-identification (re-ID). It combines the strengths of the two prevalent model families in person re-ID, verification models and identification models, in a single framework that benefits from both.

Background and Motivation

Person re-identification is primarily an image retrieval task: given a query image of a pedestrian, the objective is to find images of the same person captured by other cameras. The problem is challenging because pose, illumination, and camera viewpoint vary significantly between views. Existing CNN-based approaches fall into two categories:

Verification Models: These treat re-ID as a binary classification or similarity regression task, where the objective is to determine whether two given images depict the same person.
Identification Models: These re-ID methods treat the task as a multi-class classification problem, where the goal is to predict the identity of a given image from a set of known identities.

Each of these approaches has inherent advantages and limitations. Verification models learn a similarity measure directly, but they are trained only on pairwise same/different labels and therefore do not exploit the full identity annotations. Identification models make fuller use of the labeled data, but they do not explicitly account for pairwise similarity between images during training. The challenge, and the opportunity, lies in combining the two approaches to learn a more robust and discriminative pedestrian embedding.

Proposed Method

The authors propose a siamese network that is trained on identification and verification at the same time. For a given pair of images, the network predicts the identity of each image (identification loss) and whether the two images depict the same person (verification loss).

Identification Loss: This is implemented using a convolutional layer followed by a softmax layer that predicts the identity of each input image.
Verification Loss: To compare the two images in a pair, a non-parametric Square Layer is introduced. It subtracts the two embeddings element-wise and squares the result. A convolutional layer followed by a softmax then takes this squared difference and predicts whether the input pair corresponds to the same person.

This combined approach trains the network to develop embeddings that are both discriminative and capable of measuring similarity.
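The two losses described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings `f1` and `f2` are taken as already computed by the shared backbone, the convolutional classifiers are modeled as bias-free linear maps (a 1x1 convolution applied to a 1x1 feature map is equivalent to one), and the loss weights `lam_id` and `lam_ver` as well as all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


def linear(weights, x):
    """Bias-free linear map; weights holds one row per output class."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def square_layer(f1, f2):
    """Non-parametric Square Layer: element-wise (f1 - f2)^2."""
    return [(a - b) ** 2 for a, b in zip(f1, f2)]


def combined_loss(f1, f2, id1, id2, w_id, w_ver, lam_id=1.0, lam_ver=1.0):
    """Identification loss on both images plus verification loss on the pair.

    w_id  : K x D weights of the shared identity classifier (assumed shape)
    w_ver : 2 x D weights of the same/different classifier (assumed shape)
    lam_* : assumed loss weights, not values taken from the paper
    """
    loss_id = (cross_entropy(linear(w_id, f1), id1)
               + cross_entropy(linear(w_id, f2), id2))
    same = 1 if id1 == id2 else 0  # class 1 means "same person"
    loss_ver = cross_entropy(linear(w_ver, square_layer(f1, f2)), same)
    return lam_id * loss_id + lam_ver * loss_ver
```

Because both branches share the backbone and the identity classifier, gradients from all three cross-entropy terms update the same embedding, which is how the pairwise signal and the identity signal end up shaping one descriptor.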

Experimental Results

The proposed method is evaluated on two major person re-ID benchmarks: Market1501 and CUHK03. The experiments reveal several key findings:

Quantitative Improvements: The method shows significant improvements over state-of-the-art techniques. For instance, on Market1501, the proposed model achieves a rank-1 accuracy of 79.51% and a mean average precision (mAP) of 59.87% when using ResNet-50 as the backbone network.
Effectiveness Across Architectures: The combined loss approach enhances performance across different networks (CaffeNet, VGG16, ResNet-50), demonstrating the method's robustness and versatility.
Scalability: When evaluated on Market1501 with an additional 500k distractor images in the gallery, the method maintains high re-ID performance, indicating that the embedding remains discriminative as the gallery grows.
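For readers unfamiliar with the two metrics quoted above, the following sketch shows how rank-1 accuracy and average precision can be computed for a single query. It is an illustration under assumptions rather than the benchmark's evaluation code: ranking by cosine similarity is assumed here, the function names are hypothetical, and protocol details of the official evaluation (such as how same-camera or junk gallery images are handled) are omitted. mAP is the mean of the per-query average precision.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Gallery indices sorted from most to least similar to the query."""
    sims = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)


def rank1(ranked_labels, query_label):
    """1.0 if the top-ranked gallery image has the query's identity."""
    return 1.0 if ranked_labels and ranked_labels[0] == query_label else 0.0


def average_precision(ranked_labels, query_label):
    """Mean of the precision values at each rank where a true match occurs."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

Rank-1 only rewards getting the first result right, while average precision also accounts for how early the remaining true matches appear, which is why the two numbers can differ substantially on the same ranking.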

Additional Contributions

Apart from person re-ID, the paper also discusses applying the proposed model to generic image retrieval tasks, using the Oxford5k buildings dataset as a test case. Here, the model also achieves competitive or superior performance compared to existing methods, demonstrating its general applicability beyond pedestrian re-identification.

Future Work and Implications

The implications of this work are two-fold:

Practical Implications: The ability to effectively combine verification and identification models may lead to more reliable and accurate systems in surveillance, security, and other applications requiring robust person re-identification capabilities.
Theoretical Implications: The success of the combined loss approach encourages further research into hybrid models that leverage multiple types of supervisory signals, potentially improving various computer vision and pattern recognition tasks.

Future research directions suggested by the authors include exploring the method's application to other areas such as fine-grained classification and car recognition, as well as further improving the robustness and scalability of person re-ID models.

In summary, this paper presents an approach to learning discriminative CNN embeddings for person re-identification by integrating verification and identification models. The proposed siamese network improves on prior methods on the evaluated benchmarks, offering both theoretical insights and practical advancements in the field.
