
Temporal Complementary Learning for Video Person Re-Identification

Published 18 Jul 2020 in cs.CV (arXiv:2007.09357v1)

Abstract: This paper proposes a Temporal Complementary Learning Network that extracts complementary features of consecutive video frames for video person re-identification. Firstly, we introduce a Temporal Saliency Erasing (TSE) module including a saliency erasing operation and a series of ordered learners. Specifically, for a specific frame of a video, the saliency erasing operation drives the specific learner to mine new and complementary parts by erasing the parts activated by previous frames. In this way, diverse visual features are discovered across consecutive frames, together forming an integral characteristic of the target identity. Furthermore, a Temporal Saliency Boosting (TSB) module is designed to propagate the salient information among video frames to enhance the salient features. It is complementary to TSE, effectively alleviating the information loss caused by TSE's erasing operation. Extensive experiments show our method performs favorably against state-of-the-art methods. The source code is available at https://github.com/blue-blue272/VideoReID-TCLNet.

Citations (95)

Summary

  • The paper presents TCLNet, a novel framework that leverages temporal saliency erasing and boosting modules to extract and propagate discriminative video features.
  • It employs the TSE module to iteratively remove redundant salient parts from successive frames, ensuring unique feature extraction across the video.
  • Experiments on benchmarks like MARS demonstrate that TCLNet outperforms traditional methods, achieving an mAP of 83.0% and top-1 accuracy of 88.8%.


The paper "Temporal Complementary Learning for Video Person Re-Identification" presents an advanced approach to enhance the performance of video person re-identification tasks by leveraging spatial-temporal information through a novel Temporal Complementary Learning Network (TCLNet). This approach seeks to address the redundancies in existing methodologies by exploiting the temporal depth inherent in video data to mine complementary features and propagate salient information effectively.

Methodology

The authors introduce two key components within TCLNet: the Temporal Saliency Erasing (TSE) module and the Temporal Saliency Boosting (TSB) module.

Temporal Saliency Erasing Module: The TSE module is engineered to extract complementary features across successive video frames. It employs a series of ordered learners alongside a saliency erasing operation that iteratively removes already detected salient parts from preceding frames. This operation ensures that subsequent frames focus on discovering new discriminative parts, fostering a comprehensive visual representation of the target identity. The mechanism improves the granularity of the learned features by preventing multiple frames from redundantly concentrating on the same features, a common challenge in traditional approaches.
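The erasing-and-mining loop described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the channel-sum saliency map, and the fixed `erase_ratio` are assumptions made for clarity.

```python
import numpy as np

def temporal_saliency_erasing(frames, erase_ratio=0.3):
    """Hypothetical TSE sketch: each frame's learner is restricted to spatial
    locations not already claimed by earlier frames.
    frames: (T, C, H, W) per-frame feature maps.
    Returns: (T, C) complementary part features, one per frame."""
    T, C, H, W = frames.shape
    erased = np.zeros((H, W), dtype=bool)        # positions mined by earlier frames
    feats = []
    for t in range(T):
        fmap = frames[t].copy()
        fmap[:, erased] = 0.0                    # erase previously activated parts
        saliency = fmap.sum(axis=0)              # crude per-location activation map
        k = int(erase_ratio * H * W)
        top = np.argsort(saliency.ravel())[-k:]  # most salient remaining positions
        feats.append(fmap.reshape(C, -1)[:, top].mean(axis=1))  # pool the mined part
        mask = np.zeros(H * W, dtype=bool)
        mask[top] = True
        erased |= mask.reshape(H, W)             # mark these parts as consumed
    return np.stack(feats)
```

Because each iteration zeroes out the positions claimed so far, later frames are pushed toward previously unactivated regions, which is the complementary-mining behavior the module is designed to produce.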

Temporal Saliency Boosting Module: While TSE mitigates redundancy, it may inadvertently reduce the emphasis on the most salient features. To counteract this, the TSB module propagates salient information across frames, enhancing the representation without considerable information loss. Utilizing a query-memory attention framework, TSB reinforces salient features by incorporating relevant cues from the entirety of the video sequence.
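A query-memory attention step of this kind can be sketched as follows; this is an illustrative NumPy sketch in which the scaled dot-product scoring and the residual addition are assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_saliency_boosting(feats):
    """Hypothetical TSB sketch: each frame (query) gathers salient cues from
    the remaining frames (memory) and adds them to its own representation.
    feats: (T, D) per-frame feature vectors. Returns: (T, D) boosted features."""
    T, D = feats.shape
    out = np.empty_like(feats)
    for t in range(T):
        query = feats[t]
        memory = np.delete(feats, t, axis=0)         # the other T-1 frames
        attn = softmax(memory @ query / np.sqrt(D))  # relevance of each memory frame
        out[t] = query + attn @ memory               # propagate salient information
    return out
```

The residual form is what makes it complementary to the erasing step: salient cues suppressed in one frame can be recovered from the others rather than lost outright.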

Experimental Evaluation

The methodology was evaluated on several large-scale benchmarks including MARS, DukeMTMC-VideoReID, and iLIDS-VID. The results showed significant improvements over existing methods, particularly in mean Average Precision (mAP) and top-1 accuracy. For instance, TCLNet achieved an mAP of 83.0% and top-1 accuracy of 88.8% on the MARS dataset, outperforming several state-of-the-art approaches that rely on single-image feature learning or traditional temporal models such as recurrent layers and 3D convolutions.
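For reference, the two metrics reported here can be computed as in the sketch below. This is a generic re-identification evaluation, with illustrative variable names; dataset-specific conventions (e.g., filtering same-camera matches on MARS) are omitted.

```python
import numpy as np

def top1_and_map(dist, query_ids, gallery_ids):
    """Compute top-1 accuracy and mean Average Precision for re-identification.
    dist: (Q, G) distance matrix between query and gallery tracklets.
    query_ids: (Q,) identity labels; gallery_ids: (G,) identity labels."""
    order = np.argsort(dist, axis=1)        # gallery ranked by distance, per query
    top1 = 0.0
    aps = []
    for q in range(dist.shape[0]):
        matches = gallery_ids[order[q]] == query_ids[q]   # hit/miss down the ranking
        top1 += float(matches[0])                         # rank-1 correct?
        if matches.any():
            hits = np.cumsum(matches)
            precision = hits / (np.arange(len(matches)) + 1)
            aps.append((precision * matches).sum() / matches.sum())
    return top1 / dist.shape[0], float(np.mean(aps))
```

mAP rewards ranking all correct gallery tracklets highly, while top-1 only checks the single nearest match, which is why papers in this area report both.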

Implications and Future Directions

The implications of this research are multifaceted. Practically, TCLNet can substantially enhance the accuracy of person re-identification systems useful in security and surveillance applications, thus providing more reliable identification under varying conditions and occlusions. Theoretically, the paper introduces a methodology that balances the extraction of complementary features with retaining salient information, a paradigm that can be adapted to other fields involving sequential data analysis.

Looking forward, the work opens potential research directions involving more complex temporal modeling over longer sequences and techniques for handling noisy data, such as frames degraded by occlusion or poor image quality. Further exploration of combining TCLNet with other machine learning and computer vision techniques could yield improvements in domains such as autonomous vehicles and human activity recognition.

In conclusion, "Temporal Complementary Learning for Video Person Re-Identification" advances the field by innovatively addressing redundancy in video frames through the exploitation of spatial-temporal cues, achieving superior performance metrics over existing state-of-the-art solutions.
