
Joint Discriminative and Generative Learning for Person Re-identification

Published 15 Apr 2019 in cs.CV (arXiv:1904.07223v3)

Abstract: Person re-identification (re-id) remains challenging due to significant intra-class variations across different cameras. Recently, there has been a growing interest in using generative models to augment training data and enhance the invariance to input changes. The generative pipelines in existing methods, however, stay relatively separate from the discriminative re-id learning stages. Accordingly, re-id models are often trained in a straightforward manner on the generated data. In this paper, we seek to improve learned re-id embeddings by better leveraging the generated data. To this end, we propose a joint learning framework that couples re-id learning and data generation end-to-end. Our model involves a generative module that separately encodes each person into an appearance code and a structure code, and a discriminative module that shares the appearance encoder with the generative module. By switching the appearance or structure codes, the generative module is able to generate high-quality cross-id composed images, which are online fed back to the appearance encoder and used to improve the discriminative module. The proposed joint learning framework renders significant improvement over the baseline without using generated data, leading to the state-of-the-art performance on several benchmark datasets.

Citations (720)

Summary

  • The paper introduces DG-Net, a novel framework that jointly leverages generative and discriminative modules to address intra-class variations in person re-identification.
  • It decomposes pedestrian images into appearance and structure codes, enabling the synthesis of realistic images that boost re-id model training with superior FID and SSIM metrics.
  • Empirical results across multiple benchmarks show significant improvements in Rank@1 and mAP, establishing the framework as a state-of-the-art solution for re-id challenges.

Joint Discriminative and Generative Learning for Person Re-identification

The paper "Joint Discriminative and Generative Learning for Person Re-identification" addresses a significant challenge in person re-identification (re-id)—the issue of intra-class variations across different cameras. The authors present a novel approach that integrates generative models with discriminative re-id learning in an end-to-end framework, thereby enhancing the robustness of re-id embeddings against input variations.

Summary of Key Contributions

The primary innovation of this paper is the proposal of a joint learning framework named DG-Net, which tightly couples generative and discriminative modules for re-id. The generative module is designed to decompose each pedestrian image into two latent spaces: an appearance space and a structure space. This modularity allows the generative component to create high-quality cross-id composed images, which are subsequently used to improve the discriminative module.

Generative Component

The generative module consists of:

  • An appearance encoder (E_a): extracts appearance codes that encapsulate clothing, shoes, and other id-related cues.
  • A structure encoder (E_s): extracts structure codes that capture body size, pose, background, etc.
  • A decoder (G): synthesizes images by combining an appearance code with a structure code.
  • A discriminator (D): enforces the realism of generated images.

The module can generate images by swapping appearance or structure codes between two images. This functionality promotes the creation of realistic and diverse synthetic images that introduce varied intra-class variations to the training data.
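The code-swapping idea above can be illustrated with a minimal sketch. The real encoders and decoder are CNNs trained with reconstruction and adversarial losses; here they are stand-in linear maps with made-up dimensions, purely to show how same-id reconstruction and cross-id composition combine the two latent codes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the paper's networks (the real ones are CNNs);
# the dimensions below are illustrative, not from the paper.
APP_DIM, STR_DIM, IMG_DIM = 8, 4, 16
Wa = rng.standard_normal((APP_DIM, IMG_DIM))            # appearance encoder E_a
Ws = rng.standard_normal((STR_DIM, IMG_DIM))            # structure encoder E_s
Wg = rng.standard_normal((IMG_DIM, APP_DIM + STR_DIM))  # decoder G

def E_a(x): return Wa @ x
def E_s(x): return Ws @ x
def G(a, s): return Wg @ np.concatenate([a, s])

x_i = rng.standard_normal(IMG_DIM)  # image of identity i
x_j = rng.standard_normal(IMG_DIM)  # image of identity j

# Same-id reconstruction: appearance and structure codes from the same image.
recon_i = G(E_a(x_i), E_s(x_i))

# Cross-id composition: identity i's appearance rendered in identity j's
# structure (pose, body size, background) -- the synthetic training images.
cross_ij = G(E_a(x_i), E_s(x_j))

assert recon_i.shape == cross_ij.shape == (IMG_DIM,)
```

Because every (appearance, structure) pairing yields a plausible image, N identities with M images each can in principle produce N×M distinct compositions, which is what makes the synthetic data diverse.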

Discriminative Component

The discriminative module is embedded within the generative module by sharing the appearance encoder (E_a). Two specific learning tasks are introduced:

  • Primary Feature Learning: Utilizes a teacher-student model to assign dynamic soft labels to synthetic images, emphasizing the structure-invariant clothing information.
  • Fine-grained Feature Mining: Focuses on id attributes such as carrying, hair, or body size, which are invariant to clothing, thereby enhancing the discriminative power of the re-id model.
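The teacher-student supervision for primary feature learning can be sketched as follows. A teacher classifier (trained on real data) scores each synthetic image over the identity classes, and the student (the shared appearance encoder plus its classifier) is trained to match that distribution; a KL divergence between the two is a standard formulation of such soft-label distillation. The logits and class count here are hypothetical placeholders.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over identity classes.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

K = 5  # hypothetical number of identity classes
rng = np.random.default_rng(1)
teacher_logits = rng.standard_normal(K)  # from a model trained on real images
student_logits = rng.standard_normal(K)  # from the shared appearance encoder

p_teacher = softmax(teacher_logits)  # dynamic soft label for a synthetic image
q_student = softmax(student_logits)

# KL(p_teacher || q_student): penalizes the student for deviating from the
# teacher's (soft) identity assignment on the generated image.
kl = float(np.sum(p_teacher * (np.log(p_teacher) - np.log(q_student))))
assert kl >= 0.0
```

Soft labels matter here because a cross-id composed image mixes two identities, so a one-hot label would be ill-defined; the teacher's distribution reflects that ambiguity.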

Empirical Evaluations

Generative Performance

The authors evaluate generative quality using the Fréchet Inception Distance (FID) and Structural Similarity (SSIM) metrics. DG-Net significantly outperforms other generative methods such as LSGAN, PG^2-GAN, PN-GAN, and FD-GAN in both realism and diversity. For instance, DG-Net achieves an FID of 18.24 compared to the next best score of 54.23 by PN-GAN, indicating superior visual fidelity. Furthermore, interpolation experiments demonstrate that the learned appearance space is continuous and can smoothly transform between different identities, validating the robustness and generalizability of the approach.
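For reference, FID fits a Gaussian to the Inception-feature statistics of real and generated images and measures the distance between the two Gaussians (lower is better). The sketch below shows the special case of diagonal covariances, which avoids the matrix square root needed in the general formula; the full metric uses complete covariance matrices of Inception activations.

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    # FID between two Gaussians with diagonal covariances:
    # ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))
    diff = mu1 - mu2
    return float(diff @ diff + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2)))

# Identical feature statistics give an FID of 0.
mu = np.array([0.5, -1.0])
var = np.array([1.0, 2.0])
assert abs(fid_diagonal(mu, var, mu, var)) < 1e-9

# Shifting the generated distribution's mean increases the score.
assert fid_diagonal(mu, var, mu + 1.0, var) > 0.0
```

A lower FID (e.g. DG-Net's 18.24 vs. PN-GAN's 54.23) thus means the generated images' feature statistics sit closer to those of real pedestrian images.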

Re-identification Performance

Extensive experiments on three benchmark datasets—Market-1501, DukeMTMC-reID, and MSMT17—demonstrate that DG-Net achieves state-of-the-art performance. The combined feature learning framework (primary and fine-grained) consistently outperforms the baseline ResNet50 by significant margins: an average improvement of 6.1% in Rank@1 and 12.4% in mAP across the datasets. The end-to-end integration of generative and discriminative learning is shown to be more effective than training them separately, as evidenced by improved mAP scores.
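The two evaluation metrics above are the standard re-id ones: Rank@1 asks whether the top-ranked gallery image matches the query's identity, and mAP averages the precision over all correct matches in the ranked list. A minimal sketch with toy similarity scores (not from the paper):

```python
import numpy as np

def rank1_and_ap(scores, labels):
    # scores: similarity of one query to each gallery image (higher = closer)
    # labels: 1 if the gallery image shares the query's identity, else 0
    order = np.argsort(-scores)        # gallery sorted by decreasing similarity
    hits = labels[order]
    rank1 = float(hits[0])             # 1.0 if the top match is correct
    # Average precision: precision evaluated at each correct match's rank.
    cum = np.cumsum(hits)
    ranks_of_hits = np.flatnonzero(hits == 1) + 1
    ap = float((cum[hits == 1] / ranks_of_hits).mean())
    return rank1, ap

# Toy example: two correct matches ranked 1st and 2nd -> perfect scores.
scores = np.array([0.9, 0.2, 0.8, 0.1])
labels = np.array([1, 0, 1, 0])
r1, ap = rank1_and_ap(scores, labels)
assert r1 == 1.0 and ap == 1.0
```

mAP is then the mean of these per-query AP values over the whole query set, which is why it is more sensitive than Rank@1 to matches buried deep in the ranking.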

Implications and Future Directions

The joint learning framework proposed in this paper has several theoretical and practical implications. It successfully demonstrates that coupling generative and discriminative processes in a unified network can substantially enhance re-id performance. The modular design of the generative component, separating appearance and structure, provides a flexible mechanism to generate high-quality and diverse training samples without requiring additional pose or segmentation data.

For future developments, the integration of more sophisticated generative models and exploring unsupervised or semi-supervised scenarios could be fruitful directions. Additionally, addressing the limitation related to rare patterns such as logos on t-shirts could further refine the generative capabilities of DG-Net.

In conclusion, the paper presents a significant advancement in person re-identification by leveraging the synergies between generative and discriminative learning. The proposed DG-Net framework sets a new benchmark for both generative image quality and re-id accuracy, offering a comprehensive solution to the challenge of intra-class variation in re-id tasks.
