Dress Code: High-Resolution Multi-Category Virtual Try-On

Published 18 Apr 2022 in cs.CV, cs.AI, cs.GR, and cs.MM | (2204.08532v2)

Abstract: Image-based virtual try-on strives to transfer the appearance of a clothing item onto the image of a target person. Prior work focuses mainly on upper-body clothes (e.g. t-shirts, shirts, and tops) and neglects full-body or lower-body items. This shortcoming arises from a main factor: current publicly available datasets for image-based virtual try-on do not account for this variety, thus limiting progress in the field. To address this deficiency, we introduce Dress Code, which contains images of multi-category clothes. Dress Code is more than 3x larger than publicly available datasets for image-based virtual try-on and features high-resolution paired images (1024x768) with front-view, full-body reference models. To generate HD try-on images with high visual quality and rich in details, we propose to learn fine-grained discriminating features. Specifically, we leverage a semantic-aware discriminator that makes predictions at pixel-level instead of image- or patch-level. Extensive experimental evaluation demonstrates that the proposed approach surpasses the baselines and state-of-the-art competitors in terms of visual quality and quantitative results. The Dress Code dataset is publicly available at https://github.com/aimagelab/dress-code.

Summary

The paper presents Dress Code, a high-resolution dataset addressing previous limitations by featuring over 50,000 image pairs across diverse garment categories.
It introduces a novel three-stage pipeline and a Pixel-level Semantic-Aware Discriminator (PSAD) that significantly enhances image synthesis realism.
Experimental results show superior performance over nine state-of-the-art methods, with improved metrics like FID and KID in multi-garment and high-resolution settings.

Introduction

The paper introduces "Dress Code," a high-resolution dataset for virtual try-on that covers a diverse range of clothing categories, addressing significant limitations of previous datasets. The goal of image-based virtual try-on (VTON) is to realistically synthesize images of a person wearing a target garment without altering intrinsic attributes such as body shape and pose. Using the Dress Code dataset, the paper also proposes a new architecture built around a Pixel-level Semantic-Aware Discriminator (PSAD) for enhanced visual realism.

Figure 1: Differently from existing publicly available datasets for virtual try-on, Dress Code features different garments, also belonging to lower-body and full-body categories, and high-resolution images.

Dataset Description

The Dress Code dataset stands out for its more than 50,000 high-resolution image pairs (1024x768 pixels), covering a wide variety of upper-body, lower-body, and full-body clothing. Existing datasets largely omit full-body and lower-body garments; Dress Code addresses this gap and offers a significant advance in scale and detail.

Figure 2: Sample image pairs from the Dress Code dataset with pose keypoints, dense poses, and segmentation masks of human bodies.

Proposed Architecture

The architecture employs a three-stage pipeline: garment warping, human parsing estimation, and final image generation. The innovation in this architecture is the introduction of the PSAD, which enhances image realism by learning pixel-level semantic correlations instead of focusing only on image-level or patch-level details.
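The three stages above can be sketched as a simple composition of callables. This is only an illustrative skeleton: the class name `TryOnPipeline`, the stage names, and the inputs passed to each stage are assumptions made for the example, not the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import Any, Callable

Image = Any  # placeholder for whatever image array/tensor type is used


@dataclass
class TryOnPipeline:
    """Chains the three stages named in the text; every stage is a hypothetical callable."""

    # (garment, person) -> garment warped toward the person's pose
    warp_garment: Callable[[Image, Image], Image]
    # (person, warped garment) -> estimated human parsing (label map)
    estimate_parsing: Callable[[Image, Image], Image]
    # (person, warped garment, parsing) -> final try-on image
    generate_image: Callable[[Image, Image, Image], Image]

    def __call__(self, person: Image, garment: Image) -> Image:
        warped = self.warp_garment(garment, person)       # stage 1: garment warping
        parsing = self.estimate_parsing(person, warped)   # stage 2: human parsing estimation
        return self.generate_image(person, warped, parsing)  # stage 3: image generation
```

Keeping each stage behind a plain callable makes it possible to swap or test one stage in isolation, for example by passing stub functions in place of trained networks.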

Figure 3: Overview of the proposed architecture.

Pixel-level Semantic-Aware Discriminator (PSAD)

PSAD improves image synthesis by making semantic class predictions at the pixel level. It learns internal semantic representations and pushes the generator to produce high-quality images by penalizing semantic discrepancies.
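One way to picture a pixel-level, semantic-aware objective is a per-pixel classification over N semantic classes plus one extra "fake" class. The sketch below assumes that (N+1)-way formulation, which this summary does not spell out; the function names and the nested-list image layout are hypothetical and chosen only to keep the example dependency-free.

```python
import math
from typing import List, Sequence

Logits = List[List[Sequence[float]]]  # H x W x (N + 1) class scores per pixel
Labels = List[List[int]]              # H x W integer class ids


def pixel_cross_entropy(logits: Logits, targets: Labels) -> float:
    """Mean softmax cross-entropy over all pixels."""
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        for scores, target in zip(logit_row, target_row):
            peak = max(scores)  # subtract the max for numerical stability
            log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
            total += log_norm - scores[target]
            count += 1
    return total / count


def discriminator_loss(real_logits: Logits, real_parsing: Labels,
                       fake_logits: Logits, fake_class: int) -> float:
    """Assumed objective: real pixels target their semantic class,
    generated pixels target the extra 'fake' class."""
    fake_targets = [[fake_class] * len(row) for row in fake_logits]
    return (pixel_cross_entropy(real_logits, real_parsing)
            + pixel_cross_entropy(fake_logits, fake_targets))


def generator_loss(fake_logits: Logits, target_parsing: Labels) -> float:
    """Assumed objective: the generator does well when each generated pixel
    is classified as its intended semantic class rather than as 'fake'."""
    return pixel_cross_entropy(fake_logits, target_parsing)
```

Compared with an image-level or patch-level real/fake score, a loss of this shape gives the generator a separate training signal for every pixel and every semantic region.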

Experimental Results

Dress Code serves as a new benchmark for VTON systems, on which the proposed approach is compared against nine state-of-the-art methods. Models were trained at several image resolutions, including 512x384 and 1024x768, indicating that the approach maintains synthesis quality at high resolutions.

High-Resolution and Multi-Garment Settings

The proposed system showcases superior performance in high-resolution settings, maintaining image details across different clothing categories. Further, a novel multi-garment try-on setting was introduced, demonstrating effective image synthesis in scenarios that require multiple garments to be overlaid seamlessly.

Figure 4: Sample try-on results on the Dress Code test set.

Qualitative Comparisons

Qualitative assessments show clear improvements when the model is trained with PSAD rather than a traditional patch-based discriminator. PSAD also improved quantitative measures of synthesis realism, such as the Fréchet Inception Distance (FID) and Kernel Inception Distance (KID).
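For readers unfamiliar with KID, the following minimal sketch shows the usual form of the metric: an unbiased estimate of the squared maximum mean discrepancy between real and generated feature sets under a cubic polynomial kernel. It is not the paper's evaluation code. In practice the features come from an Inception network and the estimate is averaged over several subsets; here the feature vectors are toy lists and the function names are illustrative.

```python
from typing import Sequence

Vector = Sequence[float]


def poly_kernel(x: Vector, y: Vector) -> float:
    """Cubic polynomial kernel commonly used for KID: (x . y / d + 1) ** 3."""
    d = len(x)
    return (sum(a * b for a, b in zip(x, y)) / d + 1.0) ** 3


def kid_estimate(real: Sequence[Vector], fake: Sequence[Vector]) -> float:
    """Unbiased estimate of the squared MMD between two feature sets.

    Same-set terms skip the diagonal (i == j); the cross term uses all pairs.
    Each set needs at least two feature vectors.
    """
    m, n = len(real), len(fake)
    k_rr = sum(poly_kernel(real[i], real[j])
               for i in range(m) for j in range(m) if i != j)
    k_ff = sum(poly_kernel(fake[i], fake[j])
               for i in range(n) for j in range(n) if i != j)
    k_rf = sum(poly_kernel(x, y) for x in real for y in fake)
    return k_rr / (m * (m - 1)) + k_ff / (n * (n - 1)) - 2.0 * k_rf / (m * n)
```

Lower values indicate that the generated features are distributed more like the real ones. Because the estimator is unbiased, it can come out slightly negative for very similar sets.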

Figure 5: Qualitative comparison between Patch and PSAD.

Conclusion

The Dress Code dataset and associated architecture address critical shortcomings in existing VTON frameworks by supporting high-resolution multi-garment try-ons, significantly advancing the real-world applicability of virtual try-on systems. Future work will likely explore more intricate interactions between multiple garment layers using the Dress Code framework. The dataset and models can drive further research, potentially extending into more dynamic domains such as video-based try-on scenarios.