Re-CatVTON: Efficient Diffusion VTON Model
- Re-CatVTON is a diffusion-based virtual try-on model that leverages a single-UNet design with precise context feature conditioning and ground-truth garment injection.
- The model resolves context and denoising conflicts via spatial concatenation and selective loss exclusion, enhancing image realism and mitigating computational overhead.
- It achieves competitive FID, KID, and LPIPS scores on VITON-HD and DressCode benchmarks while delivering rapid inference and resource efficiency.
Re-CatVTON is a diffusion-based virtual try-on (VTON) model that advances the efficiency-performance boundary for synthesizing images of people wearing target garments. It is based on rigorous analysis of context feature conditioning and adopts a fundamentally single-UNet design, incorporating mechanisms to address context/denoising conflicts inherent to spatial concatenation. Re-CatVTON delivers superior or competitive image realism and fidelity versus prior methods, with a resource profile comparable to the most efficient single-UNet models while closing the gap with much heavier dual-UNet architectures (Na et al., 24 Nov 2025).
1. Motivation and Design Rationale
Virtual try-on aims to generate images of a person $I_p$ wearing a specified garment $I_g$. Dual-UNet diffusion models (e.g., Leffa, IDM-VTON) leverage a reference UNet to encode garment context and a try-on UNet for denoising, achieving high fidelity through explicit context fusion at each block, but at significant computational and memory cost.
Re-CatVTON hypothesizes that such context injection can be reframed via single-UNet mechanisms without sacrificing quality if conditioning conflicts are resolved. Three key theoretical insights motivate the architecture:
- Functional mismatch: Repurposing a pretrained denoising UNet as a context encoder without fine-tuning results in suboptimal capacity allocation.
- Time-aligned context: Conditioning must account for the frequency-band recovery schedule of diffusion, distributing timestep encoding across both denoising and context regions.
- Garment region loss exclusion: Penalizing noise estimation in the concatenated garment region is counterproductive when it should provide conditioning, not be denoised.
These insights lead to a single-UNet architecture with spatial concatenation, context-region loss exclusion, and ground-truth latent injection.
2. Architectural Structure and Conditioning Workflow
Re-CatVTON uses a single diffusion UNet for both noise prediction and context feature encoding. The input at each timestep $t$ is constructed as

$$X_t = \big[\, z_t^p \oplus_{hw} z_t^g \,\big] \oplus_c \big[\, \bar{z}_0^p \oplus_{hw} z_0^g \,\big]$$

where:
- $z_t^p$, $z_t^g$ are the noisy VAE latents of the person and garment,
- $\bar{z}_0^p$, $z_0^g$ are the corresponding clean latents (the person latent masked at the try-on region), obtained via VAE encoding,
- $\oplus_{hw}$ and $\oplus_c$ denote concatenation along the HW (spatial) and channel axes, respectively.
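The concatenation layout can be sketched in NumPy; the exact axis arrangement, the masked-person latent, and the `build_input` helper are illustrative assumptions rather than the paper's exact tensor layout:

```python
import numpy as np

def build_input(z_t_p, z_t_g, z0_p_masked, z0_g):
    """Assemble X_t: person and garment latents concatenated along the
    spatial (H) axis, then noisy and clean halves stacked along channels.
    Each latent has illustrative shape (C, H, W)."""
    noisy = np.concatenate([z_t_p, z_t_g], axis=1)       # spatial concat
    clean = np.concatenate([z0_p_masked, z0_g], axis=1)  # spatial concat
    return np.concatenate([noisy, clean], axis=0)        # channel concat

# toy latents: 4 channels, 8x6 spatial
C, H, W = 4, 8, 6
rng = np.random.default_rng(0)
latents = [rng.standard_normal((C, H, W)) for _ in range(4)]
X_t = build_input(*latents)
print(X_t.shape)  # channels and height both double: (8, 16, 6)
```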
Each denoising step can be represented as:

```
construct X_t as above            # may use GT-injected z_t^g (see Sec. 4)
epsilon_hat_t = epsilon_theta(X_t, t)
z_{t-1} = sampler(z_t, epsilon_hat_t, t)
```
The entire UNet is fine-tuned for VTON, in contrast to CatVTON, where only self-attention was adapted (Chong et al., 2024).
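A minimal NumPy sketch of one such step, using a deterministic DDIM-style update as a stand-in for the sampler (the paper's actual sampler may differ):

```python
import numpy as np

def ddim_step(z_t, eps_hat, t, alphas_bar):
    """Deterministic DDIM-style update: estimate z0 from the noise
    prediction, then re-project to the previous noise level."""
    a_t, a_prev = alphas_bar[t], alphas_bar[t - 1]
    z0_hat = (z_t - np.sqrt(1 - a_t) * eps_hat) / np.sqrt(a_t)
    return np.sqrt(a_prev) * z0_hat + np.sqrt(1 - a_prev) * eps_hat

# sanity check: with a perfect noise estimate, one step lands exactly on
# the forward-process latent at t-1 (with the same noise sample)
rng = np.random.default_rng(1)
alphas_bar = np.array([1.0, 0.7, 0.4])
z0 = rng.standard_normal((4, 8, 6))
eps = rng.standard_normal((4, 8, 6))
t = 2
z_t = np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1 - alphas_bar[t]) * eps
z_prev = ddim_step(z_t, eps, t, alphas_bar)
expected = np.sqrt(alphas_bar[t - 1]) * z0 + np.sqrt(1 - alphas_bar[t - 1]) * eps
```

With a perfect estimate `eps_hat = eps`, the recovered `z0_hat` equals `z0` exactly, so the step reproduces the forward-process latent at $t-1$.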
3. Context Feature Learning and Loss Formulation
Re-CatVTON explicitly addresses the context-encoding/denoising functional mismatch:
- The model excludes the noise-prediction loss in the garment (context) region, focusing all $\epsilon$-prediction residuals on the person/outfit region. Mathematically, if $\hat{\epsilon}_t = \epsilon_\theta(X_t, t)$ is the UNet output, only the person-region slice $\hat{\epsilon}_t^{\,p}$ is supervised.
The loss integrates the DREAM-style rectified target for the person region:

$$\tilde{\epsilon} = \epsilon + \lambda\, \epsilon_{\mathrm{sg}}$$

where $\epsilon$ is the sampled Gaussian noise, $\lambda$ is the rectification weight, and $\epsilon_{\mathrm{sg}}$ is a stop-gradient copy of the UNet output.
The final training objective is:

$$\mathcal{L} = \mathbb{E}\left[\, \big\| \tilde{\epsilon}^{\,p} - \hat{\epsilon}_t^{\,p} \big\|_2^2 \,\right]$$
The garment region incurs no backpropagated loss and thus solely provides context features (Na et al., 24 Nov 2025).
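The masked, DREAM-rectified objective can be sketched in NumPy; the mask layout, the `lam` value, and the `dream_masked_loss` helper are illustrative assumptions:

```python
import numpy as np

def dream_masked_loss(eps, eps_hat, eps_sg, person_mask, lam=0.5):
    """Epsilon-prediction loss with DREAM-style rectified target,
    restricted to the person region (mask = 1 on person, 0 on the
    concatenated garment/context region)."""
    target = eps + lam * eps_sg            # rectified target
    sq_err = (target - eps_hat) ** 2
    masked = sq_err * person_mask          # garment region excluded
    return masked.sum() / max(person_mask.sum(), 1.0)

rng = np.random.default_rng(2)
shape = (4, 8, 6)
eps, eps_hat = rng.standard_normal(shape), rng.standard_normal(shape)
mask = np.zeros(shape)
mask[:, :4, :] = 1.0                       # person region only
loss_a = dream_masked_loss(eps, eps_hat, eps_hat, mask)
# perturbing the prediction inside the garment region leaves the loss
# unchanged, since that region carries no supervision
eps_hat_b = eps_hat.copy()
eps_hat_b[:, 4:, :] += 100.0
loss_b = dream_masked_loss(eps, eps_hat_b, eps_hat_b, mask)
```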
4. Modified Guidance and Ground-Truth Injection
Modified Classifier-Free Guidance (CFG)
In CatVTON, the unconditional branch retained partial garment information. Re-CatVTON ensures a strictly unconditional branch by setting all garment-related latents to zero during the unconditional forward pass:

$$X_t^{\varnothing} = \big[\, z_t^p \oplus_{hw} \mathbf{0} \,\big] \oplus_c \big[\, \bar{z}_0^p \oplus_{hw} \mathbf{0} \,\big]$$
Guided noise prediction follows:

$$\hat{\epsilon}_t = \epsilon_\theta(X_t^{\varnothing}, t) + w \big( \epsilon_\theta(X_t, t) - \epsilon_\theta(X_t^{\varnothing}, t) \big)$$

where $w$ is the guidance scale (Na et al., 24 Nov 2025).
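The modified CFG combination can be sketched as follows; the garment-slice layout and the toy `eps_theta` predictor are assumptions for illustration:

```python
import numpy as np

def zero_garment(X_t, garment_rows):
    """Strictly unconditional input: zero all garment-related latents.
    Here the garment occupies a slice of the spatial (H) axis; the
    layout is an illustrative assumption."""
    X_u = X_t.copy()
    X_u[:, garment_rows, :] = 0.0
    return X_u

def guided_eps(eps_theta, X_c, X_u, t, w):
    # classifier-free guidance: unconditional prediction plus a scaled
    # step toward the garment-conditioned prediction
    e_c = eps_theta(X_c, t)
    e_u = eps_theta(X_u, t)
    return e_u + w * (e_c - e_u)

rng = np.random.default_rng(3)
X_t = rng.standard_normal((8, 16, 6))
X_u = zero_garment(X_t, slice(8, 16))   # garment half zeroed
eps_theta = lambda X, t: 0.1 * X        # toy noise predictor
g = guided_eps(eps_theta, X_t, X_u, t=5, w=2.5)
```

Note that $w = 0$ recovers the purely unconditional prediction and $w = 1$ the purely conditional one.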
Ground-Truth Garment Latent Injection
To prevent error accumulation from repeatedly denoising the garment region, at every step Re-CatVTON injects the ground-truth noisy garment latent obtained from the forward process:

$$z_t^g = \sqrt{\bar{\alpha}_t}\, z_0^g + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
This stabilizes the context branch by maintaining precise garment features throughout the reverse process (Na et al., 24 Nov 2025).
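A sketch of the injection step, assuming the standard DDPM forward process (the `inject_gt_garment` helper name is hypothetical):

```python
import numpy as np

def inject_gt_garment(z0_g, t, alphas_bar, rng):
    """Replace the sampler-propagated garment latent with a fresh
    forward-process sample of the clean garment latent at level t."""
    eps = rng.standard_normal(z0_g.shape)
    return np.sqrt(alphas_bar[t]) * z0_g + np.sqrt(1.0 - alphas_bar[t]) * eps

rng = np.random.default_rng(4)
z0_g = rng.standard_normal((4, 8, 6))
alphas_bar = np.array([1.0, 0.7, 0.4])
z_t_g = inject_gt_garment(z0_g, t=2, alphas_bar=alphas_bar, rng=rng)
```

At $\bar{\alpha}_t = 1$ (no noise) the injected latent equals the clean garment latent exactly; at lower $\bar{\alpha}_t$ it matches the noise level of the person latent being denoised.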
5. Empirical Performance and Comparative Metrics
Quantitative results on VITON-HD and DressCode datasets are summarized below.
| Method | FID↓ | KID↓ | SSIM↑ | LPIPS↓ | FID_unp↓ | KID_unp↓ |
|---|---|---|---|---|---|---|
| CatVTON | 5.888 | 0.513 | 0.870 | 0.061 | 9.015 | 1.091 |
| Leffa | 4.540 | 0.050 | 0.899 | 0.048 | 8.520 | 0.320 |
| Re-CatVTON | 4.438 | 0.010 | 0.880 | 0.047 | 8.266 | 0.517 |
VITON-HD, paired and unpaired results.
| Method | FID↓ | KID↓ | SSIM↑ | LPIPS↓ | FID_unp↓ | KID_unp↓ |
|---|---|---|---|---|---|---|
| CatVTON | 3.992 | 0.818 | 0.892 | 0.046 | 6.137 | 1.403 |
| Leffa | 2.060 | 0.070 | 0.924 | 0.031 | 4.480 | 0.620 |
| Re-CatVTON | 2.175 | 0.062 | 0.914 | 0.031 | 4.310 | 0.628 |
DressCode, paired and unpaired results.
Re-CatVTON closes the gap to Leffa—a dual-UNet model—with only a marginal SSIM cost relative to CatVTON, while outperforming all single-UNet baselines on FID, KID, and LPIPS (Na et al., 24 Nov 2025).
6. Computational Efficiency and Resource Requirements
| Method | Params (M) | GFLOPs | Latency (s) | Peak VRAM (GB) |
|---|---|---|---|---|
| OOTDiffusion | 2229.7 | 1225.2 | 1.5 | 5.93 |
| IDM-VTON | 7003.3 | 2679.5 | 6.6 | 14.62 |
| Leffa | 1802.7 | 1012.0 | 2.7 | 3.91 |
| CatVTON | 859.5 | 974.0 | 1.3 | 2.26 |
| Re-CatVTON | 859.5 | 974.0 | 1.3 | 2.26 |
Evaluated on 512×384 images, H200 GPU, batch size 1, FP16.
Re-CatVTON matches CatVTON’s lightweight computational cost while delivering fidelity approaching that of heavyweight dual-UNet architectures. Its efficiency stems from a single-UNet design, use of spatial concatenation for multimodal conditioning, and loss targeting that avoids backpropagation in the garment area (Na et al., 24 Nov 2025).
7. Limitations and Prospective Improvements
Identified limitations include:
- Marginal decrease in SSIM when compared with the highest-performing dual-UNet model.
- The model inherits dependence on mask accuracy and may exhibit minor artifacts in complex garment–person interactions.
- The reconstructed garment region is conditioned solely through spatial concatenation and ground-truth latent injection; rare pattern generalization and localization under extreme deformations or atypical garment types remain challenges.
Proposed future improvements (as suggested in Chong et al., 2024) involve hybrid VAE bottlenecks, joint mask/image refinement, and compact warping-guided modules.
Re-CatVTON establishes a new efficiency–performance trade-off for single-UNet VTON models and demonstrates that, with appropriate architectural decisions and loss formulation, dual-UNet computational overhead is avoidable for state-of-the-art VTON synthesis (Na et al., 24 Nov 2025, Chong et al., 2024).