
Re-CatVTON: Efficient Diffusion VTON Model

Updated 1 December 2025
  • Re-CatVTON is a diffusion-based virtual try-on model that leverages a single-UNet design with precise context feature conditioning and ground-truth garment injection.
  • The model resolves context and denoising conflicts via spatial concatenation and selective loss exclusion, enhancing image realism and mitigating computational overhead.
  • It achieves competitive FID, KID, and LPIPS scores on VITON-HD and DressCode benchmarks while delivering rapid inference and resource efficiency.

Re-CatVTON is a diffusion-based virtual try-on (VTON) model that advances the efficiency-performance boundary for synthesizing images of people wearing target garments. It is based on rigorous analysis of context feature conditioning and adopts a fundamentally single-UNet design, incorporating mechanisms to address context/denoising conflicts inherent to spatial concatenation. Re-CatVTON delivers superior or competitive image realism and fidelity versus prior methods, with a resource profile comparable to the most efficient single-UNet models while closing the gap with much heavier dual-UNet architectures (Na et al., 24 Nov 2025).

1. Motivation and Design Rationale

Virtual try-on aims to generate images of a person ($I_p$) wearing a specified garment ($I_g$). Dual-UNet diffusion models (e.g., Leffa, IDM-VTON) leverage a reference UNet to encode garment context and a try-on UNet for denoising, achieving high fidelity through explicit context fusion at each block, but at significant computational and memory cost.

Re-CatVTON hypothesizes that such context injection can be reframed via single-UNet mechanisms without sacrificing quality if conditioning conflicts are resolved. Three key theoretical insights motivate the architecture:

  1. Functional mismatch: Repurposing a pretrained denoising UNet as a context encoder without fine-tuning results in suboptimal capacity allocation.
  2. Time-aligned context: Conditioning must account for the frequency-band recovery schedule of diffusion, distributing timestep encoding across both denoising and context regions.
  3. Garment region loss exclusion: Penalizing noise estimation in the concatenated garment region is counterproductive when it should provide conditioning, not be denoised.

These insights lead to a single-UNet architecture with spatial concatenation, context-region loss exclusion, and ground-truth latent injection.

2. Architectural Structure and Conditioning Workflow

Re-CatVTON utilizes a single diffusion UNet $\epsilon_\theta$ for both noise prediction and context feature encoding. The input construction at each timestep $t$ is:

$$X_t = \mathrm{Concat}_{\mathrm{ch}}\big(z_t^p \,\Vert\, z_t^g,\; \text{masked garment region of } z_0^p,\; z_0^p \,\Vert\, z_0^g\big)$$

where:

  • $z_t^p$, $z_t^g$ are the noisy VAE latents of the person and garment,
  • $z_0^p$, $z_0^g$ are the corresponding clean VAE latents,
  • $\Vert$ denotes spatial concatenation along the channel or spatial (H/W) axes as required.

Each denoising step can be represented as:

```
# Construct X_t as above; may use the GT-injected z_t^g (see Sec. 4).
epsilon_hat_t = epsilon_theta(X_t, t)
z_{t-1} = sampler(z_t, epsilon_hat_t, t)
```
(Na et al., 24 Nov 2025).

The entire UNet is fine-tuned for VTON, in contrast to CatVTON, where only self-attention was adapted (Chong et al., 2024).
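As an illustrative sketch of the spatial-concatenation input, the following NumPy snippet assembles the three streams (noisy, masked, clean) into one tensor. The 4-channel latent shape, left/right person-garment layout, and helper name `build_input` are assumptions for illustration, not the paper's exact layout:

```python
import numpy as np

def build_input(z_t_p, z_t_g, mask, z0_p, z0_g):
    """Assemble the UNet input X_t for one denoising step.

    z_t_p, z_t_g : noisy person/garment latents, shape (B, 4, H, W)
    mask         : garment-region mask on the person latent, (B, 1, H, W)
    z0_p, z0_g   : clean VAE latents of person and garment, (B, 4, H, W)
    """
    # Spatial concatenation along width: person latent left, garment right.
    noisy = np.concatenate([z_t_p, z_t_g], axis=-1)                      # (B, 4, H, 2W)
    masked = np.concatenate([mask * z0_p, np.zeros_like(z0_g)], axis=-1)  # (B, 4, H, 2W)
    clean = np.concatenate([z0_p, z0_g], axis=-1)                        # (B, 4, H, 2W)
    # Channel-wise concatenation of the three streams.
    return np.concatenate([noisy, masked, clean], axis=1)                # (B, 12, H, 2W)

B, C, H, W = 1, 4, 64, 48
z = lambda: np.random.randn(B, C, H, W)
x_t = build_input(z(), z(), np.ones((B, 1, H, W)), z(), z())
print(x_t.shape)  # (1, 12, 64, 96)
```

The single UNet then consumes this widened, channel-stacked canvas, so garment context flows to the person half through ordinary self-attention rather than a second reference network.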

3. Context Feature Learning and Loss Formulation

Re-CatVTON explicitly addresses the context-encoding/denoising functional mismatch:

  • The model excludes the noise-prediction loss in the garment (context) region, focusing all $\epsilon$-prediction residuals on the person/outfit region. Mathematically, if $\hat\epsilon_t = \hat\epsilon_t^p \,\Vert\, \hat\epsilon_t^g$ is the UNet output, only $\hat\epsilon_t^p$ is supervised.

The loss integrates the DREAM-style rectified target for the person region:

$$\bar\epsilon_{\mathrm{dream},t}^p = \bar\epsilon_t^p + \omega_t(\lambda)\left(\bar\epsilon_t^p - \epsilon_{\theta,\mathrm{sg}}^p\right)$$

where $\omega_t(\lambda) = (1 - \bar\alpha_t)^{\lambda/2}$, $\lambda = 10$, and $\epsilon_{\theta,\mathrm{sg}}^p$ is a stop-gradient copy of the UNet output.

The final training objective is:

$$\mathcal{L} = \mathbb{E}_{z_0, \bar\epsilon_t, t, c}\left[\left\|\bar\epsilon_{\mathrm{dream},t}^p - \epsilon_\theta^p(\tilde z_t, t, c)\right\|_2^2\right]$$

The garment region incurs no backpropagated loss and thus solely provides context features (Na et al., 24 Nov 2025).
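A minimal NumPy sketch of this masked DREAM objective follows. The person-left/garment-right split and the function name are illustrative assumptions; in a real training framework the rectified target would be built from a stop-gradient copy of the prediction, which plain NumPy mimics by treating the prediction as a constant:

```python
import numpy as np

def dream_masked_loss(eps_bar, eps_pred, alpha_bar_t, lam=10.0):
    """Noise-prediction loss over the person half only.

    eps_bar, eps_pred : sampled noise / UNet prediction over the full
                        concatenated canvas (person left, garment right;
                        hypothetical spatial layout).
    alpha_bar_t       : cumulative noise-schedule coefficient at step t.
    """
    W = eps_pred.shape[-1] // 2
    eps_bar_p, eps_pred_p = eps_bar[..., :W], eps_pred[..., :W]
    w = (1.0 - alpha_bar_t) ** (lam / 2.0)            # omega_t(lambda)
    # DREAM-rectified target; eps_pred_p stands in for the stop-grad copy.
    target = eps_bar_p + w * (eps_bar_p - eps_pred_p)
    return np.mean((target - eps_pred_p) ** 2)        # garment half: no loss
```

Because the garment half is sliced away before the residual is formed, no gradient ever reaches it, leaving that region free to serve purely as context.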

4. Modified Guidance and Ground-Truth Injection

Modified Classifier-Free Guidance (CFG)

In CatVTON, the unconditional branch retained partial garment information. Re-CatVTON ensures a strictly unconditional branch by setting all garment-related latents to zero during the unconditional forward pass:

$$X_t^{\mathrm{uncond}} = \mathrm{Concat}_{\mathrm{ch}}\big(z_t^p \,\Vert\, 0,\; M \odot 0,\; z_0^p \,\Vert\, 0\big)$$

Guided noise prediction follows:

$$\hat\epsilon_t = \epsilon_\theta(X_t^{\mathrm{uncond}}, t) + \omega\left[\epsilon_\theta(X_t, t) - \epsilon_\theta(X_t^{\mathrm{uncond}}, t)\right]$$

where $\omega$ is the guidance scale (Na et al., 24 Nov 2025).
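The two-branch guidance step can be sketched as below. The 12-channel layout (noisy | masked | clean streams, person left, garment right) is a hypothetical convention carried over for illustration; only the zeroing pattern and the guidance formula follow the text:

```python
import numpy as np

def make_uncond(x_t, latent_ch=4):
    """Build X_t^uncond by zeroing every garment-related component."""
    x = x_t.copy()
    W = x.shape[-1] // 2
    x[..., W:] = 0.0                          # garment spatial halves -> 0
    x[:, latent_ch:2 * latent_ch] = 0.0       # masked stream -> M ⊙ 0
    return x

def guided_eps(eps_theta, x_t, t, omega):
    """Classifier-free guidance with a strictly garment-free branch."""
    e_u = eps_theta(make_uncond(x_t), t)
    e_c = eps_theta(x_t, t)
    return e_u + omega * (e_c - e_u)
```

With $\omega = 1$ the guided prediction reduces to the conditional branch; larger $\omega$ extrapolates away from the garment-free prediction.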

Ground-Truth Garment Latent Injection

To prevent error accumulation from repeatedly denoising the garment region, at every step Re-CatVTON injects the ground-truth noisy garment latent:

$$\bar z_t^g = \sqrt{\bar\alpha_t}\, z_0^g + \sqrt{1 - \bar\alpha_t}\, \epsilon^g, \qquad \epsilon^g \sim \mathcal{N}(0, I)$$

This stabilizes the context branch by maintaining precise garment features throughout the reverse process (Na et al., 24 Nov 2025).
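This is the standard forward (noising) process applied to the clean garment latent at each reverse step; a direct NumPy transcription of the equation above (function name assumed for illustration):

```python
import numpy as np

def gt_garment_latent(z0_g, alpha_bar_t, rng=None):
    """Forward-noise the clean garment latent to level t, replacing the
    sampler's own garment-half state so context errors cannot accumulate."""
    rng = rng or np.random.default_rng()
    eps_g = rng.standard_normal(z0_g.shape)        # eps^g ~ N(0, I)
    return np.sqrt(alpha_bar_t) * z0_g + np.sqrt(1.0 - alpha_bar_t) * eps_g
```

At $\bar\alpha_t \to 1$ (low noise) the injected latent approaches the clean garment exactly, so the context branch always sees a correctly noised version of the ground truth rather than its own reconstruction.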

5. Empirical Performance and Comparative Metrics

Quantitative results on VITON-HD and DressCode datasets are summarized below.

| Method | FID↓ | KID↓ | SSIM↑ | LPIPS↓ | FID_unp↓ | KID_unp↓ |
|---|---|---|---|---|---|---|
| CatVTON | 5.888 | 0.513 | 0.870 | 0.061 | 9.015 | 1.091 |
| Leffa | 4.540 | 0.050 | 0.899 | 0.048 | 8.520 | 0.320 |
| Re-CatVTON | 4.438 | 0.010 | 0.880 | 0.047 | 8.266 | 0.517 |

VITON-HD, paired and unpaired results.

| Method | FID↓ | KID↓ | SSIM↑ | LPIPS↓ | FID_unp↓ | KID_unp↓ |
|---|---|---|---|---|---|---|
| CatVTON | 3.992 | 0.818 | 0.892 | 0.046 | 6.137 | 1.403 |
| Leffa | 2.060 | 0.070 | 0.924 | 0.031 | 4.480 | 0.620 |
| Re-CatVTON | 2.175 | 0.062 | 0.914 | 0.031 | 4.310 | 0.628 |

DressCode, paired and unpaired results.

Re-CatVTON closes the gap to Leffa—a dual-UNet model—with only a marginal SSIM cost relative to CatVTON, while outperforming all single-UNet baselines on FID, KID, and LPIPS (Na et al., 24 Nov 2025).

6. Computational Efficiency and Resource Requirements

| Method | Params (M) | GFLOPs | Latency (s) | Peak VRAM (GB) |
|---|---|---|---|---|
| OOTDiffusion | 2229.7 | 1225.2 | 1.5 | 5.93 |
| IDM-VTON | 7003.3 | 2679.5 | 6.6 | 14.62 |
| Leffa | 1802.7 | 1012.0 | 2.7 | 3.91 |
| CatVTON | 859.5 | 974.0 | 1.3 | 2.26 |
| Re-CatVTON | 859.5 | 974.0 | 1.3 | 2.26 |

Evaluated on 512×384 images, H200 GPU, batch size 1, FP16.

Re-CatVTON matches CatVTON’s lightweight computational cost while delivering fidelity approaching that of heavyweight dual-UNet architectures. Its efficiency stems from a single-UNet design, use of spatial concatenation for multimodal conditioning, and loss targeting that avoids backpropagation in the garment area (Na et al., 24 Nov 2025).

7. Limitations and Prospective Improvements

Identified limitations include:

  • Marginal decrease in SSIM when compared with the highest-performing dual-UNet model.
  • The model inherits dependence on mask accuracy and may exhibit minor artifacts in complex garment–person interactions.
  • The reconstructed garment region is conditioned solely through spatial concatenation and ground-truth latent injection; rare pattern generalization and localization under extreme deformations or atypical garment types remain challenges.

Proposed future improvements, as suggested in (Chong et al., 2024), involve hybrid VAE bottlenecks, joint mask/image refinement, and compact warping-guided modules.

Re-CatVTON establishes a new efficiency–performance trade-off for single-UNet VTON models and demonstrates that, with appropriate architectural decisions and loss formulation, dual-UNet computational overhead is avoidable for state-of-the-art VTON synthesis (Na et al., 24 Nov 2025, Chong et al., 2024).
