Clothing Appearance Drift
- Clothing appearance drift is the unwanted migration or distortion of garment boundaries, textures, or semantics in synthetic or captured imagery.
- Techniques like dual-path warping, limb-aware modules, and layered representations actively mitigate drift by aligning shapes and preserving textures.
- Evaluation metrics such as SSIM, PSNR, and FID quantify improvements, enhancing virtual try-on, avatar animation, and person re-identification applications.
Clothing appearance drift denotes the unwanted migration or distortion of garment boundaries, textures, or semantics across spatial, temporal, or generative axes in synthetic or captured imagery. This phenomenon arises in virtual try-on, avatar animation, monocular capture, and person re-identification, manifesting as sliding patterns, bleeding silhouettes, misaligned logos, or deformation of garment shapes. State-of-the-art approaches systematically decompose, model, and constrain these sources of drift, employing loss functions, architectural modularity, data augmentation, and semantic disentanglement to achieve robust appearance fidelity.
1. Taxonomy and Manifestation of Clothing Appearance Drift
Clothing appearance drift is classified by its phenomenology and evaluation metrics. In image-based try-on, shape drift involves discrepancy between the synthesized garment outline and the person’s true silhouette, quantifiable by misaligned-pixel count and mask IoU between the warped garment mask and ground-truth (Han et al., 21 Apr 2025). Texture drift occurs when local embroidery, logos, or patterning are smeared, shifted, or partially erased in generation or transfer. In dynamic monocular capture, temporal drift aggregates over video frames as cumulative deviation in mesh geometry or UV texture (e.g., sliding patterns, mesh jitter) (Xiang et al., 2020). For long-term re-identification, drift incorporates semantic and distributional garment changes, requiring shape-appearance disentanglement (Nguyen et al., 2024).
Typical quantitative metrics include SSIM and PSNR for structural and perceptual fidelity, FID or KID for generative realism, and LPIPS for perceptual similarity on high-frequency regions. As illustrated by SCW-VTON, methods attaining low drift achieve higher SSIM and PSNR on warped garments and lower FID on try-on images, outperforming baselines in both paired and unpaired evaluations (Han et al., 21 Apr 2025).
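The shape-drift measures named above (mask IoU and misaligned-pixel count between a warped garment mask and the ground truth) are straightforward to compute. The following is a minimal numpy sketch; the function names are illustrative, not from any cited codebase.

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between a warped garment mask and the ground-truth mask."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 1.0

def misaligned_pixel_count(pred_mask: np.ndarray, gt_mask: np.ndarray) -> int:
    """Number of pixels where the warped garment and ground truth disagree (XOR)."""
    return int(np.logical_xor(pred_mask.astype(bool), gt_mask.astype(bool)).sum())
```

Lower misaligned-pixel counts and higher IoU indicate less shape drift between the warped garment and the person's true silhouette.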
2. Explicit Drift Modeling and Suppression Architectures
Suppressing drift requires explicit modeling and architectural modularity:
- Dual-path warping / progressive flow: SCW-VTON introduces a shape-guided and flow-guided dual branch. The shape path predicts a body-aligned garment map, and cross-attention computes pixel correspondences to minimize misalignment. The flow path regresses a dense deformation field, ensuring textures conform to the evolved silhouette (Han et al., 21 Apr 2025). PL-VTON uses affine pre-alignment followed by multi-scale flow aggregation (GRU) to avoid global misplacement and local tearing (Zhang et al., 18 Mar 2025).
- Limb-aware and semantic parsing modules: Limb-aware fusing, semantic region parsing (PPE), and non-limb prior maps localize garment boundaries and prevent texture bleeding into limbs or body. Fine segmentation losses preserve sharp silhouettes and reduce region-crossing drift (Zhang et al., 18 Mar 2025).
- Static-dynamic decomposition: PLTON decomposes rendered appearance into a static HF-Map (high-frequency Laplacian of the warped garment) and a dynamic token extracted from garment embedding (via CLIP). The static extractor ensures preservation of fine logos and embroidery, while the dynamic extractor steers diffusion-based generative modules to produce pose, lighting, and wrinkle-adaptive shading (Zang et al., 2024).
- Layered representations: LayGA employs a two-stage Gaussian map training; the first stage reconstructs a clean, smooth surface and segments body and cloth. The second stage fits separate body and cloth layers with collision losses, enforcing persistent body-cloth separation and preventing sliding or interpenetration during animation (Lin et al., 2024).
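The flow-guided branches above all rely on the same core operation: backward-warping garment texture through a dense deformation field. A minimal numpy sketch of that operation (bilinear sampling with border clamping; the actual systems use learned, multi-scale flows inside GPU frameworks):

```python
import numpy as np

def warp_with_flow(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp an (H, W, C) image with a dense (H, W, 2) flow field.

    Each target pixel (y, x) samples the source at
    (y + flow[y, x, 1], x + flow[y, x, 0]) with bilinear interpolation;
    out-of-bounds source coordinates are clamped to the border.
    """
    H, W = image.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    sx = np.clip(xs + flow[..., 0], 0, W - 1)
    sy = np.clip(ys + flow[..., 1], 0, H - 1)
    x0 = np.floor(sx).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(sy).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx = (sx - x0)[..., None]  # horizontal interpolation weight, (H, W, 1)
    wy = (sy - y0)[..., None]  # vertical interpolation weight, (H, W, 1)
    img = image.astype(np.float64)
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

When the predicted flow misaligns with the evolved silhouette, this sampling drags texture across the garment boundary, which is exactly the texture drift the dual-path designs constrain.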
3. Loss Functions and Constraint Mechanisms
Multiple losses are designed to minimize appearance drift across tasks:
- Gravity-aware and edge-weighted losses: PL-VTON’s gravity-aware edge loss decays the penalty from shoulders to hem, anchoring the cloth top and permitting realistic drape at the hem, counteracting “slippage” (Zhang et al., 18 Mar 2025).
- Cross-attention and perceptual losses: Dual-path warpers rely on VGG-19 perceptual losses on both shape and texture (Han et al., 21 Apr 2025). Mask losses penalize deviation in garment boundaries.
- Semantic layout and segmentation losses: Cross-entropy over semantic classes preserves region-specific label integrity, maintaining the garment’s spatial layout (Han et al., 21 Apr 2025, Zhang et al., 18 Mar 2025).
- UV texture growing and temporal smoothness: In monocular video, UV growing functions as a per-vertex photometric anchor, preventing cumulative pattern drift (Xiang et al., 2020). Batch temporal optimization penalizes variance in shape coefficients across frames.
- Collision and layer separation losses: LayGA enforces geometric integrity by penalizing overlap and clamping rendering displacement, preserving tangential garment slide while avoiding body penetrance (Lin et al., 2024).
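To make the gravity-aware idea concrete, here is a minimal numpy sketch of an edge loss whose per-row weight decays from the garment top toward the hem. The linear decay and the weight values are illustrative assumptions; PL-VTON's actual formulation may differ in detail.

```python
import numpy as np

def gravity_aware_edge_loss(pred_edge: np.ndarray, gt_edge: np.ndarray,
                            top_weight: float = 1.0,
                            bottom_weight: float = 0.2) -> float:
    """L1 edge loss with a per-row weight that decays linearly from the
    shoulders (anchored, high penalty) to the hem (free to drape, low penalty).

    pred_edge, gt_edge: (H, W) edge/boundary maps of the garment region.
    """
    H = pred_edge.shape[0]
    row_weights = np.linspace(top_weight, bottom_weight, H)[:, None]  # (H, 1)
    return float(np.mean(row_weights * np.abs(pred_edge - gt_edge)))
```

The same boundary error therefore costs more near the shoulders than near the hem, anchoring the cloth top while permitting realistic drape below.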
4. Semantic Disentanglement and Data Augmentation
Recent systems explicitly disentangle garment cues:
- Factor decomposition: TryOn-Adapter formulates clothing identity as three factors: style, texture, and structure. Separate adapter modules inject CLIP-derived style, high-frequency texture, and segmentation-based structure into a frozen diffusion backbone, producing high-fidelity identity preservation while minimizing resource requirements (Xing et al., 2024).
- Contrastive augmentation: CCPA for person re-ID swaps appearance and shape codes across identities, augmenting the training set with all pairwise combinations (Nguyen et al., 2024). Fine-grained contrastive losses ensure that embeddings of the same person under different clothing remain proximate, while embeddings of different persons in similar outfits are separated.
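The two pieces of the CCPA recipe above can be sketched in a few lines: cross-identity recombination of appearance and shape codes, and a contrastive objective that pulls same-identity embeddings together and pushes different identities apart. This is a simplified illustration (a margin-based pair loss on placeholder code dictionaries), not CCPA's exact formulation.

```python
import numpy as np

def recombine_codes(samples: list) -> list:
    """Cross-identity augmentation: pair every appearance code with every
    shape code, yielding N*N synthetic (appearance, shape) combinations."""
    return [(a["appearance"], s["shape"]) for a in samples for s in samples]

def contrastive_pair_loss(z1: np.ndarray, z2: np.ndarray,
                          same_identity: bool, margin: float = 1.0) -> float:
    """Pull embeddings of the same person (under different clothing) together;
    push different identities apart by at least `margin`."""
    d = float(np.linalg.norm(z1 - z2))
    if same_identity:
        return d ** 2            # penalize any distance between same-identity pairs
    return max(0.0, margin - d) ** 2  # penalize only pairs closer than the margin
```

With such losses, a garment change shifts the appearance code but leaves the identity embedding stable, which is what makes the representation robust to clothing drift over time.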
5. Representative Methods: Comparative Summary
| Approach | Core Drift Mitigation | Distinctive Loss/Module |
|---|---|---|
| SCW-VTON (Han et al., 21 Apr 2025) | Dual-path shape and flow warping | Shape-guided cross-attention; limb reconstruction |
| PL-VTON (Zhang et al., 18 Mar 2025) | Progressive multi-scale warping | Gravity-aware edge loss; semantic parsing |
| PLTON (Zang et al., 2024) | Static-dynamic rendering | CLIP-based dynamic tokens; HF-Map preservation |
| LayGA (Lin et al., 2024) | Layered Gaussian avatars | Collision/layer loss; Laplacian post-processing |
| TryOn-Adapter (Xing et al., 2024) | Identity factor adapters | Style, Texture, Structure modules; T-RePaint |
| MonoClothCap (Xiang et al., 2020) | Statistical deformation + UV | UV growing; temporal batch smoothing |
| CCPA Re-ID (Nguyen et al., 2024) | Augmentation + contrastive loss | R-GAT shape encoder; fine-grained losses |
Key efficacy metrics include SSIM, LPIPS, FID/KID, and user study preferences. TryOn-Adapter attains $0.069$ LPIPS (lowest among tested methods), $0.897$ SSIM (highest), and $8.62$ FID (lowest unpaired) while requiring only about half the tunable parameters of other diffusion-based baselines (Xing et al., 2024).
6. Ablation Results and Interpretive Insights
Ablation studies confirm causal efficacy:
- Removing cross-attention (SCW-VTON*) increases pixel drift, lowering SSIM/PSNR and raising FID (Han et al., 21 Apr 2025).
- Excluding limb reconstruction elevates FID and degrades limb texture fidelity.
- In LayGA, disabling multi-layer separation reintroduces garment sliding and drift artifacts, whereas post-processing geometric smoothing eliminates test-time collision (Lin et al., 2024).
- In TryOn-Adapter, adapters (style, texture, structure) each individually lower LPIPS and FID, with full composition yielding the highest SSIM and lowest drift (Xing et al., 2024).
- PLTON’s two-stage denoising approach stabilizes both logo placement and wrinkle realism, as indicated by LPIPS/SSIM sweeps (Zang et al., 2024).
These results collectively suggest that explicit architectural, loss, and representation stratification directly curtail drift, and that modular injection of domain-specific cues (semantic, geometric, high-frequency, data-augmented) is essential for robust appearance preservation.
7. Broader Implications and Applications
Clothing appearance drift remains a central technical barrier in virtual try-on, animation, video-based capture, and long-term biometric identification. Advances in explicit shape-guided warping, layered representations, contrastive embedding and identity factorization have demonstrated practical reduction of drift across image, video, and generative modalities. Ongoing research exploits modular design, task-specific loss engineering, large pretrained backbones, and data augmentation to generalize across identities, poses, and environments, ensuring persistent, interpretable garment fidelity in synthetic human modeling systems.