Cross-Instance Texture Transfer
- Cross-instance texture transfer is the process of extracting texture from one instance and applying it to another while preserving geometric structure.
- Recent methods employ UV mapping, semantic alignment, and feature-space transport to enable high-fidelity, and in some cases real-time, texture reapplication.
- Key applications include virtual try-on, medical imaging harmonization, and 3D scene editing, all of which demand robust, instance-agnostic transfer.
Cross-instance texture transfer refers to the process of extracting texture information from one object or material instance and applying it to another, typically while preserving the recipient’s structural or geometric properties. This task arises in computer graphics, vision, and medical imaging, addressing visual coherence, semantic alignment, and high-fidelity editing across object categories, imaging modalities, or 2D/3D representations. Robust cross-instance texture transfer must disentangle texture from shape and structure, maintain consistency under varying spatial supports, and often generalize to novel object or image instances within or across categories.
1. Foundational Problem Setting and Key Applications
Cross-instance texture transfer has emerged as a response to limitations in single-instance or per-category style transfer, aiming for semantic alignment of texture across varying objects, anatomies, or views. In 3D graphics, this enables the propagation of visually descriptive surface detail from one mesh to another (e.g., retargeting clothing prints or photorealistic face textures) (Mir et al., 2020, Cohen-Bar et al., 20 Mar 2025, Song et al., 27 Aug 2025, Chen et al., 2022), as well as editing 3D objects from a single reference view (Cao et al., 24 Mar 2025). In the medical imaging domain, texture transfer addresses quantification variability from inter-scanner or cross-protocol differences, enhancing reconstruction and harmonization (Hu, 2021, Uhm et al., 24 Sep 2025). In general image or video editing, the goal is often the faithful transfer of material, pattern, or style characteristics from a reference object to one or more target objects while strictly preserving underlying geometry (Huang et al., 4 Dec 2025).
Key applications include:
- Virtual try-on and avatar personalization (XR/gaming) (Mir et al., 2020, Song et al., 27 Aug 2025)
- Medical protocol harmonization and super-resolution (Hu, 2021, Uhm et al., 24 Sep 2025)
- Automated 3D scene population with varied instances (Chen et al., 2022, Cohen-Bar et al., 20 Mar 2025)
- Controllable image/video object retexturing (Huang et al., 4 Dec 2025)
- Single-shot or limited-data content creation and domain adaptation (Rodriguez-Pardo et al., 2021)
2. Core Methodological Paradigms
Approaches to cross-instance texture transfer can be broadly categorized according to representation, alignment, and transfer mechanism:
A. UV-parameterization and alignment:
Several frameworks construct, learn, or regularize UV mappings to establish semantic correspondence between surface points of different 3D meshes. AUV-Net, for instance, learns a shared UV space across a category by embedding each mesh surface through MLP-based mappings and unsupervised subspace alignment (Chen et al., 2022). This strategy allows part-level correspondences (e.g., aligning all car wheels in the same UV location), and enables texture transfer by direct atlas “swapping.”
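Once two meshes share an aligned UV space, the transfer step itself reduces to sampling the donor's texture atlas with the recipient's UV coordinates. A minimal sketch of this lookup (function and array names are illustrative, not AUV-Net's API):

```python
import numpy as np

def swap_atlas(recipient_uvs, donor_atlas):
    """Look up colors for the recipient's surface points in the donor's
    texture atlas. Because the UV space is shared and aligned across the
    category, no correspondence search is needed at transfer time.

    recipient_uvs: (N, 2) array of UV coordinates in [0, 1)
    donor_atlas:   (H, W, 3) texture image
    """
    h, w, _ = donor_atlas.shape
    # Nearest-texel lookup; a production renderer would sample bilinearly.
    rows = np.clip((recipient_uvs[:, 1] * h).astype(int), 0, h - 1)
    cols = np.clip((recipient_uvs[:, 0] * w).astype(int), 0, w - 1)
    return donor_atlas[rows, cols]
```

Swapping atlases between two instances then amounts to calling this with one mesh's UVs and the other mesh's texture image.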
B. Semantic feature and triplane-based correspondence:
TriTex leverages a triplane architecture, projecting semantic descriptors derived from pretrained feature extractors onto aligned feature planes, followed by MLP-based color regression. By sharing the volumetric texture field and its decoder, it achieves cross-instance transfer after normalizing the target mesh and extracting the corresponding semantic features (Cohen-Bar et al., 20 Mar 2025).
C. Dense correspondence and silhouette-driven mapping:
Pix2Surf withholds color from both its input and intermediate layers, training a convolutional network to map 2D garment silhouettes to UV coordinates from shape information alone; high-fidelity texture transfer is then achieved by applying the predicted mapping to resample the 2D input image (Mir et al., 2020).
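The resampling step can be sketched as bilinear sampling of the source photo at predicted pixel coordinates. This is a hypothetical standalone illustration: the learned silhouette-to-UV network is replaced by a given coordinate array.

```python
import numpy as np

def bilinear_resample(image, coords):
    """Sample `image` at continuous (x, y) pixel coordinates `coords` (N, 2),
    as one would when warping a garment photo onto predicted UV positions.

    image:  (H, W, C) array
    coords: (N, 2) array of (x, y) positions in pixel units
    """
    h, w = image.shape[:2]
    # Clamp so the +1 neighbors stay inside the image.
    x = np.clip(coords[:, 0], 0, w - 1.0001)
    y = np.clip(coords[:, 1], 0, h - 1.0001)
    x0, y0 = x.astype(int), y.astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    # Blend the four surrounding texels.
    top = image[y0, x0] * (1 - wx)[:, None] + image[y0, x1] * wx[:, None]
    bot = image[y1, x0] * (1 - wx)[:, None] + image[y1, x1] * wx[:, None]
    return top * (1 - wy)[:, None] + bot * wy[:, None]
```

Because the sampling is differentiable in `coords`, the same operation can sit at the end of a trainable mapping network.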
D. Barycentric UV conversion:
For fixed-correspondence cases, a barycentric-coordinate framework precomputes the affine mapping between source and target UV triangles, consolidating the transfer into a single sparse “gather” operation. This eliminates per-triangle blending artifacts and achieves speedups of roughly 7,000× when transferring textures between registered meshes (Song et al., 27 Aug 2025).
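A sketch of the idea, under the simplifying assumption that the precomputed correspondence has already been resolved to one source texel per target texel (the actual framework resolves each target texel's triangle and barycentric weights; all names here are illustrative):

```python
import numpy as np

def precompute_gather(target_to_source_uv, src_h, src_w):
    """One-time setup: convert per-target-texel source UVs (T, 2) into
    flat indices into the source atlas. In the full method these UVs come
    from the precomputed affine map between UV triangles."""
    cols = np.clip((target_to_source_uv[:, 0] * src_w).astype(int), 0, src_w - 1)
    rows = np.clip((target_to_source_uv[:, 1] * src_h).astype(int), 0, src_h - 1)
    return rows * src_w + cols

def apply_gather(source_texture, flat_idx, out_h, out_w):
    """Run-time transfer: a single gather over the flattened source atlas,
    with no per-triangle rasterization or blending."""
    flat = source_texture.reshape(-1, source_texture.shape[-1])
    return flat[flat_idx].reshape(out_h, out_w, -1)
```

The index array is computed once per mesh pair and then reused for every texture, which is where the large speedup over per-triangle affine blending comes from.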
E. Feature space transport and alignment:
Methods such as Optimal Textures operate in the latent or feature space of deep autoencoders, using optimal transport (OT) between neural activation PDFs to map output feature samples to those of an exemplar, iteratively aligning their marginalized distributions via random projections (“slices”). This unifies color and style transfer under one OT-based process and supports user-guided targeting through masking and resampling (Risser, 2020).
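One slice of the OT matching can be sketched directly: 1D optimal transport between equal-sized samples is just a sort, so each iteration projects both feature sets onto a random direction and matches ranked projections. This is a simplified single-slice illustration of the idea, not the paper's implementation, which batches many orthogonal projections per step over autoencoder activations.

```python
import numpy as np

def sliced_ot_step(features, exemplar, rng):
    """Move `features` toward `exemplar` along one random slice.
    Both arrays are (N, d) with equal sample counts."""
    d = features.shape[1]
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)          # random unit direction
    p_f = features @ v              # 1D projections of the output features
    p_e = exemplar @ v              # 1D projections of the exemplar
    # Optimal transport in 1D is a sort: match ranked samples.
    order_f = np.argsort(p_f)
    target = np.sort(p_e)
    shift = np.empty_like(p_f)
    shift[order_f] = target - p_f[order_f]
    # Displace each feature along the slice direction.
    return features + np.outer(shift, v)
```

Iterating this over many random slices progressively aligns the full joint feature distributions, which is the mechanism unifying color and style transfer in that framework.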
F. Disentanglement and diffusion-based editing:
Refaçade addresses structure-texture leakage by first removing all appearance signals from the target via a learned texture remover and then shuffling the reference’s patches (“jigsaw permutation”) to suppress global layout information. Transfer occurs via diffusion models conditioned on disentangled geometry and stochastic, patch-wise texture cues, applicable to both images and videos (Huang et al., 4 Dec 2025).
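The jigsaw permutation itself is a simple operation; a minimal sketch, assuming image dimensions divisible by the patch size:

```python
import numpy as np

def jigsaw_permute(image, patch, rng):
    """Shuffle non-overlapping `patch` x `patch` tiles of `image`,
    destroying global layout while keeping local texture statistics,
    in the spirit of Refaçade's jigsaw permutation."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # Cut the image into a (gh * gw) stack of tiles.
    tiles = (image.reshape(gh, patch, gw, patch, c)
                  .swapaxes(1, 2)
                  .reshape(gh * gw, patch, patch, c))
    tiles = tiles[rng.permutation(gh * gw)]   # shuffle tile order
    # Reassemble the shuffled tiles into an image of the same size.
    return (tiles.reshape(gh, gw, patch, patch, c)
                 .swapaxes(1, 2)
                 .reshape(h, w, c))
```

The permuted reference keeps the donor's color and micro-texture statistics but no longer encodes its object layout, which is what suppresses structure leakage into the diffusion conditioning.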
G. Multi-reference attention and medical interpolation:
ACVTT introduces a multi-reference non-local attention mechanism that fuses high-res in-plane texture signals from multiple slices to enhance through-plane reconstruction in anisotropic CT volumes. Texture features from several references are fused at each encoder level based on adaptive relevance scores, allowing precise transfer of fine textural details across slices (Uhm et al., 24 Sep 2025).
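The fusion step can be illustrated as plain scaled dot-product attention from query features to the features concatenated from multiple reference slices. This is a minimal single-head sketch without ACVTT's learned projections or per-level adaptive relevance scores.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_reference_fusion(query, references):
    """Each query vector attends over all feature vectors from all
    reference slices; texture-rich, relevant references receive higher
    attention weight automatically.

    query:      (Nq, d) features of the slice being reconstructed
    references: list of (Nr, d) feature arrays, one per reference slice
    """
    keys = np.concatenate(references, axis=0)          # (sum Nr, d)
    scores = query @ keys.T / np.sqrt(query.shape[1])  # scaled dot product
    weights = softmax(scores, axis=1)                  # per-query attention
    return weights @ keys                              # weighted fusion
```

In the full model this fusion is applied at every encoder level, so fine in-plane texture from several high-resolution slices informs each through-plane reconstruction.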
3. Alignment, Disentanglement, and Semantic Consistency
A central challenge is ensuring that transferred texture “lands” in semantically meaningful locations on the target instance and that structure (pose, geometry, viewpoint) is preserved:
- UV/Atlas Alignment: AUV-Net achieves unsupervised UV alignment by enforcing that all textures in a category lie within a low-dimensional basis; this forces correspondences such that, e.g., all eyes or wheels are consistently mapped, as evidenced by superior part IoU metrics (Chen et al., 2022).
- Semantic Feature Normalization: TriTex normalizes new target meshes into the same canonical frame. By extracting semantic features and processing them through a shared triplane/MLP pipeline, visually coherent texture mappings are learned from a single source and generalized to diverse new instances (Cohen-Bar et al., 20 Mar 2025).
- Conditional Disentanglement: Refaçade makes the separation of geometry and texture explicit by (1) removing all texture from the target using paired 3D renderings and diffusion models, and (2) eliminating structure from the texture donor via jigsaw permutation. Empirical ablation shows that both components are necessary: without them, either geometry or texture contamination results (Huang et al., 4 Dec 2025).
- Attention-based Semantic Fusion: Multi-reference attention models (ACVTT) use attention similarity and adaptive weighting to select texture-rich features for transfer, especially critical in medical or volumetric data where geometry and cross-view context diverge (Uhm et al., 24 Sep 2025).
4. Quantitative and Qualitative Outcomes
Performance of cross-instance texture transfer methods is typically evaluated on perceptual fidelity, geometric consistency, and semantic correspondence:
- Metrics: Common quantitative metrics include LPIPS, SSIM, PSNR, SIFID, and CLIP-based similarity. Task-specific measures, such as 3D part IoU or normal-map error, are also used.
- 3D and Medical Domains:
- TriTex achieves SIFID 0.22 (vs. 0.29–0.38), CLIP-sim 0.87, and 0.14 LPIPS on a 3D mesh benchmark, outperforming prior methods on both structural and perceptual similarity (Cohen-Bar et al., 20 Mar 2025).
- ACVTT reports PSNR values of 39–42 dB and SSIM of 0.94–0.97 on CT interpolation, with ablation confirming the essential role of multi-reference attention (removal degrades PSNR by ~0.9 dB) (Uhm et al., 24 Sep 2025).
- Pix2Surf is preferred by 100% of participants in qualitative studies over Shape-Context+TPS and pix2pix baselines for garment-to-SMPL transfer, particularly for preserving pattern edges and avoiding seam artifacts (Mir et al., 2020).
- General Image/Video Editing:
- Refaçade attains 0.7774 CLIP, 0.4516 DINO, 0.6181 LPIPS, and 0.8184 DreamSim on images, with user preference at 89.44%. On videos, background PSNR reaches 36.48 dB and foreground consistency is demonstrably superior to VideoPainter and other baselines (Huang et al., 4 Dec 2025).
- 3DSwapping achieves 0.9333 CLIP-Score and 0.1166 LPIPS (AlexNet), with user study rating of 4.54/5, supporting effective 3D texture propagation from sparse reference, outperforming 2D and text-driven alternatives (Cao et al., 24 Mar 2025).
- Efficiency and Scalability: Barycentric UV conversion enables real-time avatar personalization (14,743 s → 1.98 s for full transfer), eliminates inter-triangle seams, and achieves SSIM 0.98 with LPIPS 0.017—significantly ahead of previous affine-blend schemes (Song et al., 27 Aug 2025).
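For reference, the PSNR figures quoted in this section follow from a one-line computation over mean squared error; a minimal sketch, with images assumed scaled to `[0, peak]`:

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between a reference image and
    a reconstruction, both assumed to lie in [0, peak]."""
    mse = np.mean((reference.astype(float) - estimate.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

So the ~0.9 dB ablation drop reported for ACVTT corresponds to a measurable increase in reconstruction MSE, not a perceptual judgment.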
5. Limitations, Robustness, and Extensions
Robust cross-instance texture transfer frameworks confront several open challenges:
- Correspondence Generality: Most UV/atlas-based methods require at least rough semantic pose alignment or known regions of interest (e.g., facial meshes, anatomically similar organs) (Chen et al., 2022, Song et al., 27 Aug 2025). Generalizing to arbitrary topologies and to non-rigid or non-isomorphic surfaces remains non-trivial.
- Reference Dependency and Expressivity: Single-source methods risk missing unseen textural variations; approaches like TriTex cannot hallucinate novel detail beyond the source mesh (Cohen-Bar et al., 20 Mar 2025).
- Structural/Texture Entanglement: ControlNet-style and classical edge/depth guidance, without explicit disentanglement, can propagate undesirable structural information from the texture donor (Huang et al., 4 Dec 2025).
- Computational Constraints: Sliced optimal transport and large-scale pyramid frameworks yield interactive rates but may be constrained by decoder expressivity or dimensionality in high-resolution scenarios (Risser, 2020).
- Medical Imaging Anisotropy: Multi-model strategies (e.g., ACVTT) require discrete model instantiation per scaling factor and may benefit from further advances in continuous, task-adaptive reference selection (Uhm et al., 24 Sep 2025).
Suggested extensions include:
- Cross-species or universal mesh alignment via dense correspondence learning (Song et al., 27 Aug 2025)
- Multi-modal texture sources and cross-modal transfer (e.g., BRDFs, normal maps, depth) (Huang et al., 4 Dec 2025)
- Joint alignment of shape and texture mappings for fully instance-agnostic transfer (Cao et al., 24 Mar 2025)
- Rotation- or deformation-invariant semantic descriptors for unconstrained mesh domains (Cohen-Bar et al., 20 Mar 2025)
6. Representative Algorithms and Comparative Table
| Method/Framework | Core Mechanism | Distinctive Strength |
|---|---|---|
| AUV-Net (Chen et al., 2022) | UV mapping + subspace alignment | Unsupervised, 3D part alignment, supports generative 2D modeling |
| Pix2Surf (Mir et al., 2020) | Silhouette-to-UV CNN, no color input | Real-time, shape-only generalization, seamless 3D try-on |
| TriTex (Cohen-Bar et al., 20 Mar 2025) | Semantic triplane, Diff3F features | Single-mesh generalization, fast per-instance transfer |
| Barycentric UV (Song et al., 27 Aug 2025) | Precomputed per-triangle mapping | Extreme speed, no boundary seams, fixed correspondence |
| Refaçade (Huang et al., 4 Dec 2025) | Diffusion + explicit disentanglement | Controllable object retexturing, video/image, structure/texture separation |
| Optimal Textures (Risser, 2020) | OT in feature space, multi-scale | Fast, interactive, robust to complex textures |
| ACVTT (Uhm et al., 24 Sep 2025) | Multi-view attention fusion | Medical/interpolation, cross-slice semantic transfer |
| 3DSwapping (Cao et al., 24 Mar 2025) | Progressive multi-view diffusion | 3D Gaussian splats, CLIP-guided, view-consistency |
7. Empirical Benchmarks and Impact
Empirical studies consistently demonstrate that methods combining semantic alignment (AUV-Net, TriTex), attention-guided fusion (ACVTT), or explicit disentanglement (Refaçade, 3DSwapping) outperform both naïve pixel- or edge-based baselines and legacy optimization-driven style transfer. The impact spans fields from interactive digital avatars and immersive XR to quantitative medical analysis and unsupervised single-image attribute propagation. State-of-the-art frameworks exhibit robust generalization across deformations, varying illumination, and diverse target instances (Mir et al., 2020, Huang et al., 4 Dec 2025, Song et al., 27 Aug 2025, Chen et al., 2022, Cohen-Bar et al., 20 Mar 2025, Risser, 2020, Uhm et al., 24 Sep 2025, Cao et al., 24 Mar 2025).
Ongoing research is directed at increasing the expressivity, domain generality, and efficiency of cross-instance texture transfer, particularly for unsupervised, dynamic, or multi-modal scenarios.