2D–3D Aligned Proxy Representation
- The paper introduces a proxy representation that bridges 2D observations and 3D entities to transfer geometric, semantic, and appearance information.
- It details methodologies like pixel–voxel correspondences, shared latent spaces, and explicit semantic proxies to achieve robust cross-modal alignment.
- Empirical results demonstrate improved segmentation, 3D generation, and editing performance through enhanced proxy alignment and joint representation learning.
A 2D–3D aligned proxy representation denotes a structured modality-bridging construct that encodes geometric, semantic, or appearance relationships between two-dimensional (2D) observations (e.g., images, projections, or labels) and three-dimensional (3D) entities (e.g., point clouds, meshes, volumetric fields). Such representations are designed to enable cross-modal alignment, supervision, or editing by translating explicit or latent correspondences between the 2D and 3D domains. They feature prominently in modern vision, graphics, and multimodal learning frameworks, underpinning architectural designs for joint representation learning, cross-modal generation, video editing, annotation bootstrapping, and controllable asset synthesis.
1. Foundational Motivations and Principles
The impetus for 2D–3D aligned proxy representations arises from the complementary strengths and limitations of 2D and 3D modalities. While 2D data is abundant and often richly annotated (e.g., text-image pairs, 2D segmentation masks), 3D data acquisition remains costly, irregular, and sparse, particularly for tasks requiring direct pairing with semantic or linguistic supervision (Huang et al., 2023, Zhou et al., 2023). The key principle is to construct an auxiliary (proxy) structure that renders information from 2D accessible in 3D and vice versa, enabling transfer of semantics, supervision, or controls without modality-specific ground truth.
Alignment is achieved either by spatial correspondences (e.g., projecting 3D points into 2D pixels under synchronized cameras), encoder alignment in shared latent spaces (e.g., joint contrastive objectives), or through proxy generation (e.g., canonical meshes, semantic box rooms, synthetic cuboids) that structurally bridge the two domains. The aligned proxy thus supplies the locus for training, editing, or inference signal propagation.
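The first of these routes, spatial correspondence, reduces to projecting 3D points through a camera model and matching them to pixels. A minimal pinhole-projection sketch in NumPy (the intrinsics `K`, pose `(R, t)`, and points below are illustrative, not taken from any cited system):

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project Nx3 world points into pixel coordinates with a pinhole camera."""
    cam = points_3d @ R.T + t          # world frame -> camera frame
    uv = cam @ K.T                     # apply intrinsics
    uv = uv[:, :2] / uv[:, 2:3]        # perspective divide
    return uv, cam[:, 2]               # pixel coordinates and depths

# Toy setup: identity rotation, camera 2 m in front of the origin.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])
pts = np.array([[0.0, 0.0, 0.0], [0.1, -0.1, 0.5]])
uv, depth = project_points(pts, K, R, t)
```

Pixel–voxel pairs are then formed by keeping projections whose pixel (and depth) distance falls below a small threshold, which is the matching step the contrastive losses below operate on.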
2. Architectures and Construction Strategies
a) Pixel–Voxel Correspondence and Shared Latent Spaces
A canonical form aligns latent features via precise spatial correspondences. For example, Text4Point projects 3D voxels into 2D images using RGB-D camera geometry; pixel–voxel pairs within a small geometric threshold are matched, enabling a contrastive InfoNCE loss to tightly couple CLIP image encoder features and learned 3D point features (Huang et al., 2023). This alignment propagates semantics from 2D–text CLIP space to 3D, even absent direct 3D–text supervision.
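A generic pixel–voxel InfoNCE coupling of this kind can be sketched as follows. This is a simplified NumPy version, not Text4Point's actual implementation; the feature matrices and temperature `tau` are illustrative, and row `i` of each matrix is assumed to be a matched pixel–voxel pair:

```python
import numpy as np

def info_nce(feat_2d, feat_3d, tau=0.07):
    """Symmetric InfoNCE over N matched 2D/3D feature pairs (each N x D)."""
    a = feat_2d / np.linalg.norm(feat_2d, axis=1, keepdims=True)
    b = feat_3d / np.linalg.norm(feat_3d, axis=1, keepdims=True)
    logits = a @ b.T / tau                       # cosine similarity matrix
    n = len(a)
    # 2D -> 3D direction: the i-th pixel's positive is the i-th voxel.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_2d_to_3d = -log_prob[np.arange(n), np.arange(n)].mean()
    # 3D -> 2D direction: transpose and repeat.
    log_prob_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_3d_to_2d = -log_prob_t[np.arange(n), np.arange(n)].mean()
    return 0.5 * (loss_2d_to_3d + loss_3d_to_2d)
```

Minimizing this loss pulls each voxel feature toward its matched pixel's CLIP-space feature while pushing it away from all other pixels in the batch, which is how 2D–text semantics leak into the 3D encoder without 3D–text pairs.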
Similarly, Uni3D initializes a Vision Transformer for 3D point clouds, aligning their global feature vectors with those of CLIP image and text encoders via a four-term symmetric contrastive loss, using batches of (point cloud, image, caption) triplets (Zhou et al., 2023). The resulting global proxy embedding provides a foundation for 3D–2D retrieval, zero-shot transfer, and open-vocabulary tasks.
b) Semantic Proxies and 3D Box Representations
For scene synthesis and editing, explicit 3D semantic proxies are constructed. ControlRoom3D defines a proxy room as a set of axis-aligned semantic boxes (bed, table, etc.), each parameterized by center, size, class, and ID (Schult et al., 2023). This proxy room is rendered into 2D via camera projection and rasterization to yield dense semantic, instance, and depth maps, providing direct control tensors to a 2D diffusion model. These maps serve as a bridge, enabling the 2D generator to impose global consistency and geometric plausibility on synthesized 3D rooms.
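The box-to-control-map step can be illustrated with a heavily simplified top-down orthographic rasterizer (ControlRoom3D renders with full perspective cameras; the `(cx, cy, cz, sx, sy, sz, class)` box tuples, resolution, and extent here are illustrative):

```python
import numpy as np

def render_topdown(boxes, res=64, extent=4.0):
    """Rasterize axis-aligned semantic boxes into top-down semantic and height maps.

    Each box is (cx, cy, cz, sx, sy, sz, class_id); the scene spans
    [-extent, extent] in x and y. Taller boxes win overlapping cells.
    """
    sem = np.zeros((res, res), dtype=np.int32)    # 0 = empty
    height = np.zeros((res, res))
    scale = res / (2.0 * extent)
    for cx, cy, cz, sx, sy, sz, cls in boxes:
        x0 = int((cx - sx / 2 + extent) * scale)
        x1 = int((cx + sx / 2 + extent) * scale)
        y0 = int((cy - sy / 2 + extent) * scale)
        y1 = int((cy + sy / 2 + extent) * scale)
        top = cz + sz / 2                          # top face height
        region = height[y0:y1, x0:x1]
        mask = top > region                        # keep the tallest box per cell
        sem[y0:y1, x0:x1][mask] = cls
        height[y0:y1, x0:x1] = np.where(mask, top, region)
    return sem, height
```

The resulting semantic and height (depth-like) maps play the role of the dense control tensors fed to the 2D diffusion model.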
In ALPI, synthetic 3D proxy cuboids are generated from 2D bounding boxes and class-conditional size priors, placed within LiDAR frustums. Full 3D annotations for these proxies then supervise a frustum-based 3D detector under mixed 2D–3D loss, while "real" instances are supervised only by 2D projections (Lahlali et al., 2024).
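The proxy-cuboid construction might be sketched as below. This is a simplification of ALPI's pipeline: `SIZE_PRIORS`, `place_proxy_cuboid`, and the fixed depth are illustrative, and the real method places proxies inside LiDAR frustums rather than assuming a known depth:

```python
import numpy as np

# Assumed class-conditional size prior: (length, width, height) in metres.
SIZE_PRIORS = {"car": (4.5, 1.9, 1.6)}

def place_proxy_cuboid(box2d, depth, K, cls):
    """Back-project the 2D box centre along its pixel ray at an assumed depth
    and attach the class size prior, yielding a synthetic 3D proxy cuboid."""
    u = (box2d[0] + box2d[2]) / 2.0               # box2d = (x0, y0, x1, y1)
    v = (box2d[1] + box2d[3]) / 2.0
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # pixel ray direction
    centre = ray / ray[2] * depth                    # 3D point at that depth
    return {"center": centre, "size": SIZE_PRIORS[cls], "class": cls}

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
cuboid = place_proxy_cuboid((300, 220, 340, 260), 10.0, K, "car")
```

Because the proxy's full 3D extent is known by construction, it can supervise the detector with complete 3D losses while real objects contribute only reprojection terms.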
c) Proxy Meshes, Shells, and UV Scattering
Bringing Your Portrait to 3D Presence introduces a Dual-UV proxy representation. Here, each image is mapped into a canonical UV space via deterministic rasterization and feature scattering based on a tracked human mesh, with separate branches for on-surface (Core-UV) and off-surface (Shell-UV) regions (Zhang et al., 27 Nov 2025). The proxy mesh, constructed from multi-expert estimators and visibility correction, ensures that feature tokens are spatially and semantically consistent in both 2D image space and 3D geometric space.
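Generic UV feature scattering (a bare-bones sketch, not the paper's dual-branch Core-UV/Shell-UV scheme with visibility correction) amounts to binning per-pixel features into a canonical UV grid:

```python
import numpy as np

def scatter_to_uv(feats, uv_coords, uv_res=32):
    """Scatter N per-pixel feature vectors (N x D) into a UV grid, averaging
    features that land in the same texel; uv_coords are N x 2 in [0, 1]."""
    d = feats.shape[1]
    grid = np.zeros((uv_res, uv_res, d))
    count = np.zeros((uv_res, uv_res, 1))
    idx = np.clip((uv_coords * uv_res).astype(int), 0, uv_res - 1)
    for (u, v), f in zip(idx, feats):
        grid[v, u] += f                  # accumulate features per texel
        count[v, u] += 1
    return grid / np.maximum(count, 1)   # mean over colliding pixels
```

The key property this buys is that a given texel always corresponds to the same surface point of the tracked mesh, regardless of viewpoint, so features from different images land in consistent locations.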
d) Specialized Embeddings for Conditioned Generation
BAG constructs four orthogonal "XYZ-map" images by projecting the 3D body mesh into canonical coordinates, storing per-pixel 3D positions as normalized RGB values (Luo et al., 27 Jan 2025). These images act as spatially aligned proxies, conditioning ControlNet-based multiview diffusion models to generate conformant wearable assets for the target body.
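A point-splat approximation of one such XYZ-map view might look like the following (BAG rasterizes the full mesh into four canonical views; this sketch merely splats vertices into a single front-facing orthographic image):

```python
import numpy as np

def xyz_map(vertices, res=64):
    """Render a front-view orthographic XYZ-map: each covered pixel stores the
    normalized 3D position of its frontmost (largest-z) vertex as an RGB triple."""
    lo, hi = vertices.min(0), vertices.max(0)
    norm = (vertices - lo) / (hi - lo)            # positions -> [0, 1] "RGB"
    img = np.zeros((res, res, 3))
    zbuf = np.full((res, res), -np.inf)
    px = np.clip((norm[:, :2] * (res - 1)).astype(int), 0, res - 1)
    for (x, y), z, rgb in zip(px, vertices[:, 2], norm):
        if z > zbuf[y, x]:                        # keep the vertex nearest the viewer
            zbuf[y, x] = z
            img[y, x] = rgb
    return img
```

Because each pixel of the map literally encodes a 3D coordinate, a ControlNet conditioned on it receives per-pixel geometric anchors rather than a mere silhouette.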
3DProxyImg aligns a partial depth-inferred point cloud with a generative mesh, sparsifies it to proxy vertices, and stores per-vertex learned features. Rendering is performed by projecting these 3D proxies into 2D for each view and decoding with an MLP, thus separating geometry control in 3D from appearance synthesis in 2D (Zhu et al., 17 Dec 2025).
3. Learning Objectives and Alignment Mechanisms
The central mechanism in 2D–3D aligned proxy representations is the enforcement of consistent geometric and/or semantic structure across modalities via tailored losses or architectural constraints.
- Contrastive objectives: InfoNCE, cosine similarity, or cross-entropy losses are used to align feature embeddings (e.g., 3D encoder output vs. 2D CLIP embedding, SITA-VAE's triplet contrastive terms) (Huang et al., 2023, Zhao et al., 2023, Zhou et al., 2023).
- Cross-modal decoders: Joint masked autoencoders reconstruct masked portions of both 2D images and 3D point clouds, with attention masks or cross-reconstruction losses (e.g., enforcing that a reconstructed point cloud projects back into consistent depth/appearance images) (Guo et al., 2023).
- Geometry-alignment penalties: Masked depth alignment, normal preservation, and inpainting regularizers ensure that generated 3D structures fit within the constraints of user-edited or algorithmically specified semantic proxies (Schult et al., 2023).
- Proxy-injection and iterative pseudo-labeling: In weak-labeling scenarios, synthetic proxies with known geometry are injected into training streams to provide high-fidelity 3D supervision and enable iterative refinement via pseudo-labeling (Lahlali et al., 2024).
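As one concrete instance of a geometry-alignment penalty, a masked depth term of this general shape penalizes generated depth only where the proxy defines geometry (function name and array shapes are illustrative, not the papers' exact formulation):

```python
import numpy as np

def masked_depth_loss(pred_depth, proxy_depth, mask):
    """L1 depth-alignment penalty averaged over pixels where mask == 1,
    i.e. where the semantic proxy prescribes geometry."""
    diff = np.abs(pred_depth - proxy_depth)
    return (diff * mask).sum() / np.maximum(mask.sum(), 1)
```

Regions outside the mask (e.g., free space or user-editable areas) are left unconstrained, which is what lets the 2D generator hallucinate appearance while the proxy pins down layout.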
4. Applications and Impact across Modalities and Tasks
2D–3D aligned proxy representations have catalyzed progress across a range of vision and graphics domains:
| Domain/Task | Example Proxy | Mechanism |
|---|---|---|
| Language-guided 3D segmentation | Pixel–voxel pairs | CLIP-aligned contrastive losses, TQM (Huang et al., 2023) |
| Multimodal 3D shape generation | Latent triplets | SITA-VAE/ASLDM: aligned latent diffusion (Zhao et al., 2023) |
| Scene/room synthesis | Semantic boxes | Rasterized control maps, 2D LDM conditioning (Schult et al., 2023) |
| Wearable asset generation | Canonical XYZ maps | Dense 2D–3D alignment for control, diffusion (Luo et al., 27 Jan 2025) |
| Instance/video editing | Proxy mesh/3DGS | Rigid/nonrigid mesh, ARAP, 2D projection (Liu et al., 27 Jun 2025, Xie et al., 8 Jul 2025) |
| Human/avatar reconstruction | Dual-UV meshes | UV scattering from multi-expert SMPL-X tracking (Zhang et al., 27 Nov 2025) |
Such proxies enable zero-shot or few-shot transfer, physically plausible instance editing, robust 3D annotation from weak 2D labels, cross-modal retrieval, and genuinely joint 2D–3D representation learning. Quantitatively, their use has led to significant gains in semantic segmentation (+4.0 mIoU on S3DIS (Huang et al., 2023)), open-set classification (e.g., Uni3D-g, +33% top-1 on Objaverse-LVIS over prior models (Zhou et al., 2023)), and geometry fidelity in 3D generation (e.g., SITA-VAE: IoU 0.966 vs. 0.955 for the baseline (Zhao et al., 2023)).
5. Advantages, Ablations, and Limitations
Empirical ablations across frameworks consistently show that:
- Proxy alignment is necessary for effective cross-modal transfer; e.g., TQM without contrastive 2D–3D pre-training yields no real gain; proxy losses or synthetic injections close much of the gap to fully-supervised 3D (Huang et al., 2023, Lahlali et al., 2024).
- Rich data structures and explicit geometry in proxies constrain the ill-posedness of weak or ambiguous supervision signals.
- Proxy approaches are robust to noisy priors and quantization, with iterative refinement schemes (pseudo-labels, Sim(3) alignment) further closing remaining performance gaps.
However, deterministic proxy-based pipelines can introduce biases when the underlying proxy mapping is limited (e.g., 2D→3D gesture lifting degrades naturalness compared to direct 3D generation (Guichoux et al., 2024)), or when proxy geometry is inaccurate or over-constrained (e.g., mesh misalignment in portrait reconstructions (Zhang et al., 27 Nov 2025)).
6. Extensions and Future Directions
Recent trends point toward generalization of proxy representation principles:
- Extension to time and temporal consistency by embedding 3D proxies as time-indexed tracks (e.g., motion-editable video proxy meshes (Liu et al., 27 Jun 2025)).
- Data-driven synthetic proxy generation for domains lacking dense annotations, and hybrid real–synthetic training manifolds leveraging photorealistic and geometry-consistent branches (Zhang et al., 27 Nov 2025).
- Application to controllable editing in the wild, leveraging advances in 3D-aware generators and differentiable inpainting guided by proxy geometry (Xie et al., 8 Jul 2025).
Future research is likely to focus on increasing the fidelity of proxies, improving weak-/self-supervision modes, minimizing information loss during projection, and designing proxy representations robust to noisy, sparse, or occluded data. These advances will further consolidate the central role of 2D–3D aligned proxy representations in bridging modalities for geometry-aware perception, generation, and interaction.