PhysXGen Dual-Branch Framework
- PhysXGen is a dual-branch framework that explicitly decouples 3D geometry from physical properties to generate assets with both visual and simulation-ready fidelity.
- It utilizes a parallel VAE architecture and latent diffusion transformer to fuse structural and physical cues, ensuring realistic per-voxel annotations.
- The framework outperforms geometry-only baselines on both appearance and physical-property metrics, and supports applications in robotic manipulation, digital twins, and embodied AI.
The PhysXGen dual-branch framework is a feed-forward architecture designed to generate 3D assets grounded not only in geometric and appearance fidelity but also annotated with plausible physical properties. PhysXGen addresses limitations in conventional 3D generative models, which predominantly emphasize geometric structure and visual texture, by explicitly disentangling the prediction of 3D structure from the modeling of physical asset attributes. The pipeline employs a dual-branch variational autoencoder (VAE) and a latent diffusion transformer, fusing physical knowledge into pre-trained geometric representations and enabling per-voxel prediction across multiple physics dimensions. This method generates physically annotated 3D meshes from single RGB images, producing outputs suitable for downstream physical simulation and embodied AI applications (Cao et al., 16 Jul 2025).
1. Pipeline Overview and Input Modalities
PhysXGen processes a single RGB input image depicting a real-world object, extracting three distinct types of feature representations:
- Structural features: generated by a frozen DINOv2 backbone applied to multi-view mesh renderings, capturing geometry and appearance.
- Textual/semantic embeddings: obtained with CLIP from basic, functional, and kinematic descriptions.
- Physical property vectors: per-voxel concatenation of absolute scale, affordance priority, density, and kinematic parameters (parent/child indices, movement direction, location, range, and type).
Following feature extraction, encoding and decoding are handled by two parallel branches (structural and physical), processed through VAEs and a dual-branch latent diffusion transformer.
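The per-voxel concatenation of physical attributes can be sketched as follows. The shapes are illustrative assumptions (scalar scale, affordance, and density per voxel, with the remaining channels carrying kinematics), chosen so the result matches the $14$-channel physical grid described below; the paper's exact channel layout is not specified here.

```python
import numpy as np

def build_physical_vector(scale, affordance, density, kinematics):
    """Concatenate per-voxel physical attributes into one property vector.

    Shapes are illustrative assumptions, not the paper's exact layout:
    scale (N, 1), affordance (N, 1), density (N, 1), kinematics (N, k)
    for N active voxels.
    """
    return np.concatenate([scale, affordance, density, kinematics], axis=-1)

# Toy example: 4 active voxels, 11 kinematic parameters per voxel.
n = 4
phys = build_physical_vector(
    np.ones((n, 1)),       # absolute scale
    np.zeros((n, 1)),      # affordance priority
    np.full((n, 1), 0.5),  # material density
    np.zeros((n, 11)),     # kinematics: parent/child indices, direction, ...
)
print(phys.shape)  # (4, 14)
```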
2. Dual-Branch VAE Architecture
PhysXGen’s architecture comprises two main branches:
- Structural Branch: utilizes the pre-trained TRELLIS VAE encoder on voxel features, with a bottleneck latent dimension of $8$. Decoding employs $12$ transformer blocks ($12$ heads per block), producing a density grid and a color grid over the voxel volume. Only adapter layers (1×1 conv + GeLU) are trainable; TRELLIS weights remain frozen except where conditioned on physics signals.
- Physical-Property Branch: the encoder fuses the physical property vectors and semantic embeddings into an $8$-dimensional latent per voxel using $4$ transformer blocks ($12$ heads each). The decoder reconstructs $14$-channel physical grids and semantic grids with $4$ transformer blocks ($16$ heads each).
A residual path from the physical decoder injects physical feature maps into corresponding layers of the structural decoder, enabling physics-informed geometry decoding.
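The residual path can be sketched in a framework-agnostic way: a 1×1 convolution over channels is a per-voxel linear map, so the adapter reduces to a linear projection followed by GeLU, added to the frozen structural activations. All sizes and weights below are placeholders, not the model's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def adapter(x, w, b):
    """1x1 conv + GeLU adapter (a 1x1 conv over channels is a per-voxel
    linear map). Uses the tanh approximation of GeLU."""
    h = x @ w + b
    return 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))

# Frozen structural decoder activations (N voxels, C_s channels) and
# physical decoder features (N voxels, C_p channels); sizes are assumptions.
n, c_s, c_p = 8, 16, 14
struct_feat = rng.normal(size=(n, c_s))
phys_feat = rng.normal(size=(n, c_p))

# Residual injection: project physical features into the structural
# channel space and add them to the frozen activations.
w, b = rng.normal(size=(c_p, c_s)) * 0.01, np.zeros(c_s)
fused = struct_feat + adapter(phys_feat, w, b)
print(fused.shape)  # (8, 16)
```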
3. Latent Fusion and Diffusion Generation
PhysXGen models the correlation and interplay between geometry and physics via two mechanisms:
- Residual Adapters in VAE: physical feature maps from the physical decoder are projected into the intermediate layers of the structural decoder, conditioning geometric generation on physics.
- Dual-Branch Latent Diffusion: Two parallel U-Net transformer streams represent structural and physical latents. Within diffusion blocks, frozen structural activations are linearly projected into the physical stream via skip-connections, ensuring the geometry prior informs physics field generation.
Sampling the diffusion model yields paired structural and physical latent representations, which are decoded into a textured mesh and per-voxel physics fields.
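The cross-stream wiring inside a diffusion block can be illustrated with a minimal sketch. The branch networks are replaced by identity placeholders so that only the fusion pattern is shown: the frozen structural activation is linearly projected and skip-connected into the physical stream, while the structural stream itself is left untouched.

```python
import numpy as np

rng = np.random.default_rng(1)

def diffusion_block(z_struct, z_phys, w_proj):
    """Illustrative dual-branch step: the structural activation conditions
    the physical stream via a linear-projection skip-connection."""
    z_struct_out = z_struct                  # frozen structural branch
    z_phys_out = z_phys + z_struct @ w_proj  # physics informed by geometry
    return z_struct_out, z_phys_out

n, d_s, d_p = 8, 8, 8   # 8-dim latents per voxel, as in the VAE bottleneck
z_s = rng.normal(size=(n, d_s))
z_p = rng.normal(size=(n, d_p))
w = rng.normal(size=(d_s, d_p)) * 0.1
z_s2, z_p2 = diffusion_block(z_s, z_p, w)
print(z_p2.shape)  # (8, 8); z_s2 is unchanged
```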
4. Training Objectives and Loss Functions
PhysXGen is optimized with compound losses:
- VAE Objective: $\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{struct}} + \mathcal{L}_{\text{phys}} + \mathcal{L}_{\text{align}}$, where
- $\mathcal{L}_{\text{struct}}$: sum of color prediction error, geometry grid error (mask, normals, density), and a KL divergence on the latents.
- $\mathcal{L}_{\text{phys}}$: reconstruction losses over predicted versus ground-truth physics attributes and semantic grids.
- $\mathcal{L}_{\text{align}}$: latent alignment loss, enforcing proximity between projected structural and physical latents in a shared subspace.
- Latent Diffusion Objective: a denoising score-matching loss of the form $\mathcal{L}_{\text{diff}} = \mathbb{E}\big[\lVert \epsilon - \epsilon^{s}_{\theta}(z^{s}_{t}, t) \rVert^{2} + \lVert \epsilon - \epsilon^{p}_{\theta}(z^{p}_{t}, t, z^{s}) \rVert^{2}\big]$,
where $\epsilon^{s}_{\theta}$ and $\epsilon^{p}_{\theta}$ are the score networks for the structural and physical branches.
This training scheme is designed to ensure both high-fidelity geometry and physically plausible annotations, with fusion losses driving joint representation.
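The compound VAE objective can be sketched as a simple sum of the three terms. The term weights and the alignment projection are placeholders; the paper's exact weighting is not reproduced here.

```python
import numpy as np

def vae_loss(color_err, geom_err, kl, phys_err, sem_err,
             z_s_proj, z_p_proj, w_align=1.0):
    """Compound objective sketch: structural reconstruction + KL,
    physical/semantic reconstruction, and a latent alignment term that
    pulls projected structural and physical latents together."""
    l_struct = color_err + geom_err + kl
    l_phys = phys_err + sem_err
    l_align = np.mean((z_s_proj - z_p_proj) ** 2)
    return l_struct + l_phys + w_align * l_align

# Identical projected latents -> alignment term is exactly zero.
z = np.ones((4, 8))
total = vae_loss(0.2, 0.3, 0.05, 0.1, 0.1, z, z)
print(total)  # 0.75
```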
5. Injection of Physical Knowledge and Pre-training Strategy
PhysXGen initializes the structural branch from frozen TRELLIS encoder and decoder weights, while inserting lightweight adapter layers into the decoder blocks. Each adapter receives feature maps from the physical decoder, and only these adapters and the physical VAE components are fine-tuned. The bulk of the geometric decoder remains unchanged, preserving previously acquired geometric priors while enabling conditioning on specific physical features. This scheme supports efficient transfer of physical annotation capabilities onto pre-existing geometric models.
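The selective fine-tuning rule can be sketched as a filter over named parameters. The parameter names below are hypothetical, chosen only to illustrate the rule: adapter layers and physical-branch components train, everything inherited from TRELLIS stays frozen.

```python
# Hypothetical parameter names; only the filtering rule is from the text.
params = {
    "trellis.encoder.block0.attn":   "frozen",
    "trellis.decoder.block3.mlp":    "frozen",
    "trellis.decoder.block3.adapter": "trainable",  # 1x1 conv + GeLU insert
    "phys_vae.encoder.block1":       "trainable",
    "phys_vae.decoder.block0":       "trainable",
}

def trainable(name):
    """A parameter trains iff it is an adapter or part of the physical VAE."""
    return "adapter" in name or name.startswith("phys_vae.")

for name, status in params.items():
    assert (status == "trainable") == trainable(name)
print(sorted(n for n in params if trainable(n)))
```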
6. Experimental Protocols, Evaluation Metrics, and Results
PhysXGen is trained and validated on the PhysXNet dataset, which is systematically annotated across absolute scale, material, affordance, kinematics, and function description. The model is trained on 24K instances with separate validation and test splits (1K each), optimized with AdamW on NVIDIA A100 GPUs.
Evaluation metrics include:
- Geometry/Appearance: PSNR (higher is better), Chamfer Distance (lower is better), and F-score (higher is better).
- Physical Properties: MAE for absolute scale, material density, affordance rank, and kinematic parameters, plus a cosine-similarity-based distance on function-description embeddings (all lower is better).
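Two of these metrics are standard and easy to state precisely; a minimal sketch, assuming a data range of $1.0$ for rendered views:

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio over rendered views (higher is better)."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def mae(pred, target):
    """Mean absolute error for per-voxel attributes (lower is better)."""
    return np.mean(np.abs(pred - target))

# A uniform error of 0.01 on unit-range data gives MSE = 1e-4 -> PSNR = 40 dB.
target = np.linspace(0.0, 1.0, 16)
pred = target + 0.01
print(round(psnr(pred, target), 2))  # 40.0
print(round(mae(pred, target), 3))   # 0.01
```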
A comparative summary is provided:
| Method | PSNR ↑ | CD ↓ | F-Score ↑ | Scale ↓ | Mat. ↓ | Afford. ↓ | Kin. ↓ | Func. Desc. ↓ |
|---|---|---|---|---|---|---|---|---|
| TRELLIS | 24.31 | 13.2 | 76.9 | – | – | – | – | – |
| TRELLIS+PhysPre | 24.31 | 13.2 | 76.9 | 12.46 | 0.262 | 0.435 | 0.589 | 1.01 |
| PhysXGen | 24.53 | 12.7 | 77.3 | 6.63 | 0.141 | 0.372 | 0.479 | 0.71 |
Ablation experiments demonstrate that enabling latent correlation in both VAE and diffusion modules yields further improvements in physical alignment, geometry fidelity, and appearance metrics. The framework produces meshes with geometry aligned to input views and per-voxel physics fields (heatmaps for scale, density, affordance ranking, joint axes) with quantitative alignment to ground truth outperforming prior baselines (Cao et al., 16 Jul 2025).
7. Significance and Implications
PhysXGen establishes an explicit disentanglement of structure and physics latents, integrating physical knowledge into pre-existing generative 3D representations through residual fusion and skip-connection conditioning between branches. This enables end-to-end, physically grounded 3D asset generation, yielding assets suitable for real-world simulation and embodied AI tasks that require accurate physical annotations. A plausible implication is that dual-branch learning frameworks such as PhysXGen may catalyze progress in domains where physical properties are critical, such as robot manipulation, digital twin creation, and simulation-centric design tasks. The method’s release of code and data aims to facilitate ongoing research in generative physical AI.