SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge

Published 1 Dec 2025 in cs.CV and cs.RO | (2512.01629v2)

Abstract: Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling. Project page: https://heyumeng.com/SPARK/index.html.


Summary

The paper introduces SPARK, which integrates VLM-based URDF estimation and a Diffusion Transformer for high-fidelity, simulation-ready 3D mesh generation.
It employs hierarchical attention and differentiable joint optimization to ensure coherent part-level geometry and accurate kinematic structures.
In comparative evaluations, SPARK achieves lower Chamfer Distance and higher F-Score than state-of-the-art methods in articulated object reconstruction.

Articulated Reconstruction with SPARK Framework

The paper "SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge" introduces a framework for reconstructing simulation-ready, articulated 3D objects from a single RGB image. SPARK uses Vision-Language Models (VLMs) to infer coarse Unified Robot Description Format (URDF) parameters and employs a Diffusion Transformer (DiT) to synthesize high-quality articulated meshes with realistic textures. The framework combines part-image guidance with hierarchical attention mechanisms to ensure kinematic consistency, and it refines the URDF parameter estimates through differentiable forward kinematics and rendering.

Methodological Framework

SPARK begins with VLM-guided parsing to extract initial URDF parameters and part-level reference images. These parameters include the types and attributes of joints and links, providing a structural prior for the subsequent mesh synthesis. The pipeline then combines these VLM-derived priors with a generative diffusion process driven by a DiT, reconstructing the individual parts and the complete object together.
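To make "coarse URDF parameters" concrete, the following minimal sketch shows what such an estimate might look like once serialized. The `Joint` container, the `build_urdf` helper, and the cabinet values are hypothetical illustrations, not the paper's data format or code. Only the URDF element names (`robot`, `link`, `joint`, `origin`, `axis`, `limit`) come from the URDF format itself.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Joint:
    """Hypothetical container for one coarse joint estimate."""
    name: str
    joint_type: str  # URDF joint type, e.g. "revolute", "prismatic", "fixed"
    parent: str
    child: str
    origin_xyz: tuple = (0.0, 0.0, 0.0)
    axis_xyz: tuple = (0.0, 0.0, 1.0)
    lower: float = 0.0
    upper: float = 0.0


def _fmt(values):
    """Format a numeric tuple as a space-separated URDF attribute."""
    return " ".join(f"{v:g}" for v in values)


def build_urdf(robot_name, links, joints):
    """Serialize link names and joint estimates into a URDF string."""
    robot = ET.Element("robot", name=robot_name)
    for link in links:
        ET.SubElement(robot, "link", name=link)
    for j in joints:
        elem = ET.SubElement(robot, "joint", name=j.name, type=j.joint_type)
        ET.SubElement(elem, "origin", xyz=_fmt(j.origin_xyz), rpy="0 0 0")
        ET.SubElement(elem, "parent", link=j.parent)
        ET.SubElement(elem, "child", link=j.child)
        if j.joint_type != "fixed":
            # Movable joints carry an axis and motion limits.
            ET.SubElement(elem, "axis", xyz=_fmt(j.axis_xyz))
            ET.SubElement(elem, "limit", lower=f"{j.lower:g}",
                          upper=f"{j.upper:g}", effort="10", velocity="1")
    return ET.tostring(robot, encoding="unicode")


# Illustrative values only: a cabinet body with one hinged door.
cabinet_urdf = build_urdf(
    "cabinet",
    ["body", "door"],
    [Joint("door_hinge", "revolute", "body", "door",
           origin_xyz=(0.3, 0.2, 0.0), axis_xyz=(0, 0, 1),
           lower=0.0, upper=1.57)],
)
print(cabinet_urdf)
```

The joint type, axis, and origin fields in this sketch are the quantities that the later refinement stage adjusts.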

Figure 1: Pipeline Overview. The SPARK framework integrates VLM and DiT to reconstruct and texture articulated meshes from a single image, optimizing URDF parameters via differentiable techniques.

The hierarchical attention mechanisms employed in SPARK are pivotal. They enable the accurate synthesis of part-level geometries by incorporating both local and global visual cues, ensuring that the assembled 3D objects maintain structural coherence and accurate articulation. This attention to kinematic detail is critical for producing simulation-compatible assets.
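This summary does not specify how the local and global attention levels are implemented. As a rough illustration only, one common way to combine the two is to mask attention so that a "local" layer lets each token attend only to tokens of its own part, while a "global" layer lets every token attend to all tokens. The function names and the `part_ids` layout below are assumptions, not SPARK's architecture.

```python
import math


def attention_mask(part_ids, level):
    """Boolean mask: entry [i][j] is True if token i may attend to token j."""
    n = len(part_ids)
    if level == "global":
        return [[True] * n for _ in range(n)]
    if level == "local":
        return [[part_ids[i] == part_ids[j] for j in range(n)] for i in range(n)]
    raise ValueError(f"unknown level: {level}")


def masked_attention_weights(scores, mask):
    """Row-wise softmax over the allowed entries; masked entries get weight 0."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Three tokens: the first two belong to part 0, the last to part 1.
part_ids = [0, 0, 1]
scores = [[0.0] * 3 for _ in range(3)]  # uniform scores for illustration
local = masked_attention_weights(scores, attention_mask(part_ids, "local"))
global_ = masked_attention_weights(scores, attention_mask(part_ids, "global"))
```

With uniform scores, the local layer spreads weight only within each part, whereas the global layer spreads it across the whole object.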

Comparative Evaluation

The paper provides extensive quantitative evaluations comparing SPARK with existing state-of-the-art methods in part-level 3D object generation and URDF estimation. The results demonstrate SPARK's superior performance in both domains, showcasing its capability to produce physically consistent and high-fidelity articulated objects.

Figure 2: Qualitative Comparison on Shape Reconstruction. SPARK outperforms other methods in detail and fidelity, as illustrated in comparisons with OmniPart and PartCrafter.

Figure 3: Qualitative Comparison on URDF Estimation. The framework achieves more accurate URDF parameters compared to baselines, enabling realistic articulated motions.

SPARK's performance is highlighted by its lower Chamfer Distance and higher F-Score in geometric metrics, indicative of its precise reconstruction capabilities. Additionally, the framework effectively reduces AxisErr and PivotErr in URDF estimation, underscoring its accuracy in joint parameter retrieval.
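For readers unfamiliar with these metrics, here is a brute-force sketch of how they are commonly computed on sampled point sets. Conventions vary (squared or unsquared distances, summed or averaged directions, the choice of threshold), and the paper's exact definitions of Chamfer Distance, F-Score, and AxisErr may differ. The function names and the threshold `tau` are assumptions made for illustration.

```python
import math


def _nearest_distances(source, target):
    """Distance from each source point to its nearest target point."""
    return [min(math.dist(p, q) for q in target) for p in source]


def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: mean nearest-neighbor distance, both ways."""
    d_pred = _nearest_distances(pred, gt)
    d_gt = _nearest_distances(gt, pred)
    return sum(d_pred) / len(d_pred) + sum(d_gt) / len(d_gt)


def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    d_pred = _nearest_distances(pred, gt)
    d_gt = _nearest_distances(gt, pred)
    precision = sum(d <= tau for d in d_pred) / len(d_pred)
    recall = sum(d <= tau for d in d_gt) / len(d_gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def axis_error_deg(pred_axis, gt_axis):
    """Angle in degrees between two joint axes, ignoring axis direction sign."""
    dot = abs(sum(a * b for a, b in zip(pred_axis, gt_axis)))
    norm = math.hypot(*pred_axis) * math.hypot(*gt_axis)
    return math.degrees(math.acos(min(1.0, dot / norm)))


# Toy point sets: the second predicted point is 0.1 away from ground truth.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]
cd = chamfer_distance(pred, gt)
fs_tight = f_score(pred, gt, tau=0.05)
fs_loose = f_score(pred, gt, tau=0.2)
```

Lower Chamfer Distance and axis error are better, while a higher F-Score is better. The brute-force nearest-neighbor search here is quadratic in the number of points, so real evaluations typically use a spatial index.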

Technical Contributions and Innovations

Several technical contributions distinguish SPARK from prior approaches:

VLM Integration: The use of VLMs for initial URDF parameter estimation allows SPARK to leverage semantic understanding for structural parsing, setting a foundation for informed part synthesis.
Hierarchical Attention in DiT: By structuring the attention mechanism to accommodate both local part details and overarching global context, SPARK ensures coherent mesh generation that adheres to inferred kinematic structures.
Differentiable Joint Optimization: Refining joint parameters through differentiable forward kinematics and rendering ensures that generated objects not only look correct but also function accurately in simulated environments.
Generative Texture Modeling: The framework extends its capabilities to texture generation, providing visually realistic models that are ready for immediate use in robotics and simulation applications.
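The joint-refinement idea can be illustrated with a deliberately simplified sketch. It is not the paper's procedure: it replaces differentiable rendering with a point-correspondence loss, replaces automatic differentiation with central finite differences, and refines only the pivot of a revolute joint whose axis and opening angle are assumed known. All function names and values are hypothetical.

```python
import math


def rotate_about_axis(point, pivot, axis, angle):
    """Forward kinematics of a revolute joint via Rodrigues' rotation formula."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]
    v = [p - q for p, q in zip(point, pivot)]
    c, s = math.cos(angle), math.sin(angle)
    k_cross_v = [k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0]]
    k_dot_v = sum(a * b for a, b in zip(k, v))
    rotated = [v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1 - c)
               for i in range(3)]
    return [r + q for r, q in zip(rotated, pivot)]


def loss(pivot, axis, angle, closed_pts, open_pts):
    """Mean squared distance between articulated points and open-state targets."""
    total = 0.0
    for p, target in zip(closed_pts, open_pts):
        moved = rotate_about_axis(p, pivot, axis, angle)
        total += sum((m - t) ** 2 for m, t in zip(moved, target))
    return total / len(closed_pts)


def refine_pivot(pivot, axis, angle, closed_pts, open_pts,
                 lr=0.1, steps=200, eps=1e-4):
    """Gradient descent on the pivot using central finite differences."""
    p = list(pivot)
    for _ in range(steps):
        grad = []
        for i in range(3):
            hi, lo = p.copy(), p.copy()
            hi[i] += eps
            lo[i] -= eps
            grad.append((loss(hi, axis, angle, closed_pts, open_pts)
                         - loss(lo, axis, angle, closed_pts, open_pts))
                        / (2 * eps))
        p = [pi - lr * g for pi, g in zip(p, grad)]
    return p


# Synthetic example: a door that opens 90 degrees about a vertical hinge.
true_pivot, axis, angle = (0.5, -0.2, 0.0), (0.0, 0.0, 1.0), math.pi / 2
closed_pts = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.5)]
open_pts = [rotate_about_axis(p, true_pivot, axis, angle) for p in closed_pts]
refined = refine_pivot((0.0, 0.0, 0.0), axis, angle, closed_pts, open_pts)
```

In this toy setting the loss is quadratic in the pivot, so plain gradient descent recovers the hinge position in the plane perpendicular to the axis. The component along the axis is unobservable and stays at its initial value. Jointly refining joint type, axis, and origin from rendered images, as the paper describes, is a harder, non-convex problem.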

Implications and Future Directions

The implications of this research are significant for fields requiring detailed articulated models, such as robotics and virtual reality. By automating the reconstruction of complex 3D assets from simple images, SPARK can significantly reduce the labor and expertise traditionally required for such tasks.

Figure 4: In-the-wild results displaying SPARK's performance on diverse images, reinforcing its robustness and applicability to real-world scenarios.

Looking forward, the authors propose extending SPARK to more complex kinematic structures, such as multi-degree-of-freedom joints and compound mechanisms, which are common in real-world machinery and devices. This would make SPARK more useful in practical engineering and robotics applications.

Conclusion

SPARK advances the automatic generation of articulated 3D objects from single images. By combining VLM priors with a generative model and differentiable optimization, it addresses two challenges at once: high-fidelity shape reconstruction and accurate kinematic parameter estimation. The framework is well suited to building interactive and embodied AI systems, and the proposed extensions could broaden its applicability further.
