- The paper demonstrates that 3D Gaussian Splatting enables rapid, photorealistic scene reconstruction combined with visibility-aware semantic fusion for effective manipulation.
- It introduces a multi-stage geometry cleaning pipeline that converts noisy point clouds into watertight meshes, significantly enhancing collision-aware planning.
- Experimental results validate improved reconstruction speed, semantic accuracy, and real-world robotic manipulation success through integrated digital twin simulation.
High-Fidelity Digital Twin Construction for Robotic Manipulation via 3D Gaussian Splatting
Introduction and Motivation
The paper "A High-Fidelity Digital Twin for Robotic Manipulation Based on 3D Gaussian Splatting" (2601.03200) addresses key challenges in constructing actionable digital twins for robotic manipulation, focusing on the trade-off between visual fidelity, reconstruction speed, and physical utility. Prior digital twin pipelines based on mesh or point-cloud reconstruction lack fine geometric detail, while neural radiance fields (NeRFs) provide high fidelity at the cost of latency that is impractical for closed-loop robotic workflows.
This work leverages 3D Gaussian Splatting (3DGS) as a unified scene representation capable of photorealistic, rapid reconstruction from sparse RGB views. The core methodological innovation lies in enhancing 3DGS models with visibility-aware semantic fusion for reliable object identification, coupled with a multi-stage geometry cleaning pipeline that converts them into planning-ready, watertight meshes suitable for physics-based motion planning and real robot execution.
Figure 1: End-to-end pipeline: multi-view video input and 3DGS for scene reconstruction, Grounded-SAM for semantic mask lifting, and robust projection for digital twin formation enabling collision-aware robot planning.
Related Work Contextualization
The framework advances existing literature on scene reconstruction and digital twins by addressing the bottlenecks in prior methods. TSDF and voxel-based pipelines, while robust for navigation, lack the surface detail necessary for manipulation. Neural representations such as NeRF and Instant-NGP deliver superior view synthesis, but their ambiguous, non-watertight geometry and high computational demands hinder direct integration into planning. Existing 3DGS-based efforts have begun bridging this gap, e.g., Splat-Nav [chen2025splatnav] and RoboGSim [li2025robogsimreal2sim2realroboticgaussian], yet most lack the robust cleaning and semantic-labeling mechanisms required for manipulation-level fidelity.
The integration of foundation models (SAM, Grounded-SAM) for mask generation enables zero-shot, open vocabulary semantic segmentation, but naive lifting of 2D predictions creates semantic ambiguity and geometric noise. The presented visibility-aware semantic fusion approach systematizes multi-view geometric hypothesis filtering and achieves spatially reliable 3D object identification—an essential step for manipulation-centric digital twins.
Methodological Contributions
Rapid 3DGS Scene Reconstruction
Scene geometry is reconstructed via an optimized 3DGS pipeline. Each Gaussian primitive encodes spatial position, covariance, color, and opacity, providing a photorealistic, differentiable representation. For robust initialization under sparse views, the InstantSplat [fan2025instantsplatsparseviewgaussiansplatting] technique bypasses SfM by leveraging pretrained geometric priors for initial camera and point cloud estimation. Joint optimization proceeds to minimize multi-view photometric error rapidly, attaining high fidelity within minutes.
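A toy sketch of what each primitive carries and the photometric objective being minimized; the attribute names are assumptions, the loss mix (lambda = 0.2) follows the standard 3DGS formulation rather than this paper's exact implementation, and the D-SSIM term is omitted for brevity:

```python
import numpy as np

# Minimal sketch of a 3DGS primitive's learnable attributes (illustrative only;
# field names are assumptions, and full 3DGS stores SH coefficients for color).
class Gaussian:
    def __init__(self, mean, cov, color, opacity):
        self.mean = np.asarray(mean, dtype=float)    # 3D position
        self.cov = np.asarray(cov, dtype=float)      # 3x3 covariance (shape/orientation)
        self.color = np.asarray(color, dtype=float)  # RGB
        self.opacity = float(opacity)                # alpha in [0, 1]

def photometric_loss(rendered, target, lam=0.2):
    """Standard 3DGS-style loss: (1 - lam) * L1 + lam * D-SSIM.
    The D-SSIM term is omitted here for brevity."""
    l1 = np.abs(rendered - target).mean()
    return (1.0 - lam) * l1

g = Gaussian(mean=[0, 0, 1], cov=np.eye(3) * 0.01,
             color=[0.8, 0.2, 0.2], opacity=0.9)
loss = photometric_loss(np.zeros((4, 4, 3)), np.full((4, 4, 3), 0.5))
print(round(loss, 3))  # 0.4
```

During optimization, the means, covariances, colors, and opacities of all Gaussians are updated jointly by gradient descent on this loss summed over the input views.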
Visibility-Aware Semantic Lifting
Semantic lifting utilizes 2D object masks from vision foundation models (SAM, Grounded-SAM). To address mask "bleeding" and uncertain spatial assignment, the framework projects each 2D mask into 3D, clusters projected 3D points by depth using DBSCAN (density-based spatial clustering), and applies confidence-weighted voting. Only points within the largest, spatially consistent depth cluster receive high semantic confidence; spatial outliers are suppressed, yielding robust 3D semantic labeling even under occlusion or mask ambiguity. Iterative boundary refinement via KNN label smoothing ensures continuous, actionable semantic surfaces.
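A minimal sketch of the depth-clustering step described above, assuming DBSCAN over the back-projected depths of one 2D mask; the eps and min_samples values are illustrative, and the confidence-weighted voting and KNN boundary smoothing stages are omitted:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def filter_mask_by_depth(depths, eps=0.05, min_samples=5):
    """Keep only points in the largest depth cluster of a lifted 2D mask;
    everything else is treated as mask "bleeding" onto other surfaces."""
    d = np.asarray(depths, dtype=float).reshape(-1, 1)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(d)
    valid = labels[labels >= 0]
    if valid.size == 0:
        return np.zeros(len(d), dtype=bool)     # no spatially consistent cluster
    dominant = np.bincount(valid).argmax()      # largest depth-consistent cluster
    return labels == dominant

rng = np.random.default_rng(0)
# 50 points on the object (~1 m away) plus 10 "bled" background points (~3 m).
depths = np.concatenate([rng.normal(1.0, 0.01, 50), rng.normal(3.0, 0.01, 10)])
keep = filter_mask_by_depth(depths)
print(keep.sum())  # only the 50 object points survive
```

In the full pipeline, points surviving this filter accumulate confidence-weighted votes across views, so that an object is labeled only where multiple viewpoints agree.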
Point Cloud Cleaning and Mesh Generation
Raw 3DGS point clouds contain noise ("floaters," "ghost points," fuzzy boundaries) detrimental to collision checking and planning. The geometry cleaning is executed in three stages:
- Attribute Filtering: Low-opacity and anomalously stretched primitives are culled.
- Semantic-Guided Connectivity Pruning: DBSCAN-based cluster analysis retains only the dominant, connected geometric components per semantic partition, excising detached noise artifacts.
- Alpha Shapes Meshing: A shrink-wrapping algorithm tightly fits a watertight mesh over cleaned points, preserving sharp geometric features necessary for stable grasping.
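The first two stages above can be sketched as follows; all thresholds (min_opacity, max_anisotropy, DBSCAN parameters) are illustrative assumptions, and the third stage (alpha-shape meshing, e.g. via Open3D's create_from_point_cloud_alpha_shape) would run on the cleaned points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def clean_point_cloud(points, opacities, scales,
                      min_opacity=0.1, max_anisotropy=10.0,
                      eps=0.05, min_samples=5):
    """Stages 1-2 of the cleaning pipeline (all thresholds are assumptions)."""
    # Stage 1: attribute filtering -- cull transparent or needle-like primitives.
    anisotropy = scales.max(axis=1) / np.maximum(scales.min(axis=1), 1e-8)
    keep = (opacities >= min_opacity) & (anisotropy <= max_anisotropy)
    pts = points[keep]
    # Stage 2: connectivity pruning -- keep only the dominant DBSCAN component;
    # detached floaters end up in small clusters or as noise (label -1).
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    if (labels >= 0).any():
        dominant = np.bincount(labels[labels >= 0]).argmax()
        pts = pts[labels == dominant]
    return pts

rng = np.random.default_rng(0)
cloud = rng.normal(0.0, 0.01, size=(200, 3))     # dense object surface
floaters = rng.uniform(1.0, 2.0, size=(5, 3))    # detached "ghost" points
points = np.vstack([cloud, floaters])
opacities = np.full(205, 0.9)
opacities[:5] = 0.02                             # near-transparent primitives
scales = np.full((205, 3), 0.01)
scales[5, 0] = 0.5                               # one anomalously stretched Gaussian
cleaned = clean_point_cloud(points, opacities, scales)
print(len(cleaned))                              # floaters and bad primitives removed
```

In this synthetic example, stage 1 drops the five transparent points and the stretched one, and stage 2 discards the five floaters as DBSCAN noise, leaving only the dense surface cluster.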
Figure 2: Top shows raw point clouds with floaters and noise; bottom illustrates geometry after multi-stage filtering, suitable for planning.
Simulation and Planning Integration
The processed geometries are ingested into a physically accurate Unity-based digital twin. Semantic partitions map to collision geometry and rigid bodies for manipulation, with ROS2/Unity interface enabling real-time bidirectional state synchronization. MoveIt 2 leverages the cleaned mesh in collision-aware planning pipelines.


Figure 3: Unity digital twin illustrating high fidelity scene rendering and physical simulation.
Experimental Validation and Quantitative Results
Experiments utilize a Franka Emika Panda robotic arm with end-effector RGB-D sensing for scene acquisition. Eight objects spanning basic (convex, Lambertian), complex (thin, nonconvex), and highly textured categories are evaluated across metrics of reconstruction speed, visual fidelity (PSNR/SSIM), semantic segmentation accuracy, geometric cleaning efficacy, and real-world manipulation success.
- Reconstruction: 3DGS attains average scene reconstruction in 229 s, delivering 37.03 dB PSNR and 0.9821 SSIM—significantly outperforming NeRFs (1123 s, 28.33 dB, 0.9037).
- Semantic segmentation: Achieves 93.72% multi-view consistency under optimal voting thresholds, mIoU of 0.87 (2D) and 0.93 (3D).
- Geometry cleaning ablation: Combined attribute/clustering pipeline reduces Chamfer distance to 0.0020, elevating F1-score to 0.9989.
- Simulation-to-reality transfer: 100% planning success in the digital twin, 90% realized manipulation success on robot, zero collisions, average placement error below 1 cm.
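The geometric metrics above can be illustrated with a small sketch; this uses one common convention for symmetric Chamfer distance and a distance-thresholded F1 on point sets (the threshold tau is an assumption, and the paper's exact variants may differ):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both
    directions between point sets a and b."""
    d_ab = cKDTree(b).query(a)[0].mean()
    d_ba = cKDTree(a).query(b)[0].mean()
    return d_ab + d_ba

def f1_score_points(pred, gt, tau=0.01):
    """F1 from precision/recall at distance threshold tau (tau is an assumption)."""
    precision = (cKDTree(gt).query(pred)[0] < tau).mean()
    recall = (cKDTree(pred).query(gt)[0] < tau).mean()
    return 2 * precision * recall / max(precision + recall, 1e-12)

gt = np.random.default_rng(0).uniform(0, 1, size=(500, 3))
shifted = gt + 0.001                     # every point displaced by 0.001 per axis
print(round(chamfer_distance(gt, shifted), 4))  # roughly 2 * 0.001 * sqrt(3)
```

Lower Chamfer distance and higher F1 both indicate that the cleaned point cloud sits closer to the ground-truth surface, which is why the ablation tracks both.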

Figure 4: Real robot execution sequence validates the transfer from digital twin motion plan to physical manipulation.
Implications and Future Research Directions
This pipeline fundamentally reduces the perception-to-action bottleneck in robotic manipulation, delivering high-fidelity, semantically structured digital twins that are planning-ready and validated for real robot execution. Its utility is evidenced in zero-shot, multi-step rearrangement tasks in cluttered environments. The robustness of the cleaning process (attribute and connectivity denoising followed by alpha-shape meshing) establishes repeatable, high-precision geometric foundations for manipulation.
Future work should focus on (1) extending the pipeline to handle dynamic, non-static scenes using temporal 3DGS and online updating [zhou2023drivinggaussian], (2) enriching representations with online physical property estimation for complex interaction [cherian2024llmphycomplexphysicalreasoning], and (3) integrating learned, end-to-end grasp planning directly on Gaussian representations [ji2024graspsplats]. The tightly integrated digital twin creates a pathway for robust policy learning via simulated large-scale data generation, facilitating scalable robotics research and deployment with closed-loop safety guarantees.
Conclusion
The presented framework demonstrates a practical solution to sim-to-real transfer for robotic manipulation by unifying rapid, photorealistic scene reconstruction (via 3DGS) with robust semantic and geometric processing. Rigorous experimental validation confirms strong performance across all critical metrics including scene fidelity, semantic consistency, geometry cleanliness, and transfer efficacy to real robot platforms. The architecture establishes a new baseline for digital twin-based manipulation, and lays groundwork for fully autonomous, adaptive robotic systems in unstructured environments.