
A High-Fidelity Digital Twin for Robotic Manipulation Based on 3D Gaussian Splatting

Published 6 Jan 2026 in cs.RO | (2601.03200v1)

Abstract: Developing high-fidelity, interactive digital twins is crucial for enabling closed-loop motion planning and reliable real-world robot execution, which are essential to advancing sim-to-real transfer. However, existing approaches often suffer from slow reconstruction, limited visual fidelity, and difficulties in converting photorealistic models into planning-ready collision geometry. We present a practical framework that constructs high-quality digital twins within minutes from sparse RGB inputs. Our system employs 3D Gaussian Splatting (3DGS) for fast, photorealistic reconstruction as a unified scene representation. We enhance 3DGS with visibility-aware semantic fusion for accurate 3D labelling and introduce an efficient, filter-based geometry conversion method to produce collision-ready models seamlessly integrated with a Unity-ROS2-MoveIt physics engine. In experiments with a Franka Emika Panda robot performing pick-and-place tasks, we demonstrate that this enhanced geometric accuracy effectively supports robust manipulation in real-world trials. These results demonstrate that 3DGS-based digital twins, enriched with semantic and geometric consistency, offer a fast, reliable, and scalable path from perception to manipulation in unstructured environments.

Summary

  • The paper demonstrates that 3D Gaussian Splatting enables rapid, photorealistic scene reconstruction combined with visibility-aware semantic fusion for effective manipulation.
  • It introduces a multi-stage geometry cleaning pipeline that converts noisy point clouds into watertight meshes, significantly enhancing collision-aware planning.
  • Experimental results validate improved reconstruction speed, semantic accuracy, and real-world robotic manipulation success through integrated digital twin simulation.

High-Fidelity Digital Twin Construction for Robotic Manipulation via 3D Gaussian Splatting

Introduction and Motivation

The paper "A High-Fidelity Digital Twin for Robotic Manipulation Based on 3D Gaussian Splatting" (2601.03200) addresses key challenges in constructing actionable digital twins for robotic manipulation, focusing on the trade-off between visual fidelity, reconstruction speed, and physical utility. Prior digital twin pipelines relying on mesh or point cloud reconstruction exhibit limitations in geometric detail, and neural radiance fields (NeRFs) provide high fidelity at the cost of impractical latency for closed-loop robotic workflows.

This work leverages 3D Gaussian Splatting (3DGS) as a unified scene representation capable of photorealistic, rapid reconstruction from sparse RGB views. The core methodological innovation lies in enhancing 3DGS models with visibility-aware semantic fusion for reliable object identification, coupled with a multi-stage geometry cleaning pipeline that converts the representation into planning-ready, watertight meshes suitable for physics-based motion planning and real robot execution (Figure 1).

Figure 1: End-to-end pipeline: multi-view video input and 3DGS for scene reconstruction, Grounded-SAM for semantic mask lifting, and robust projection for digital twin formation enabling collision-aware robot planning.

The framework advances existing literature on scene reconstruction and digital twins by addressing the bottlenecks in prior methods. TSDF and voxel-based pipelines, while robust for navigation, lack the surface detail necessary for manipulation. Neural representations such as NeRF and Instant-NGP deliver superior view synthesis but preclude direct integration into planning due to ambiguous, non-watertight geometry and high computational demands. Existing 3DGS-based efforts, e.g. Splat-Nav [chen2025splatnav] and RoboGSim [li2025robogsimreal2sim2realroboticgaussian], have begun bridging this gap, yet most lack robust mechanisms for the cleaning and semantic labeling required for manipulation-level fidelity.

The integration of foundation models (SAM, Grounded-SAM) for mask generation enables zero-shot, open vocabulary semantic segmentation, but naive lifting of 2D predictions creates semantic ambiguity and geometric noise. The presented visibility-aware semantic fusion approach systematizes multi-view geometric hypothesis filtering and achieves spatially reliable 3D object identification—an essential step for manipulation-centric digital twins.

Methodological Contributions

Rapid 3DGS Scene Reconstruction

Scene geometry is reconstructed via an optimized 3DGS pipeline. Each Gaussian primitive encodes spatial position, covariance, color, and opacity, providing a photorealistic, differentiable representation. For robust initialization under sparse views, the InstantSplat [fan2025instantsplatsparseviewgaussiansplatting] technique bypasses SfM by leveraging pretrained geometric priors for initial camera and point cloud estimation. Joint optimization then rapidly minimizes the multi-view photometric error, attaining high fidelity within minutes.
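
This summary does not give the exact objective, but 3DGS implementations typically minimize a per-view photometric error (often an L1 term mixed with D-SSIM). A minimal, illustrative sketch of the L1 component; the function name and toy images are ours:

```python
import numpy as np

def photometric_l1(rendered: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute per-pixel error between a rendered view and its
    ground-truth RGB image; values are assumed to lie in [0, 1]."""
    return float(np.mean(np.abs(rendered - target)))

# Toy check: a render offset from its target by a constant 0.1.
target = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)
loss = photometric_l1(rendered, target)
```

In practice this error is summed over all training views and backpropagated through the differentiable rasterizer to update every Gaussian's position, covariance, color, and opacity.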

Visibility-Aware Semantic Lifting

Semantic lifting utilizes 2D object masks from vision foundation models (SAM, Grounded-SAM). To address mask "bleeding" and uncertain spatial assignment, the framework projects each 2D mask into 3D, clusters projected 3D points by depth using DBSCAN (density-based spatial clustering), and applies confidence-weighted voting. Only points within the largest, spatially consistent depth cluster receive high semantic confidence; spatial outliers are suppressed, yielding robust 3D semantic labeling even under occlusion or mask ambiguity. Iterative boundary refinement via KNN label smoothing ensures continuous, actionable semantic surfaces.
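
The depth-clustering gate can be sketched with scikit-learn's DBSCAN as a stand-in; the helper name and the eps/min_samples thresholds below are illustrative assumptions, not the paper's values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def filter_mask_points_by_depth(depths, eps=0.05, min_samples=5):
    """Keep only the points of a lifted 2D mask that fall in the largest
    depth cluster; spatial outliers (e.g. mask bleeding onto the
    background) receive no semantic label. Returns a boolean keep-mask."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        np.asarray(depths).reshape(-1, 1))
    valid = labels[labels >= 0]
    if valid.size == 0:                      # no dense depth cluster found
        return np.zeros(len(depths), dtype=bool)
    dominant = np.bincount(valid).argmax()   # largest consistent cluster
    return labels == dominant

# Toy mask: 30 depths on the object (~0.5 m), 10 bled onto a wall (~2 m).
depths = np.concatenate([np.linspace(0.49, 0.51, 30),
                         np.linspace(1.99, 2.01, 10)])
keep = filter_mask_points_by_depth(depths)
```

In the full pipeline this per-view gate would then feed the confidence-weighted voting across views and the KNN label smoothing at object boundaries.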

Point Cloud Cleaning and Mesh Generation

Raw 3DGS point clouds contain noise ("floaters," "ghost points," fuzzy boundaries) detrimental to collision checking and planning. The geometry cleaning is executed in three stages:

  1. Attribute Filtering: Low-opacity and anomalously stretched primitives are culled.
  2. Semantic-Guided Connectivity Pruning: DBSCAN-based cluster analysis retains only the dominant, connected geometric components per semantic partition, excising detached noise artifacts.
  3. Alpha Shapes Meshing: A shrink-wrapping algorithm tightly fits a watertight mesh over the cleaned points, preserving the sharp geometric features necessary for stable grasping (Figure 2).

    Figure 2: Top shows raw point clouds with floaters and noise; bottom illustrates geometry after multi-stage filtering, suitable for planning.
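
Stages 1 and 2 can be sketched as follows; the thresholds, the anisotropy test, and the function name are illustrative assumptions, and the alpha-shape meshing of stage 3 (typically done with a geometry library such as Open3D) is omitted:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def clean_gaussian_points(xyz, opacity, scales,
                          min_opacity=0.1, max_anisotropy=10.0,
                          eps=0.02, min_samples=10):
    """Stage 1: cull low-opacity and anomalously stretched primitives.
    Stage 2: keep only the dominant connected component (DBSCAN),
    excising detached floaters and ghost points."""
    aniso = scales.max(axis=1) / np.maximum(scales.min(axis=1), 1e-8)
    keep = (opacity >= min_opacity) & (aniso <= max_anisotropy)
    pts = xyz[keep]
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    valid = labels[labels >= 0]
    if valid.size == 0:
        return pts[:0]                       # nothing dense enough survived
    return pts[labels == np.bincount(valid).argmax()]

# Toy data: a dense 6x6x6 grid "object" (216 pts), 12 detached floaters,
# and 5 low-opacity ghosts that stage 1 should cull outright.
g = np.linspace(-0.01, 0.01, 6)
blob = np.stack(np.meshgrid(g, g, g), axis=-1).reshape(-1, 3)
floaters = np.full((12, 3), 0.5)
ghosts = np.full((5, 3), 0.25)
xyz = np.vstack([blob, floaters, ghosts])
opacity = np.concatenate([np.full(228, 0.9), np.full(5, 0.02)])
scales = np.full((233, 3), 0.01)
cleaned = clean_gaussian_points(xyz, opacity, scales)
```

In the paper's pipeline the stage-2 pruning is applied per semantic partition, so each labeled object keeps its own dominant component.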

Simulation and Planning Integration

The processed geometries are ingested into a physically accurate Unity-based digital twin. Semantic partitions map to collision geometry and rigid bodies for manipulation, with a ROS2-Unity interface enabling real-time bidirectional state synchronization. MoveIt 2 leverages the cleaned mesh in collision-aware planning pipelines (Figure 3).

Figure 3: Unity digital twin illustrating high-fidelity scene rendering and physical simulation.

Experimental Validation and Quantitative Results

Experiments utilize a Franka Emika Panda robotic arm with end-effector RGB-D sensing for scene acquisition. Eight objects spanning basic (convex, Lambertian), complex (thin, nonconvex), and highly textured categories are evaluated across metrics of reconstruction speed, visual fidelity (PSNR/SSIM), semantic segmentation accuracy, geometric cleaning efficacy, and real-world manipulation success.

  • Reconstruction: 3DGS attains average scene reconstruction in 229 s, delivering 37.03 dB PSNR and 0.9821 SSIM—significantly outperforming NeRFs (1123 s, 28.33 dB, 0.9037).
  • Semantic segmentation: Achieves 93.72% multi-view consistency under optimal voting thresholds, mIoU of 0.87 (2D) and 0.93 (3D).
  • Geometry cleaning ablation: Combined attribute/clustering pipeline reduces Chamfer distance to 0.0020, elevating F1-score to 0.9989.
  • Simulation-to-reality transfer: 100% planning success in the digital twin, 90% manipulation success on the real robot, zero collisions, and average placement error below 1 cm (Figure 4).
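
For reference, the Chamfer distance reported in the geometry ablation can be computed as below; conventions differ (squared vs. unsquared, summed vs. averaged), and the paper's exact variant is not specified in this summary:

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbour distance from a to b plus from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

# Identical clouds score 0; a 0.1 m shift along x scores 0.1 + 0.1 = 0.2.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
shifted = pts + np.array([0.1, 0.0, 0.0])
```

The brute-force pairwise matrix is fine for small clouds; large reconstructions would use a KD-tree nearest-neighbour query instead.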

Figure 4: Real robot execution sequence validates the transfer from digital twin motion plan to physical manipulation.

Implications and Future Research Directions

This pipeline fundamentally reduces the perception-to-action bottleneck in robotic manipulation, delivering high-fidelity, semantically structured digital twins that are planning-ready and validated for real robot execution. Its utility is evidenced in zero-shot, multi-step rearrangement tasks in cluttered environments. The robustness of the cleaning process (attribute and connectivity denoising followed by alpha-shape meshing) establishes a repeatable, high-precision geometric foundation for manipulation.

Future work should focus on (1) extending the pipeline to handle dynamic, non-static scenes using temporal 3DGS and online updating [zhou2023drivinggaussian], (2) enriching representations with online physical property estimation for complex interaction [cherian2024llmphycomplexphysicalreasoning], and (3) integrating learned, end-to-end grasp planning directly on Gaussian representations [ji2024graspsplats]. The tightly integrated digital twin creates a pathway for robust policy learning via simulated large-scale data generation, facilitating scalable robotics research and deployment with closed-loop safety guarantees.

Conclusion

The presented framework demonstrates a practical solution to sim-to-real transfer for robotic manipulation by unifying rapid, photorealistic scene reconstruction (via 3DGS) with robust semantic and geometric processing. Rigorous experimental validation confirms strong performance across all critical metrics including scene fidelity, semantic consistency, geometry cleanliness, and transfer efficacy to real robot platforms. The architecture establishes a new baseline for digital twin-based manipulation, and lays groundwork for fully autonomous, adaptive robotic systems in unstructured environments.
