Differentiable Inverse Graphics for Zero-shot Scene Reconstruction and Robot Grasping

Published 4 Feb 2026 in cs.RO, cs.AI, cs.CV, and cs.LG | (2602.05029v1)

Abstract: Operating effectively in novel real-world environments requires robotic systems to estimate and interact with previously unseen objects. Current state-of-the-art models address this challenge by using large amounts of training data and test-time samples to build black-box scene representations. In this work, we introduce a differentiable neuro-graphics model that combines neural foundation models with physics-based differentiable rendering to perform zero-shot scene reconstruction and robot grasping without relying on any additional 3D data or test-time samples. Our model solves a series of constrained optimization problems to estimate physically consistent scene parameters, such as meshes, lighting conditions, material properties, and 6D poses of previously unseen objects from a single RGBD image and bounding boxes. We evaluated our approach on standard model-free few-shot benchmarks and demonstrated that it outperforms existing algorithms for model-free few-shot pose estimation. Furthermore, we validated the accuracy of our scene reconstructions by applying our algorithm to a zero-shot grasping task. By enabling zero-shot, physically-consistent scene reconstruction and grasping without reliance on extensive datasets or test-time sampling, our approach offers a pathway towards more data efficient, interpretable and generalizable robot autonomy in novel environments.


Authors (6)

Summary

The paper introduces a differentiable inverse graphics (DIG) framework that performs zero-shot scene reconstruction and robotic grasping without extensive 3D training data.
It employs a hybrid neural-physics model that uses ellipsoidal approximation, mesh optimization, and differentiable rendering to refine scene parameters.
On benchmarks such as FewSOL and LINEMOD-OCCLUDED it outperforms existing few-shot pose estimation methods, and it achieves an 89.28% success rate on a real-robot zero-shot grasping task.


Introduction and Background

The paper "Differentiable Inverse Graphics for Zero-shot Scene Reconstruction and Robot Grasping" (2602.05029) presents an approach for robotic systems that must operate in novel, unstructured environments. It uses differentiable inverse graphics (DIG) models to perform zero-shot scene reconstruction and robotic grasping without additional 3D training data or test-time samples.

Models that depend on large training datasets and many test-time samples often generalize poorly to new environments, which motivates more data-efficient alternatives. The DIG approach combines neural foundation models with physics-based differentiable rendering, so that pose estimation and robotic manipulation need only a single RGBD image and bounding boxes as input.

Figure 1: The system captures an RGBD image, segments it, refines using a renderer, and enables accurate grasping of novel objects.

Methodology

System Overview

Starting from a single RGBD image, the system uses a hybrid neural-physics model for scene reconstruction and robotic manipulation. Object shapes are first approximated with ellipsoidal primitives and then refined through mesh optimization. The resulting scene representation is used to simulate and select robotic grasping strategies.
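As a rough illustration of the first step, an ellipsoidal primitive can be initialized by matching moments of the 3D points back-projected inside one bounding box. The sketch below is not the paper's estimator: the function name `fit_axis_aligned_ellipsoid`, the axis-aligned restriction, and the variance-to-radius factor of 3 are all assumptions made for illustration.

```python
import math


def fit_axis_aligned_ellipsoid(points):
    """Moment-matching fit of an axis-aligned ellipsoid (hypothetical helper).

    points: list of (x, y, z) tuples, e.g. depth pixels back-projected to 3D.
    Returns (center, semi_axes) as two lists of three floats.

    Assumption: the points look like an axis-scaled copy of a uniform sample
    of the unit sphere surface, for which the per-axis variance is a**2 / 3.
    """
    n = len(points)
    center = [sum(p[k] for p in points) / n for k in range(3)]
    variances = [
        sum((p[k] - center[k]) ** 2 for p in points) / n for k in range(3)
    ]
    semi_axes = [math.sqrt(3.0 * v) for v in variances]
    return center, semi_axes


if __name__ == "__main__":
    # Six points at the ends of the three axes of a small ellipsoid.
    pts = [
        (1.5, 2.0, 3.0), (0.5, 2.0, 3.0),
        (1.0, 2.2, 3.0), (1.0, 1.8, 3.0),
        (1.0, 2.0, 3.1), (1.0, 2.0, 2.9),
    ]
    print(fit_axis_aligned_ellipsoid(pts))
```

A depth camera sees only the front of an object, so a fit like this is biased toward the visible side; it is meant only as a starting point for the refinement stages described above.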

Figure 2: Hybrid Neural-Physics pipeline for zero-shot scene reconstruction and manipulation.

Robust Optimization

Key contributions of this work include a probabilistic ellipsoid estimation technique that incorporates physically consistent priors to handle sensor noise, and a differentiable ray-tracing engine. Used together in a coarse-to-fine strategy, they help the optimization avoid local minima and make scene reconstruction more robust, which the paper reports as outperforming existing few-shot pose estimation models.
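The summary does not give the form of these priors. As a generic illustration of how a prior tempers noisy measurements, the sketch below computes the maximum a posteriori (MAP) estimate of a single scalar, for example one ellipsoid semi-axis in metres, under a Gaussian prior and Gaussian sensor noise. The function name, the Gaussian assumptions, and all numbers are hypothetical.

```python
def map_estimate(measurements, noise_var, prior_mean, prior_var):
    """MAP estimate of a scalar with a Gaussian prior and Gaussian noise.

    For this conjugate pair the posterior is Gaussian, so the MAP estimate
    equals the posterior mean: a precision-weighted average of the prior
    mean and the measurements.
    """
    n = len(measurements)
    precision = 1.0 / prior_var + n / noise_var
    weighted_sum = prior_mean / prior_var + sum(measurements) / noise_var
    return weighted_sum / precision


if __name__ == "__main__":
    # Hypothetical noisy semi-axis readings (metres) and a prior of 5 cm.
    readings = [0.050, 0.058, 0.041]
    print(map_estimate(readings, noise_var=1e-4,
                       prior_mean=0.05, prior_var=4e-4))
```

With few or noisy measurements the estimate stays close to the prior mean; as measurements accumulate, it approaches their average.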

The physics-based differentiable renderer also improves interpretability and generalizability: the parameters it optimizes, such as lighting and materials, are physically meaningful and are used to refine the initial scene estimate.
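The idea of refining a material parameter by rendering and comparing can be sketched with a deliberately tiny forward model. Below, a single Lambertian albedo is fitted by gradient descent so that shaded intensities match observations. The Lambertian model, the known light direction, the function names, and the step size are assumptions for illustration; the paper's renderer is a physics-based ray tracer, not this one-line shading rule.

```python
def lambert_shading(normal, light_dir):
    """Clamped cosine term max(0, n . l) for unit vectors n and l."""
    return max(0.0, sum(n * l for n, l in zip(normal, light_dir)))


def fit_albedo(normals, light_dir, observed, init=0.5, lr=0.5, steps=200):
    """Fit one scalar albedo by gradient descent on the mean squared error
    between rendered intensities (albedo * shading) and observed ones."""
    shading = [lambert_shading(n, light_dir) for n in normals]
    albedo = init
    for _ in range(steps):
        residuals = [albedo * s - o for s, o in zip(shading, observed)]
        # Derivative of mean((albedo * s - o)**2) with respect to albedo.
        grad = 2.0 * sum(s * r for s, r in zip(shading, residuals)) / len(observed)
        albedo -= lr * grad
    return albedo


if __name__ == "__main__":
    light = (0.0, 0.0, 1.0)
    normals = [(0.0, 0.0, 1.0), (0.0, 0.6, 0.8), (0.6, 0.0, 0.8)]
    true_albedo = 0.6  # hypothetical ground truth used to synthesize data
    observed = [true_albedo * lambert_shading(n, light) for n in normals]
    print(fit_albedo(normals, light, observed))
```

Albedo and light intensity multiply each other in this model, so only their product is identifiable from intensities alone; the sketch fixes the light intensity to 1 for that reason.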

Differentiable Rendering

The differentiable rendering pipeline is built from JAX primitives and optimizes scene parameters efficiently with GPU acceleration. A binary mask is piecewise constant, so its gradient with respect to scene parameters is zero almost everywhere; the system resolves this zero-gradient problem with a soft-mask function based on depth images, forgoing conventional pixel-mesh correspondences.
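The zero-gradient problem can be seen on a single pixel. A hard mask that thresholds rendered depth against a far plane is flat almost everywhere, while a sigmoid relaxation has a usable slope near the object boundary. The sigmoid form, the `far` threshold, and the temperature `tau` are assumptions; the summary does not specify the paper's soft-mask function, and the sketch uses plain Python rather than JAX so that the derivative is written out explicitly.

```python
import math


def hard_mask(depth, far):
    """Binary mask: 1.0 if the pixel's depth is closer than `far`, else 0.0."""
    return 1.0 if depth < far else 0.0


def soft_mask(depth, far, tau):
    """Sigmoid relaxation of hard_mask; smaller tau gives a sharper edge."""
    return 1.0 / (1.0 + math.exp((depth - far) / tau))


def soft_mask_grad(depth, far, tau):
    """Analytic derivative of soft_mask with respect to depth."""
    m = soft_mask(depth, far, tau)
    return -m * (1.0 - m) / tau


def finite_diff(f, x, eps=1e-6):
    """Central finite difference, used here only as a numerical check."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


if __name__ == "__main__":
    far, tau = 2.0, 0.1
    for d in (1.0, 1.9, 2.0, 2.1):
        print(d,
              finite_diff(lambda x: hard_mask(x, far), d),
              soft_mask_grad(d, far, tau))
```

Away from the threshold the hard mask's finite difference is exactly zero, which gives an optimizer nothing to follow, whereas the soft mask's derivative is nonzero and largest in magnitude at the boundary.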

Figure 3: Differentiable rendering pipeline illustrating optimization from scene parameters to RGBD outputs.

Zero-shot Grasping Validation

In practical applications, the system performs robustly in zero-shot grasping tasks, achieving high success rates across a variety of objects without pre-training on specific datasets. Differentiable rendering combined with robust optimization offers a viable way to deploy robotic systems in novel environments while avoiding the hallucination risks of purely learned models in high-stakes scenarios.

Results and Comparative Analysis

The evaluation covers several benchmarks, including the FewSOL, MOPED, and LINEMOD-OCCLUDED datasets. The system consistently achieves better zero-shot pose estimation than few-shot baselines such as Gen6D and OnePose++. Its robustness to occlusion and clutter, together with substantial improvements in reconstruction accuracy and execution time, underscores its effectiveness.

Figure 4: Zero-shot pose estimation results showing high accuracy across multiple datasets.

Early results indicate that the methodology achieves an 89.28% success rate in zero-shot grasping tasks within real robotic setups, significantly outperforming data-driven baselines.

Figure 5: The robust optimization process visualized across optimization steps, demonstrating precise scene reconstruction.

Conclusion

The differentiable inverse graphics techniques presented in this paper emphasize data efficiency, interpretability, and broad applicability for robotic systems. Optimization speed and reliance on bounding boxes remain limitations; future work aims to incorporate faster inference mechanisms and more autonomous detection methods.

Overall, the paper shows a promising direction for zero-shot autonomous robotic manipulation, in which robots interact with unstructured environments using minimal prior information.
