3D-GRAND: 3D Dataset & Multi-Agent SLAM
- 3D-GRAND denotes two related contributions: a large-scale, instruction-tuned dataset for 3D-LLMs and an extension of multi-agent SLAM based on Gaussian splatting.
- The dataset features 40,087 3D scenes with 6.2 million instruction pairs, with human-validated grounding accuracy of 92–95%, and its dense annotations significantly reduce hallucination in models trained on it.
- The multi-agent SLAM component leverages submap tracking, loop closures, and pose-graph optimization to boost indoor rendering fidelity and reduce trajectory errors in large-scale environments.
3D-GRAND encompasses two distinct but thematically related research contributions spanning large-scale datasets for 3D language-vision models and advanced multi-agent SLAM systems based on Gaussian splatting. In context, "3D-GRAND" refers to: (i) a million-scale dataset and benchmarking suite for instruction-tuned 3D-LLMs targeting robust 3D grounding and hallucination reduction (Yang et al., 2024); and (ii) an abbreviation for the 3D extension of GRAND-SLAM—an architecture for scalable, globally consistent, multi-agent Gaussian SLAM (Thomas et al., 23 Jun 2025).
1. 3D-GRAND: Definition and Scope
3D-GRAND, as introduced in (Yang et al., 2024), is a large-scale dataset specifically designed to bridge language and 3D perception for embodied AI. It contains 40,087 richly tagged household scenes (mainly from 3D-FRONT and Structured3D) and 6.2 million densely-grounded scene-language instruction pairs. The dataset enables instruction-tuning of 3D LLMs (3D-LLMs), supporting dense and faithful grounding of noun phrases to 3D objects, and provides a rigorous evaluation benchmark (3D-POPE) for systematic hallucination assessment.
In the context of multi-agent SLAM, "3D-GRAND" (an editorial shorthand for "Gaussian Reconstruction via Multi-Agent Dense SLAM") refers to an extension of the GRAND-SLAM framework for collaborative scene reconstruction using 3D Gaussian splatting, tailored for both indoor and large-scale outdoor environments (Thomas et al., 23 Jun 2025).
2. Dataset Construction and Statistics
The 3D-GRAND dataset construction pipeline in (Yang et al., 2024) consists of several distinct stages:
- Scene Collection: 38,000+ scenes are drawn from the 3D-FRONT and Structured3D repositories, featuring diverse room types. Meshes/layouts are rendered into colored point clouds via multi-view fusion (Blender-based pipeline), with spatial separation enforced by layout masks.
- Densely-grounded Annotation: For each scene, the following procedure is executed:
- Object Extraction: 2D object crops are generated; if ground truth is absent, "set-of-mark" GPT-4V prompts are used.
- Attribute Tagging: Object category, color, finish, and texture are inferred using open-vocabulary GPT-4V processing.
- Scene-Graph Assembly: Each object is represented as a JSON entry with (category, centroid, extents, attributes).
- Instruction Generation: GPT-4/GPT-4-Turbo templates produce eight classes of scene-language tasks, including grounded object reference, scene description, question answering (existence, spatial, count), and multi-turn or landmark-driven variants.
- Filtering and Augmentation: Low-quality or hallucinated outputs are filtered, and phrase-to-object tags are augmented via prompt-based approaches.
- Human Validation: 10,200 scene-annotation pairs are validated by three raters per sample on Hive.ai (≥2/3 agreement); text truthfulness is 85–88% and grounding accuracy is 92–95%.
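The scene-graph assembly stage above can be sketched as building one JSON entry per object. The field names and rounding below are illustrative assumptions; the released dataset defines the exact schema:

```python
import json

def make_scene_graph_entry(obj_id, category, centroid, extents, attributes):
    """Assemble one object entry of the per-scene scene graph.

    centroid -- object center in scene coordinates (x, y, z)
    extents  -- axis-aligned bounding-box size (x, y, z)
    attributes -- open-vocabulary tags, e.g. color / finish / texture
    """
    return {
        "id": obj_id,
        "category": category,
        "centroid": [round(c, 3) for c in centroid],
        "extents": [round(e, 3) for e in extents],
        "attributes": attributes,
    }

scene_graph = [
    make_scene_graph_entry(12, "chair", (1.2, 0.0, 3.4), (0.5, 0.9, 0.5),
                           {"color": "blue", "finish": "matte"}),
    make_scene_graph_entry(13, "table", (1.8, 0.0, 3.1), (1.2, 0.7, 0.8),
                           {"color": "oak", "texture": "wood grain"}),
]
context = json.dumps(scene_graph)  # serialized scene graph, prepended to the LLM prompt
```

Serializing the whole per-scene object list this way is what allows every noun phrase in the generated instruction text to be tagged back to a concrete object id.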
The key descriptive statistics are summarized below:
| Attribute | Value |
|---|---|
| Scenes | 40,087 |
| Instruction pairs | 6.20 million |
| Avg. pairs per scene | 155 |
| Dense grounding | 100% of noun phrases → 3D objects |
| Tasks/splits | 8 task variants |
| Avg. objects per scene | 86.4 |
| Unique categories | 182 |
| Real-scan test set | 1,400 ScanNet scenes (zero-shot) |
This resource aims to provide explicit, high-quality grounding information critical for instruction-tuned 3D-LLMs.
3. Model Architectures, Training, and Losses
The primary instruction-tuned 3D-LLM architecture described in (Yang et al., 2024) consists of:
- Backbone: Llama-2 (70B parameters), equipped with LoRA adapters for efficient fine-tuning.
- Input Fusion: Object-centric 3D scene-graph descriptions prepended as JSON tokens (e.g., `<obj id=12,cat=chair,x=1.2,...>`).
- Query Encoding: Instructions or questions appended to the textual context.
- Output: Free-form response with explicit inline grounding tags (e.g., `<ground obj=12>`).
The training objective incorporates three loss terms:
- $\mathcal{L}_{\text{LM}}$: standard cross-entropy on the target sequence.
- $\mathcal{L}_{\text{ground}}$: negative log-likelihood over phrase-object alignment scores, minimized when each noun phrase is matched to its annotated 3D object.
- $\mathcal{L}_{\text{hallu}}$: BCE loss penalizing hallucinated object mentions in negative examples.
The total objective is a weighted sum of the three terms; the weighting coefficients for the grounding and hallucination losses are given in (Yang et al., 2024).
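A minimal sketch of how the three terms combine — the weight values here are placeholders, not those reported in (Yang et al., 2024):

```python
import math

def bce(p, y):
    """Binary cross-entropy for a single (predicted probability, label) pair,
    as used to penalize hallucinated object mentions on negative examples."""
    eps = 1e-9
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def total_loss(l_lm, l_ground, l_hallu, w_ground=0.5, w_hallu=0.5):
    """Weighted sum of the three training objectives.

    l_lm     -- cross-entropy on the target token sequence
    l_ground -- NLL over phrase-to-object alignment scores
    l_hallu  -- BCE on hallucinated mentions (e.g., averaged bce(...) values)
    w_*      -- placeholder coefficients (hypothetical values)
    """
    return l_lm + w_ground * l_ground + w_hallu * l_hallu
```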
Explicit phrase-to-object tags are crucial for minimizing hallucination and improving faithfulness in object reference.
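Given the inline tag format above, recovering the referenced 3D objects from a model response is a simple parsing step. This sketch assumes the simplified `<ground obj=N>` syntax shown in the example; the actual tokenization may differ:

```python
import re

GROUND_TAG = re.compile(r"<ground obj=(\d+)>")

def grounded_objects(response: str) -> list[int]:
    """Extract the 3D object ids referenced by inline grounding tags,
    in order of appearance."""
    return [int(m) for m in GROUND_TAG.findall(response)]

reply = "The <ground obj=12> blue chair is next to the <ground obj=7> oak table."
ids = grounded_objects(reply)  # → [12, 7]
```

Because every grounded mention resolves to a scene-graph id, faithfulness can be checked mechanically: any id not present in the scene graph is a hallucinated reference.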
4. 3D-POPE Benchmark and Quantitative Evaluation
3D-POPE (3D Polling-based Object Probing Evaluation) is introduced as a benchmark for standardized, compositional hallucination evaluation in 3D-LLMs (Yang et al., 2024). The design comprises:
- Dataset: Derived from the ScanNet200 validation set (141 scenes, 200 classes).
- Task Format: Each sample is a (scene, object category Q, binary answer) triple, balanced 1:1 positive:negative.
- Negative Sampling: Random, popular (top-k most frequent absent categories), and adversarial (absent category with the highest co-occurrence with objects in the scene).
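The three negative-sampling strategies can be sketched as follows. This is an illustrative reconstruction: the benchmark precomputes frequency and co-occurrence statistics over ScanNet200, whereas here they are derived from a passed-in scene collection:

```python
import random
from collections import Counter

def sample_negative(scene_objects, all_scenes, strategy="random", k=10, rng=None):
    """Pick one object category that is ABSENT from the scene.

    random      -- uniform over absent categories
    popular     -- among the k globally most frequent categories, pick an absent one
    adversarial -- absent category co-occurring most often with the scene's objects
    """
    rng = rng or random.Random(0)
    present = set(scene_objects)
    freq = Counter(c for scene in all_scenes for c in set(scene))
    if strategy == "random":
        absent = [c for c in freq if c not in present]
        return rng.choice(absent)
    if strategy == "popular":
        for cat, _ in freq.most_common(k):
            if cat not in present:
                return cat
    # adversarial: among scenes sharing an object with this one,
    # count the absent categories they contain and take the most frequent
    cooc = Counter()
    for scene in all_scenes:
        cats = set(scene)
        if cats & present:
            cooc.update(cats - present)
    return cooc.most_common(1)[0][0]
```

Pairing each positive (present-object) question with one such negative keeps the benchmark balanced 1:1, so a model that always answers "yes" scores 50% accuracy.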
Evaluation uses standard metrics (precision, recall, F1, accuracy, and yes-rate); all models are tested zero-shot on ScanNet to ensure strict comparability.
Representative Results
| Model | Precision | Recall | F1 | Accuracy | Yes (%) |
|---|---|---|---|---|---|
| Random Baseline | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 |
| 3D-LLM | 50.03 | 99.88 | 66.67 | 50.07 | 99.81 |
| 3D-VisTA | 50.12 | 53.58 | 51.79 | 49.66 | 53.95 |
| LEO | 51.95 | 77.65 | 62.25 | 52.91 | 74.73 |
| 3D-GRAND | 93.34 | 84.25 | 88.56 | 89.12 | 45.13 |
- Precision for 3D-GRAND exceeds 93% under random negative sampling and remains ≥70% across all negative sampling regimes, outperforming all prior baselines (≤52%).
On ScanRefer zero-shot (sim-to-real), 3D-GRAND achieves 38.0% Acc@0.25 (compared to 30.3% for 3D-LLM and 17.1% for LLM-Grounder).
A data-scaling effect is observed: Acc@0.25 grows roughly linearly from ≈20% to ≈38% as training data increases from 0.2M to 6.2M examples; hallucination drops from ≈30% to ≈7%.
5. 3D-GRAND in Multi-Agent Gaussian SLAM
The 3D-GRAND approach constitutes the 3D extension of GRAND-SLAM's collaborative multi-agent Gaussian splatting-based SLAM (Thomas et al., 23 Jun 2025). The methodology integrates:
- 3D Gaussian Splatting for Scene Representation: Scene geometry and appearance are modeled as collections of explicit, anisotropic 3D Gaussians (each parameterized by a mean, covariance, opacity, and color); rendering accumulates radiance along view rays over the depth-sorted Gaussians.
- Submap-Based Local Tracking and Optimization: Each robot maintains overlapping submaps, seeded adaptively in under-observed regions and incrementally optimized via photometric and geometric rendering losses across multiple keyframes. Frame-to-map registration uses a hybrid color-depth odometry, refined through differentiable rendering alignment.
- Loop Closure and Pose-Graph Optimization: Both intra- and inter-agent loop closures are established via NetVLAD feature matching, followed by initial pose estimation, colored point cloud ICP refinement, and constraint gating (based on fitness and RMSE). The resulting pose-graph (nodes: keyframes; edges: tracking/loops) is optimized (e.g., via GTSAM Levenberg–Marquardt), updating all submap origins and globally aligning Gaussians.
- Performance Highlights:
- On Replica indoor datasets, 3D-GRAND (as GRAND-SLAM) yields a 28% improvement in RGB PSNR over prior strong baselines, and trajectory errors down to 0.25–0.27 cm.
- On outdoor Kimera-Multi sequences (1.85 km), it exhibits a 91% reduction in tracking error (ATE RMSE: 4.99 m vs. 60.79 m) and maintains rendering fidelity (PSNR ≈28 dB, SSIM ≈0.97, LPIPS ≈0.11), substantially surpassing Gaussian-SLAM and MAGiC-SLAM.
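The per-ray radiance accumulation used by the Gaussian-splatting representation above can be sketched as front-to-back alpha compositing. This toy version assumes the per-ray opacities and colors of the intersected Gaussians have already been evaluated and depth-sorted:

```python
def composite_ray(samples):
    """Front-to-back alpha compositing along one view ray.

    samples: list of (alpha, color) pairs for the Gaussians intersecting the
    ray, sorted near-to-far; alpha is the Gaussian's effective opacity at the
    ray's intersection point. Returns accumulated RGB radiance.
    """
    radiance = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for alpha, color in samples:
        w = transmittance * alpha
        radiance = [r + w * c for r, c in zip(radiance, color)]
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-4:  # early termination, as in tile-based splatting
            break
    return radiance

# A near-opaque red Gaussian in front of a blue one:
rgb = composite_ray([(0.8, (1.0, 0.0, 0.0)), (0.9, (0.0, 0.0, 1.0))])  # → [0.8, 0.0, 0.18]
```

Because the compositing weights are differentiable in each Gaussian's parameters, the photometric rendering loss used for submap optimization can backpropagate through this accumulation.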
6. Key Insights, Implications, and Future Directions
- Dense Grounding: The explicit pairing of linguistic phrases to 3D objects is essential for minimizing hallucination and maximizing reference accuracy. Removal of grounding tokens leads to a measurable drop in evaluation metrics (see ablations in (Yang et al., 2024)).
- Data Scaling Law: Performance on grounding and faithfulness tasks scales nearly linearly with data volume, with no saturation observed at the 6M instruction level. Densely grounded data variants consistently outperform non-grounded counterparts.
- Sim-to-Real Transfer: Models trained exclusively on synthetic 3D-GRAND data transfer effectively to real scans, with strong zero-shot results and no fine-tuning on real data, indicating robust generalization.
- Standardization: 3D-POPE establishes a reproducible, public leaderboard and standardized benchmark for 3D-LLM hallucination and grounding, facilitating fair comparison across architectures.
- Embodied AI Directions: Research advocates integrating real-scan fine-tuning, multi-view LLM-vision fusion, and robotic embodiment (action grounding) leveraging 3D-GRAND pretraining as a foundation for more capable and reliable embodied agents.
3D-GRAND, as both a dataset for instruction-tuned vision-LLMs and as a core component in scalable multi-agent SLAM, provides critical infrastructure for achieving accurate, trustworthy 3D understanding in embodied AI systems (Yang et al., 2024, Thomas et al., 23 Jun 2025).