
3D-GRAND: 3D Dataset & Multi-Agent SLAM

Updated 20 January 2026
  • 3D-GRAND denotes two related lines of work: a large-scale, instruction-tuned dataset for 3D-LLMs, and an extension of multi-agent SLAM based on Gaussian splatting.
  • The dataset features 40,087 3D scenes with 6.2 million instruction pairs, achieving up to 95% grounding accuracy while significantly reducing hallucination through detailed validation.
  • The multi-agent SLAM component leverages submap tracking, loop closures, and pose-graph optimization to boost indoor rendering fidelity and reduce trajectory errors in large-scale environments.

3D-GRAND encompasses two distinct but thematically related research contributions spanning large-scale datasets for 3D language-vision models and advanced multi-agent SLAM systems based on Gaussian splatting. In context, "3D-GRAND" refers to: (i) a million-scale dataset and benchmarking suite for instruction-tuned 3D-LLMs targeting robust 3D grounding and hallucination reduction (Yang et al., 2024); and (ii) an abbreviation for the 3D extension of GRAND-SLAM—an architecture for scalable, globally consistent, multi-agent Gaussian SLAM (Thomas et al., 23 Jun 2025).

1. 3D-GRAND: Definition and Scope

3D-GRAND, as introduced in (Yang et al., 2024), is a large-scale dataset specifically designed to bridge language and 3D perception for embodied AI. It contains 40,087 richly tagged household scenes (mainly from 3D-FRONT and Structured3D) and 6.2 million densely-grounded scene-language instruction pairs. The dataset enables instruction-tuning of 3D large language models (3D-LLMs), supporting dense and faithful grounding of noun phrases to 3D objects, and provides a rigorous evaluation benchmark (3D-POPE) for systematic hallucination assessment.

In the context of multi-agent SLAM, "3D-GRAND" (as an Editor's term: "Gaussian Reconstruction via Multi-Agent Dense SLAM") refers to an extension of the GRAND-SLAM framework for collaborative scene reconstruction using 3D Gaussian splatting, tailored for both indoor and large-scale outdoor environments (Thomas et al., 23 Jun 2025).

2. Dataset Construction and Statistics

The 3D-GRAND dataset construction pipeline in (Yang et al., 2024) consists of several distinct stages:

  • Scene Collection: 38,000+ scenes are drawn from the 3D-FRONT and Structured3D repositories, featuring diverse room types. Meshes/layouts are rendered into colored point clouds via multi-view fusion (Blender-based pipeline), with spatial separation enforced by layout masks.
  • Densely-grounded Annotation: For each scene, the following procedure is executed:
    • Object Extraction: 2D object crops are generated; if ground truth is absent, "set-of-mark" GPT-4V prompts are used.
    • Attribute Tagging: Object category, color, finish, and texture are inferred using open-vocabulary GPT-4V processing.
    • Scene-Graph Assembly: Each object is represented as a JSON entry with (category, centroid (x, y, z), extents (w, h, d), attributes).
    • Instruction Generation: GPT-4/GPT-4-Turbo templates produce eight classes of scene-language tasks, including grounded object reference, scene description, question answering (existence, spatial, count), and multi-turn or landmark-driven variants.
    • Filtering and Augmentation: Low-quality or hallucinated outputs are filtered, and phrase-to-object tags are augmented via prompt-based approaches.
    • Human Validation: 10,200 scene-annotation pairs are validated by three raters per sample on Hive.ai (≥2/3 agreement); text truthfulness is 85–88% and grounding accuracy is 92–95%.
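The scene-graph assembly step above can be sketched in Python. The field names and exact schema here are illustrative assumptions, not the dataset's actual format:

```python
import json

def make_scene_graph_entry(obj_id, category, centroid, extents, attributes):
    """Build one object entry of a per-scene JSON scene graph.

    Field names are hypothetical; the actual 3D-GRAND schema may differ.
    """
    x, y, z = centroid          # object centre in scene coordinates
    w, h, d = extents           # axis-aligned bounding-box size
    return {
        "id": obj_id,
        "category": category,
        "centroid": [x, y, z],
        "extents": [w, h, d],
        "attributes": attributes,  # e.g. color/finish/texture from GPT-4V tagging
    }

# One entry per object; the full scene graph is a list of such entries.
scene_graph = [
    make_scene_graph_entry(12, "chair", (1.2, 0.0, 3.4), (0.5, 0.9, 0.5),
                           {"color": "blue", "finish": "matte"}),
]
print(json.dumps(scene_graph, indent=2))
```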

The key descriptive statistics are summarized below:

Attribute                Value
Scenes                   40,087
Instruction pairs        6.20 million
Avg. pairs per scene     155
Dense grounding          100% of noun phrases → 3D objects
Tasks/splits             8 task variants
Avg. objects per scene   86.4
Unique categories        182
Real-scan test set       1,400 ScanNet scenes (zero-shot)

This resource aims to provide explicit, high-quality grounding information critical for instruction-tuned 3D-LLMs.

3. Model Architectures, Training, and Losses

The primary instruction-tuned 3D-LLM architecture described in (Yang et al., 2024) consists of:

  • Backbone: Llama-2 (70B parameters), equipped with LoRA adapters for efficient fine-tuning.
  • Input Fusion: Object-centric 3D scene-graph descriptions prepended as JSON tokens (e.g., <obj id=12,cat=chair,x=1.2,...>).
  • Query Encoding: Instructions or questions appended to the textual context.
  • Output: Free-form response with explicit inline grounding tags (e.g., <ground obj=12>).
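The input fusion above can be illustrated with a minimal sketch that serializes the scene graph into inline object tokens and appends the query. The token syntax follows the examples in the text; the real tokenizer-level handling is assumed, not specified here:

```python
def build_prompt(scene_graph, instruction):
    """Prepend object-centric scene context as inline tokens, then the query.

    Token format (<obj id=...,cat=...,x=...>) mirrors the example in the text;
    the exact attribute set is an assumption.
    """
    obj_tokens = []
    for obj in scene_graph:
        x, y, z = obj["centroid"]
        obj_tokens.append(f"<obj id={obj['id']},cat={obj['category']},x={x},y={y},z={z}>")
    return "\n".join(obj_tokens) + "\n" + instruction

prompt = build_prompt(
    [{"id": 12, "category": "chair", "centroid": (1.2, 0.0, 3.4)}],
    "Where is the blue chair?",
)
# A grounded response would embed inline tags such as "<ground obj=12> ...".
```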

The training objective incorporates three loss terms:

\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda_{\text{ground}} \mathcal{L}_{\text{ground}} + \lambda_{\text{hall}} \mathcal{L}_{\text{hall}}

  • \mathcal{L}_{\text{CE}}: Standard cross-entropy on the target sequence.
  • \mathcal{L}_{\text{ground}}: Negative log-likelihood over phrase-object pairs, with y_{ij} = 1 if phrase i grounds to object j.
  • \mathcal{L}_{\text{hall}}: BCE loss to penalize hallucinated object mentions in negative examples. Coefficients: \lambda_{\text{ground}} = 1.0, \lambda_{\text{hall}} = 0.1.
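The combined objective can be sketched with NumPy. The forms of the auxiliary terms (softmax-NLL for grounding, BCE for hallucination) follow the descriptions above, but the tensor shapes and numerical details are assumptions:

```python
import numpy as np

def total_loss(ce, phrase_obj_logits, gt_pairs, hall_logits, hall_labels,
               lam_ground=1.0, lam_hall=0.1):
    """Combine the three terms: L = L_CE + lam_ground*L_ground + lam_hall*L_hall.

    phrase_obj_logits: (num_phrases, num_objects) grounding scores (assumed shape).
    gt_pairs: index of the ground-truth object for each phrase.
    hall_logits/hall_labels: per-object existence predictions and 0/1 labels.
    """
    # L_ground: NLL of the correct object under a per-phrase softmax.
    probs = np.exp(phrase_obj_logits - phrase_obj_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    l_ground = -np.mean(np.log(probs[np.arange(len(gt_pairs)), gt_pairs] + 1e-9))
    # L_hall: binary cross-entropy on object-existence predictions.
    p = 1.0 / (1.0 + np.exp(-hall_logits))
    l_hall = -np.mean(hall_labels * np.log(p + 1e-9) +
                      (1 - hall_labels) * np.log(1 - p + 1e-9))
    return ce + lam_ground * l_ground + lam_hall * l_hall
```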

Explicit phrase-to-object tags are crucial for minimizing hallucination and improving faithfulness in object reference.

4. 3D-POPE Benchmark and Quantitative Evaluation

3D-POPE (3D Polling-based Object Probing Evaluation) is introduced as a benchmark for standardized, compositional hallucination evaluation in 3D-LLMs (Yang et al., 2024). The design comprises:

  • Dataset: Derived from the ScanNet200 validation set (141 scenes, 200 classes).
  • Task Format: Each sample is a (scene, object category Q, binary answer) triple, balanced 1:1 positive:negative.
  • Negative Sampling: Random, popular (top-k frequent absents), and adversarial (highest co-occurrence absent).

Evaluation metrics are standard:

\text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \quad \text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \quad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \quad \text{Accuracy} = \frac{\mathrm{TP}+\mathrm{TN}}{\text{Total}}, \quad H \ (\text{hallucination rate}) = 1 - \text{Precision}

All models are tested zero-shot on ScanNet to ensure strict comparability.
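These metrics reduce to a small helper over a binary confusion matrix:

```python
def pope_metrics(tp, fp, tn, fn):
    """Standard 3D-POPE metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "hallucination_rate": 1.0 - precision,  # H = 1 - Precision
    }

print(pope_metrics(tp=80, fp=20, tn=70, fn=30))
```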

Representative Results

Model             Precision   Recall   F1      Accuracy   Yes (%)
Random Baseline   50.00       50.00    50.00   50.00      50.00
3D-LLM            50.03       99.88    66.67   50.07      99.81
3D-VisTA          50.12       53.58    51.79   49.66      53.95
LEO               51.95       77.65    62.25   52.91      74.73
3D-GRAND          93.34       84.25    88.56   89.12      45.13
  • Precision for 3D-GRAND exceeds 93% in random and ≥70% in all negative sampling regimes, outperforming all prior baselines (≤52%).

On ScanRefer zero-shot (sim-to-real), 3D-GRAND achieves 38.0% grounding accuracy ([email protected]), compared to 30.3% for 3D-LLM and 17.1% for LLM-Grounder.

A data-scaling effect is observed: [email protected] grows roughly linearly from ≈20% to ≈38% as training data increases from 0.2M to 6.2M examples, while hallucination drops from ≈30% to ≈7%.

5. 3D-GRAND in Multi-Agent Gaussian SLAM

The 3D-GRAND approach constitutes the 3D extension of GRAND-SLAM's collaborative multi-agent Gaussian splatting-based SLAM (Thomas et al., 23 Jun 2025). The methodology integrates:

  1. 3D Gaussian Splatting for Scene Representation: Scene geometry and appearance are modeled as collections of explicit, anisotropic Gaussians G_i(\mu_i, \Sigma_i, o_i, c_i); rendering utilizes radiance accumulation along view rays.
  2. Submap-Based Local Tracking and Optimization: Each robot maintains overlapping submaps, seeded adaptively in under-observed regions and incrementally optimized via photometric and geometric rendering losses across multiple keyframes. Frame-to-map registration uses a hybrid color-depth odometry, refined through differentiable rendering alignment.
  3. Loop Closure and Pose-Graph Optimization: Both intra- and inter-agent loop closures are established via NetVLAD feature matching, followed by initial pose estimation, colored point cloud ICP refinement, and constraint gating (based on fitness and RMSE). The resulting pose-graph (nodes: keyframes; edges: tracking/loops) is optimized (e.g., via GTSAM Levenberg–Marquardt), updating all submap origins and globally aligning Gaussians.
  4. Performance Highlights:
    • On Replica indoor datasets, 3D-GRAND (as GRAND-SLAM) yields a 28% improvement in RGB PSNR over prior strong baselines, and trajectory errors down to 0.25–0.27 cm.
    • On outdoor Kimera-Multi sequences (1.85 km), it exhibits a 91% reduction in tracking error (ATE RMSE: 4.99 m vs. 60.79 m) and maintains rendering fidelity (PSNR ≈28 dB, SSIM ≈0.97, LPIPS ≈0.11), substantially surpassing Gaussian-SLAM and MAGiC-SLAM.
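The radiance accumulation in step 1 reduces to front-to-back alpha compositing over depth-sorted Gaussians. A minimal NumPy sketch, assuming each Gaussian's effective per-pixel alpha has already been evaluated (the real splatting rasterizer is considerably more involved):

```python
import numpy as np

def composite_ray(colors, opacities):
    """Front-to-back alpha compositing along one view ray.

    colors:    (N, 3) RGB of Gaussians sorted near-to-far along the ray.
    opacities: (N,) effective alpha of each Gaussian at this pixel (assumed
               precomputed from o_i and the projected covariance Sigma_i).
    """
    out = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, opacities):
        out += transmittance * a * c    # contribution weighted by remaining light
        transmittance *= (1.0 - a)      # light absorbed by this Gaussian
    return out

# Two Gaussians on a ray: a red one in front, a blue one behind it.
pixel = composite_ray(np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),
                      np.array([0.6, 0.5]))
```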

6. Key Insights, Implications, and Future Directions

  • Dense Grounding: The explicit pairing of linguistic phrases to 3D objects is essential for minimizing hallucination and maximizing reference accuracy. Removal of grounding tokens leads to a measurable drop in evaluation metrics (see ablations in (Yang et al., 2024)).
  • Data Scaling Law: Performance on grounding and faithfulness tasks scales nearly linearly with data volume, with no saturation observed at the 6M instruction level. Densely grounded data variants consistently outperform non-grounded counterparts.
  • Sim-to-Real Transfer: Models trained exclusively on synthetic 3D-GRAND data transfer effectively to real scans, with strong zero-shot results and no fine-tuning on real data, indicating robust generalization.
  • Standardization: 3D-POPE establishes a reproducible, public leaderboard and standardized benchmark for 3D-LLM hallucination and grounding, facilitating fair comparison across architectures.
  • Embodied AI Directions: Research advocates integrating real-scan fine-tuning, multi-view LLM-vision fusion, and robotic embodiment (action grounding) leveraging 3D-GRAND pretraining as a foundation for more capable and reliable embodied agents.

3D-GRAND, as both a dataset for instruction-tuned vision-LLMs and as a core component in scalable multi-agent SLAM, provides critical infrastructure for achieving accurate, trustworthy 3D understanding in embodied AI systems (Yang et al., 2024, Thomas et al., 23 Jun 2025).
