SpatialMosaic: Multi-View 3D Spatial Reasoning
- SpatialMosaic is a large-scale multi-view dataset with 2M QA training pairs designed to advance spatial reasoning and 3D scene understanding.
- It leverages high-fidelity indoor scenes from ScanNet++ with strict occlusion and overlap constraints to simulate realistic, fragmented observations.
- The accompanying SpatialMosaicVLM integrates visual and explicit 3D geometry encoders, significantly boosting accuracy and robustness over prior methods.
SpatialMosaic is a large-scale multi-view instruction-tuning dataset and evaluation benchmark specifically constructed to advance spatial reasoning and 3D scene understanding in vision-language models (VLMs), particularly under partial visibility, occlusion, and low-overlap conditions. Building upon high-fidelity real-world indoor scenes from ScanNet++ and employing a scalable QA generation and annotation pipeline, SpatialMosaic introduces 2 million multimodal QA training pairs and a dedicated 1 million-sample benchmark (“SpatialMosaic-Bench”) covering six spatial reasoning tasks. The accompanying SpatialMosaicVLM architecture leverages both visual and explicit 3D geometry encoders, demonstrating substantial improvements over prior state-of-the-art methods in accuracy and transfer robustness (Lee et al., 29 Dec 2025).
1. Data Sources and Multi-View Sampling Strategy
SpatialMosaic is derived from 849 indoor scenes from ScanNet++ (Yeshwanth et al. 2023), each featuring dense RGB-D imagery and 3D semantic meshes. The dataset design enforces fragmented and complementary multi-view scenarios by sampling 2–5 view combinations per scene with strictly constrained 3D-point overlap. Specifically, the overlap between any two views $i$ and $j$ is defined by

$$\mathrm{Overlap}(i, j) = \frac{|P_i \cap P_j|}{\min(|P_i|, |P_j|)},$$

where $P_i$ aggregates all visible 3D points across object instances in view $i$. Only view pairs whose overlap falls below a fixed threshold are retained, to discourage trivial redundancy and elicit genuine multi-view reasoning.
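The overlap-constrained pair selection above can be sketched as follows; the threshold value `tau` and the representation of views as sets of visible 3D-point ids are illustrative assumptions, not values from the paper.

```python
from itertools import combinations

def view_overlap(points_a: set, points_b: set) -> float:
    """Fraction of shared visible 3D points between two views,
    normalized by the smaller view's point count (assumed form)."""
    if not points_a or not points_b:
        return 0.0
    return len(points_a & points_b) / min(len(points_a), len(points_b))

def filter_view_pairs(view_points: dict, tau: float = 0.3) -> list:
    """Keep only view pairs whose overlap stays below the threshold tau.
    `view_points` maps a view id to the set of visible 3D point ids."""
    kept = []
    for i, j in combinations(sorted(view_points), 2):
        if view_overlap(view_points[i], view_points[j]) < tau:
            kept.append((i, j))
    return kept
```

In this form, two views that see mostly the same geometry are rejected, while disjoint or weakly overlapping views survive into the sampled 2–5 view combinations.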
Occlusion and visibility are rigorously annotated:
- Object-level occlusion ratio: quantifies the fraction of an object's 3D points physically occluded by other scene geometry within a view.
- Field-of-view occlusion ratio: captures the proportion of an object lying beyond the observed image boundary, computed via extended rendering on an enlarged canvas.
A filtering protocol excludes objects that are fully visible in all selected views (offering no partial cues) or heavily occluded in all views (excessive difficulty), ensuring that the resulting samples reflect realistic fragmented observations.
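A minimal sketch of this filtering rule, assuming per-view occlusion and field-of-view ratios in [0, 1]; the "heavily occluded" cutoff `occ_hi` is an illustrative assumption:

```python
def keep_object(occlusion_ratios, fov_ratios, occ_hi=0.9):
    """Filtering-protocol sketch: drop objects that are fully visible in
    every selected view (no partial cues) or nearly fully occluded in
    every view (excessive difficulty). Thresholds are illustrative."""
    # Combined per-view invisibility from physical occlusion + FoV truncation.
    total = [min(1.0, o + f) for o, f in zip(occlusion_ratios, fov_ratios)]
    fully_visible_everywhere = all(t == 0.0 for t in total)
    occluded_everywhere = all(t >= occ_hi for t in total)
    return not (fully_visible_everywhere or occluded_everywhere)
```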
2. QA Generation Pipeline and Task Formulation
SpatialMosaic’s QA generation consists of a four-stage pipeline:
- Combination Filtering: Candidate view sets (2–5 per sample) are selected based on occlusion/overlap constraints.
- Object Instance Selection: For each view combination, all object instances appearing in at least one but not all views are considered.
- 3D Bounding-Box Transformation: Object locations are transformed into the camera frame of a designated “query view” by applying the known world-to-camera extrinsics to each 3D bounding box, enabling spatial relation extraction.
- Task-Specific QA Construction: Templates cover six distinct task types, each with systematic distractor generation through axis inversion or orthogonal contradiction:
| Task Name | Format | Description |
|---|---|---|
| Object Count | 4-way multiple choice | “How many [category] are there across these frames?” |
| Best-View Selection | 4-way multiple choice | “Which frame gives the best view of the [category]?” |
| Object Localization | Binary + bbox coordinates | “Is there an [object] in Frame 1? Where?” |
| Occlusion-Aware Object Existence | Binary | Relation-based existence between objects |
| Occlusion-Aware Attribute | 4-way single-answer | “Which [object] is to the left of the lamp?” |
| Occlusion-Aware Spatial Relation | 4-way multiple choice | “Where is X relative to Y?” |
Uniform sampling ensures balanced coverage across the six tasks, with ~333K QA pairs for each in the train partition.
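The template-based QA construction with distractor generation can be sketched for the spatial-relation task; relation names, the axis grouping, and the question wording are illustrative assumptions (and a real pipeline would shuffle the option order):

```python
def build_relation_mcq(obj_x: str, obj_y: str, relation: str) -> dict:
    """Construct a 4-way spatial-relation question: the correct relation,
    its axis inversion, and two orthogonal contradictions (sketch of the
    template-based construction; wording is illustrative)."""
    opposite = {"left of": "right of", "right of": "left of",
                "in front of": "behind", "behind": "in front of"}
    axes = {"left of": "lateral", "right of": "lateral",
            "in front of": "depth", "behind": "depth"}
    # Axis inversion gives one distractor; relations on the other axis
    # give two orthogonal contradictions.
    orthogonal = [r for r, a in axes.items() if a != axes[relation]]
    options = [relation, opposite[relation]] + orthogonal[:2]
    return {"question": f"Where is the {obj_x} relative to the {obj_y}?",
            "options": options,
            "answer": relation}
```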
3. Dataset Composition and Diagnostic Annotations
The SpatialMosaic dataset comprises approximately 8 million images (training split), 2 million multi-view QA training pairs, and 1 million QA pairs in the benchmark (SpatialMosaic-Bench), uniformly sampled across 849 scenes (679 for training, 170 for evaluation). Each sample involves 2–5 frames with diagnostic annotation per QA specifying:
- Visibility Scenario: Distinguishes fully versus partially visible targets.
- Ground Truth Coverage: Differentiates full versus partial scene coverage for category instances.
This annotation regime facilitates difficulty analysis and supports evaluation of spatial reasoning under authentic partial observability.
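A minimal sketch of how these diagnostic labels might be represented and used for difficulty analysis; the field names, label values, and bucketing scheme are hypothetical, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class DiagnosticAnnotation:
    """Per-QA diagnostic labels (field names are illustrative)."""
    visibility_scenario: str  # "fully_visible" or "partially_visible"
    gt_coverage: str          # "full" or "partial" scene coverage

def difficulty_bucket(ann: DiagnosticAnnotation) -> str:
    """Coarse difficulty bucketing for analysis (assumed scheme)."""
    if ann.visibility_scenario == "partially_visible" and ann.gt_coverage == "partial":
        return "hard"
    if ann.visibility_scenario == "fully_visible" and ann.gt_coverage == "full":
        return "easy"
    return "medium"
```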
4. Benchmarking Protocol and Evaluation Metrics
SpatialMosaic-Bench operationalizes six rigorous evaluation tasks replicating the train set structure. The main metrics are:
- Accuracy: Percentage of correct responses for multiple-choice or binary outputs.
- (Optional) F1 Score: Reported for binary tasks where relevant; although class imbalance was not observed, fine-grained token-level F1 can additionally be reported.
- Spatial Consistency: Alignment between predicted relations and ground-truth computed from 3D bounding-boxes.
Evaluations employ the aforementioned occlusion, visibility, and overlap measures to calibrate difficulty and interpret model performance.
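The two headline metrics reduce to standard computations; a minimal sketch (the `positive` label string is an assumption about the answer format):

```python
def accuracy(preds, golds):
    """Fraction of exactly matching multiple-choice or binary answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def binary_f1(preds, golds, positive="yes"):
    """F1 for binary existence tasks, treating `positive` as the positive class."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```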
5. SpatialMosaicVLM Model Architecture
SpatialMosaicVLM introduces a hybrid multi-view framework integrating explicit geometry modeling:
- Visual Encoder ($E_v$): CLIP ViT backbone generating per-view patch tokens $V_i$.
- Geometry Encoder ($E_g$): VGGT (2025) architecture producing spatial tokens $S_i$ and camera tokens $C_i$, concatenated into geometry tokens $G$ across views.
- Fusion: A cross-attention mechanism combines the visual tokens $V$ and geometry tokens $G$:

$$F = \mathrm{softmax}\!\left(\frac{(V W_Q)(G W_K)^{\top}}{\sqrt{d}}\right) G W_V,$$

with $W_Q$, $W_K$, $W_V$ as trainable projections.
- LLM Integration: $F$ undergoes a two-layer MLP projection, is concatenated with the tokenized question, and is passed to an LLM (e.g., LLaVA-NeXT).
- Training Objective: Cross-entropy loss for all QA tasks; a geometry-consistency loss is omitted, as both encoders are frozen and only the fusion module plus QA head are fine-tuned.
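The cross-attention fusion step can be sketched in NumPy under the assumption of a single head and generic token shapes (visual tokens attend over geometry tokens; symbol names follow the description above):

```python
import numpy as np

def cross_attention_fusion(V, G, Wq, Wk, Wv):
    """Single-head cross-attention sketch: visual tokens V (shape [n_v, d])
    query geometry tokens G (shape [n_g, d]) through trainable projections."""
    Q = V @ Wq                 # queries from visual tokens
    K = G @ Wk                 # keys from geometry tokens
    Vals = G @ Wv              # values from geometry tokens
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Numerically stable softmax over geometry tokens.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ Vals      # fused tokens, shape [n_v, d]
```

Each fused token is a geometry-conditioned mixture of values, which is what lets occlusion- and relation-sensitive questions draw on explicit 3D structure.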
A plausible implication is that explicit geometry encoding, via VGGT, provides significant gains in occlusion- and relation-sensitive settings where naïve vision transformers alone underperform.
6. Experimental Analysis and Performance Insights
Comprehensive results validate the utility of both dataset and architecture:
- SpatialMosaic-Bench: SpatialMosaicVLM (7B, fine-tuned) achieves 81.8% average accuracy; the best open-source baseline (LLaVA-NeXT-Video-7B) attains 47.8%, while VLM-3R (re-implemented, fine-tuned) reaches 81.7%. Fine-tuning on SpatialMosaic yields ≈34 percentage-point improvement over the strongest zero-shot baseline.
- Zero-Shot Transfer (VSTI-Bench): SpatialMosaicVLM attains 46.8% without further fine-tuning, outperforming the best baseline (44.0%), indicating robust generalization to camera-centric and temporal tasks.
- Ablation Studies: Removal of geometry encoder (VGGT) reduces accuracy by ≈12 points, underscoring the necessity of explicit 3D cues. Performance sharply decreases with increasing occlusion/view-overlap, confirming the metrics’ correlation with task difficulty. In extreme “Partially Visible + Partial Coverage” scenarios, SpatialMosaicVLM maintains >65% accuracy, compared to >20-point drops in all competing methods.
This suggests that carefully annotated, occlusion-aware multi-view QA construction in SpatialMosaic, combined with hybrid vision-geometry models, substantially advances spatial reasoning capabilities in VLMs (Lee et al., 29 Dec 2025).