
SpatialMosaic: Multi-View 3D Spatial Reasoning

Updated 25 January 2026
  • SpatialMosaic is a large-scale multi-view dataset with 2M QA training pairs designed to advance spatial reasoning and 3D scene understanding.
  • It leverages high-fidelity indoor scenes from ScanNet++ with strict occlusion and overlap constraints to simulate realistic, fragmented observations.
  • The accompanying SpatialMosaicVLM integrates visual and explicit 3D geometry encoders, significantly boosting accuracy and robustness over prior methods.

SpatialMosaic is a large-scale multi-view instruction-tuning dataset and evaluation benchmark constructed to advance spatial reasoning and 3D scene understanding in vision-language models (VLMs), particularly under partial visibility, occlusion, and low-overlap conditions. Building upon high-fidelity real-world indoor scenes from ScanNet++ and employing a scalable QA generation and annotation pipeline, SpatialMosaic introduces 2 million multimodal QA training pairs and a dedicated 1-million-sample benchmark (“SpatialMosaic-Bench”) covering six spatial reasoning tasks. The accompanying SpatialMosaicVLM architecture leverages both visual and explicit 3D geometry encoders, demonstrating substantial improvements over prior state-of-the-art methods in accuracy and transfer robustness (Lee et al., 29 Dec 2025).

1. Data Sources and Multi-View Sampling Strategy

SpatialMosaic is derived from 849 indoor scenes from ScanNet++ (Yeshwanth et al. 2023), each featuring dense RGB-D imagery and 3D semantic meshes. The dataset design enforces fragmented and complementary multi-view scenarios by sampling 2–5 view combinations per scene with strictly constrained 3D-point overlap. Specifically, the overlap between any two views $(i, j)$ is defined by $\text{Overlap}(i, j) = \frac{|V^i \cap V^j|}{|V^i \cup V^j|}$, where $V^i$ aggregates all visible 3D points across object instances in view $i$. Only view pairs with $\text{Overlap}(i, j) < 0.3$ are retained to discourage trivial redundancy and elicit genuine multi-view reasoning.
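The overlap criterion above amounts to a set-level IoU over visible 3D point IDs. A minimal sketch (helper names are illustrative, not the authors' released code):

```python
def view_overlap(points_i: set, points_j: set) -> float:
    """IoU-style overlap between the 3D points visible in two views,
    where each set aggregates visible point IDs across object instances."""
    union = points_i | points_j
    if not union:
        return 0.0
    return len(points_i & points_j) / len(union)

def keep_view_pair(points_i: set, points_j: set, threshold: float = 0.3) -> bool:
    # Retain only low-overlap pairs to elicit genuine multi-view reasoning.
    return view_overlap(points_i, points_j) < threshold
```

For example, two views sharing 2 of 6 distinct points overlap at roughly 0.33 and would be rejected under the 0.3 threshold.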

Occlusion and visibility are rigorously annotated:

  • Object-level occlusion ratio $r_{n,\text{obj}}$ quantifies the fraction of object $n$'s points physically occluded by other scene geometry within a view.
  • Field-of-view occlusion ratio $r_{n,\text{FoV}}$ captures the proportion of an object lying beyond the observed image boundary, computed via extended rendering on a $2H \times 2W$ canvas.

A filtering protocol excludes objects fully visible in all selected views (no partial cues) or with $>90\%$ occlusion in all views (excessive difficulty), ensuring that resulting samples reflect realistic fragmented observations.
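Under the stated thresholds, the filtering step can be sketched as follows (a minimal sketch; each per-view ratio is assumed precomputed from the object-level and field-of-view occlusion measures):

```python
def keep_object(occlusion_per_view, hidden_threshold=0.90):
    """occlusion_per_view: one combined occlusion ratio in [0, 1] per selected view.
    Drop objects fully visible everywhere (no partial cues) and objects
    more than 90% occluded in every view (excessive difficulty)."""
    fully_visible_everywhere = all(r == 0.0 for r in occlusion_per_view)
    hidden_everywhere = all(r > hidden_threshold for r in occlusion_per_view)
    return not (fully_visible_everywhere or hidden_everywhere)
```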

2. QA Generation Pipeline and Task Formulation

SpatialMosaic’s QA generation consists of a four-stage pipeline:

  1. Combination Filtering: Candidate view sets (2–5 per sample) are selected based on occlusion/overlap constraints.
  2. Object Instance Selection: For each view combination, all object instances appearing in at least one but not all views are considered.
  3. 3D Bounding-Box Transformation: Object locations are computed in the camera frame of a “query view” using $v^{(w)} \rightarrow v^{(c)} = R_{wc}(v - t_{wc})$, enabling spatial relation extraction.
  4. Task-Specific QA Construction: Templates cover six distinct task types, each with systematic distractor generation through axis inversion or orthogonal contradiction:
  • Object Count (4-way multiple choice): “How many [category] across these frames?”
  • Best-View Selection (4-way multiple choice): “Which frame shows the most of [category]?”
  • Object Localization (binary + bounding-box coordinates): “Is there an [object] in Frame 1? Where?”
  • Occlusion-Aware Object Existence (binary): relation-based existence between objects.
  • Occlusion-Aware Attribute (4-way single answer): “Which [object] is to the left of the lamp?”
  • Occlusion-Aware Spatial Relation (4-way multiple choice): “Where is X relative to Y?”
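The bounding-box transformation in step 3 can be sketched in dependency-free Python, assuming `R_wc` is a row-major 3x3 rotation matrix and `t_wc` the translation from the equation above:

```python
def world_to_camera(v_world, R_wc, t_wc):
    """Map a world-frame point into the query view's camera frame:
    v_c = R_wc @ (v_world - t_wc)."""
    d = [vw - tw for vw, tw in zip(v_world, t_wc)]
    return [sum(R_wc[r][c] * d[c] for c in range(3)) for r in range(3)]
```

With an identity rotation this reduces to a pure translation; once boxes are in the query view's camera frame, relations such as “left of” or “behind” can be read off the camera-frame axes.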

Uniform sampling ensures balanced coverage across the six tasks, with roughly 333K QA pairs per task in the train partition.

3. Dataset Composition and Diagnostic Annotations

The SpatialMosaic dataset comprises approximately 8 million images (training split), 2 million multi-view QA training pairs, and 1 million QA pairs in the benchmark (SpatialMosaic-Bench), uniformly sampled across 849 scenes (679 for training, 170 for evaluation). Each sample involves 2–5 frames, with per-QA diagnostic annotations specifying:

  • Visibility Scenario: Distinguishes fully versus partially visible targets.
  • Ground Truth Coverage: Differentiates full versus partial scene coverage for category instances.

This annotation regime facilitates difficulty analysis and supports evaluation of spatial reasoning under authentic partial observability.
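A per-sample record carrying these diagnostic annotations might look like the following sketch (field names are hypothetical, not the dataset's published schema):

```python
from dataclasses import dataclass

@dataclass
class MosaicQASample:
    # Illustrative schema for one SpatialMosaic QA sample.
    scene_id: str
    frame_ids: list             # 2-5 views per sample
    task: str                   # one of the six task types
    question: str
    answer: str
    visibility_scenario: str    # "fully_visible" or "partially_visible"
    gt_coverage: str            # "full" or "partial" scene coverage
```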

4. Benchmarking Protocol and Evaluation Metrics

SpatialMosaic-Bench comprises six evaluation tasks mirroring the training-set structure. The main metrics are:

  • Accuracy: Percentage of correct responses for multiple-choice or binary outputs.
  • F1 Score (optional): Used for binary tasks where relevant; severe class imbalance was not observed, but fine-grained token-level F1 can also be reported.
  • Spatial Consistency: Alignment between predicted relations and ground-truth computed from 3D bounding-boxes.

Evaluations employ the aforementioned occlusion, visibility, and overlap measures to calibrate difficulty and interpret model performance.
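The headline metrics reduce to straightforward counting; a minimal sketch (not the official evaluation script):

```python
def accuracy(preds, golds):
    """Fraction of exact matches for multiple-choice or binary answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def binary_f1(preds, golds, positive="yes"):
    """F1 over the positive class for binary existence questions."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```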

5. SpatialMosaicVLM Model Architecture

SpatialMosaicVLM introduces a hybrid multi-view framework integrating explicit geometry modeling:

  • Visual Encoder ($E_{vis}$): CLIP ViT backbone generating per-view patch tokens $F_{vis}^{(v)} \in \mathbb{R}^{T_{vis} \times d}$.
  • Geometry Encoder ($E_{geo}$): VGGT (2025) architecture producing spatial tokens $F_{spa}$ and camera tokens $z$, concatenated across views into $F_{geo} \in \mathbb{R}^{(T_{spa}+V) \times d}$.
  • Fusion: Cross-attention mechanism combines $F_{vis}$ and $F_{geo}$:

$$F_{fuse} = \text{softmax}\big((F_{vis} W_q)(F_{geo} W_k)^T / \sqrt{d_k}\big)(F_{geo} W_v)$$

with $W_q, W_k, W_v$ as trainable projections.

  • LLM Integration: $F_{fuse}$ is projected through a two-layer MLP, concatenated with the tokenized question, and passed to an LLM (e.g., LLaVA-NeXT).
  • Training Objective: Cross-entropy loss $L_{CE} = -\sum_{c \in \text{choices}} y_c \log p_c$ for all QA tasks; no geometry-consistency loss is used, since the encoders are frozen and only the fusion module and QA head are fine-tuned.
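The fusion equation can be sketched in dependency-free Python for small matrices; shapes and weights here are illustrative, not the model's actual dimensions:

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention_fuse(F_vis, F_geo, Wq, Wk, Wv):
    """F_fuse = softmax((F_vis Wq)(F_geo Wk)^T / sqrt(d_k)) (F_geo Wv):
    visual tokens query the geometry tokens."""
    Q, K, V = matmul(F_vis, Wq), matmul(F_geo, Wk), matmul(F_geo, Wv)
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k) for kr in K]
              for qr in Q]
    attn = [softmax(r) for r in scores]   # one distribution per visual token
    return matmul(attn, V)
```

With a single geometry token, every visual token attends to it with weight 1, so each fused row equals that token's value projection.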

A plausible implication is that explicit geometry encoding, via VGGT, provides significant gains in occlusion- and relation-sensitive settings where naïve vision transformers alone underperform.

6. Experimental Analysis and Performance Insights

Comprehensive results validate the utility of both dataset and architecture:

  • SpatialMosaic-Bench: SpatialMosaicVLM (7B, fine-tuned) achieves 81.8% average accuracy; the best open-source baseline (LLaVA-NeXT-Video-7B) attains 47.8%, while VLM-3R (re-implemented, fine-tuned) reaches 81.7%. Fine-tuning on SpatialMosaic yields ≈34 percentage-point improvement over the strongest zero-shot baseline.
  • Zero-Shot Transfer (VSTI-Bench): SpatialMosaicVLM attains 46.8% without further fine-tuning, outperforming best baseline (44.0%), indicating robust generalization to camera-centric and temporal tasks.
  • Ablation Studies: Removing the geometry encoder (VGGT) reduces accuracy by roughly 12 points, underscoring the necessity of explicit 3D cues. Performance degrades sharply with higher occlusion and lower view overlap, confirming that these metrics track task difficulty. In the extreme “Partially Visible + Partial Coverage” scenario, SpatialMosaicVLM maintains over 65% accuracy, whereas all competing methods drop by more than 20 points.

This suggests that carefully annotated, occlusion-aware multi-view QA construction in SpatialMosaic, combined with hybrid vision-geometry models, substantially advances spatial reasoning capabilities in VLMs (Lee et al., 29 Dec 2025).
