Spatial Reasoning Benchmarks: SAT & MMSI

Updated 10 February 2026
  • Spatial reasoning benchmarks are specialized evaluation suites that assess both static and dynamic spatial cognition in multimodal and vision-language models.
  • SAT emphasizes synthetic static and dynamic spatial aptitude testing with significant performance gains yet persistent challenges in dynamic transformations.
  • MMSI-Bench rigorously evaluates multi-image spatial integration using human-curated annotations to reveal error patterns in cross-view correspondence and spatial logic.

Spatial reasoning benchmarks are specialized evaluation suites designed to rigorously probe the spatial intelligence of multimodal and vision-language models. Two benchmarks—SAT (Spatial Aptitude Training) and MMSI-Bench (Multi-Image Spatial Intelligence Benchmark)—represent foundational resources in this domain, each targeting distinct axes of spatial reasoning, from dynamic egocentric transformations to multi-view geometric integration. These benchmarks inform both the diagnosis of current model limitations and the development of robust, three-dimensional spatial intelligence.

1. Key Concepts and Motivation

Spatial reasoning tasks demand the integration of multiple capabilities: understanding visual scenes, modeling transformations due to object or agent motion, integrating information across diverse viewpoints, and maintaining consistency in geometric relationships. Traditional tests (e.g., "Spatial Arrangement Task"—SAT in psychophysics) use single images or static geometric puzzles, but these are insufficient for modern applications such as embodied robotics, dynamic navigation, or multi-camera analytics.

MMSI-Bench was introduced to address limitations in earlier spatial benchmarks, emphasizing real-world multi-image integration, step-by-step reasoning annotation, and fine-grained error analysis (Yang et al., 29 May 2025). SAT was developed to train and assess both static and dynamic spatial awareness, providing synthetic, perfectly-annotated data for simulation-grounded instruction (Ray et al., 2024). Both benchmarks serve as diagnostic tools to localize failure modes and as catalysts for training data and architecture advancements.

2. SAT: Dynamic Spatial Aptitude Training

SAT is a synthetic benchmark focused on both static and dynamic spatial reasoning. Static tasks involve determining relationships from a single frame (e.g., left/right, depth ordering), while dynamic tasks require reasoning over motion, either of the agent (egocentric movement), objects, or hypothetical actions (goal-aiming, action consequence). Core mathematical formalisms include representation of agent and object coordinates, application of rotation matrices (for agent view updates), and calculation of object relative bearings.
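The egocentric formalism above can be made concrete: the agent's pose is a position plus a heading, and a relative bearing is obtained by rotating the world-frame offset into the agent's frame with a 2D rotation matrix. The sketch below is illustrative; the function and variable names are assumptions, not taken from the SAT codebase.

```python
import math

def relative_bearing(agent_pos, theta, obj_pos):
    """Bearing of an object in the agent's egocentric frame, in radians.

    agent_pos, obj_pos: (x, y) world coordinates; theta: agent heading
    (radians, counterclockwise from +x). 0 means straight ahead;
    positive means the object is to the agent's left.
    """
    dx = obj_pos[0] - agent_pos[0]
    dy = obj_pos[1] - agent_pos[1]
    # Rotate the world-frame offset by -theta (the 2D rotation-matrix
    # product written out component-wise) to get agent-frame coordinates.
    local_x = math.cos(theta) * dx + math.sin(theta) * dy
    local_y = -math.sin(theta) * dx + math.cos(theta) * dy
    return math.atan2(local_y, local_x)

# Agent at the origin facing +x; an object on the +y axis is to its left.
print(relative_bearing((0.0, 0.0), 0.0, (0.0, 1.0)))  # pi/2 ≈ 1.5708
# After the agent turns 90° to face +y, the same object is straight ahead.
print(relative_bearing((0.0, 0.0), math.pi / 2, (0.0, 1.0)))  # 0.0
```

A dynamic question such as "after turning left, is the object ahead or behind?" reduces to updating `theta` and re-evaluating the bearing.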

  • Dataset Composition: 218,000 QA pairs over 22,000 procedurally generated 3D indoor scenes, derived from the ProcTHOR–10K asset suite. 127,000 pairs are static (single image), while 86,000 probe dynamic capabilities. Each scene provides metadata for precise state labeling, enabling perfect answer annotation (Ray et al., 2024).
  • Dynamic Test Set: To benchmark real-world transfer, 805 photographs were annotated with 4,000 challenging dynamic spatial questions spanning egocentric movement, object movement, allocentric perspective, goal aiming, and action consequence.
  • Training Protocol: Instruction-tuning is performed with LoRA adapters on LLaVA architectures. Synthetic SAT training is demonstrated to yield improvements exceeding those of pseudo-annotated real image corpora.
  • Evaluation Results:
    • On traditional real-image spatial benchmarks (e.g., CVBench, BLINK, VSR), SAT-tuned models (LLaVA-13B) show dramatic gains, e.g., from 51.4% to 74.3% on CVBench when both static and dynamic SAT QAs are included.
    • On the dynamic spatial test set, SAT-tuned LLaVA-13B achieves 88.6% accuracy versus random or proprietary model baselines at ~50%, underscoring the efficacy of simulation-trained spatial faculties.
  • Limitations and Challenges: While static reasoning improves substantially, dynamic tests—especially those requiring fine-grained egocentric frame transformations—remain challenging for models, even after intensive training. Current synthetic data do not fully capture the diversity and complexity of real-world settings.
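The stratified reporting used in the results above (accuracy broken out by static vs. dynamic categories) can be sketched in a few lines. Category labels and the record layout here are illustrative assumptions, not SAT's actual schema.

```python
from collections import defaultdict

def stratified_accuracy(records):
    """Per-category accuracy over (category, predicted, gold) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, pred, gold in records:
        totals[category] += 1
        hits[category] += int(pred == gold)
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical mini-eval: two static items, one dynamic item.
demo = [("static", "left", "left"),
        ("static", "right", "left"),
        ("egocentric_movement", "closer", "closer")]
print(stratified_accuracy(demo))
# {'static': 0.5, 'egocentric_movement': 1.0}
```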

3. MMSI-Bench: Multi-Image Spatial Intelligence

MMSI-Bench is a large-scale, human-curated benchmark constructed to rigorously evaluate multi-image spatial reasoning in real-world contexts. Each question requires integrating at least two images—drawn from datasets spanning indoor scans, driving scenes, robotics, and egocentric video streams—and synthesizing information for position, attribute, and motion queries.

  • Dataset and Task Taxonomy:
    • 1,000 multiple-choice questions, each referencing an average of 2.55 images (range: 2–10 views), selected from >120,000 candidate images after 300+ expert annotation hours.
    • Question taxonomy spans position (camera–camera, object–object, region–region, etc.), attributes (measurement, appearance), motion (camera/object), and multi-step reasoning combining multiple atomic operations.
  • Evaluation Approach:
    • Metric: Exact-match accuracy.
    • Step-by-step annotated reasoning chains accompany each item, enabling error categorization (grounding error, overlap-matching/scene reconstruction, situation-transformation, spatial-logic error).
    • Human baseline: 97% accuracy; top proprietary model (OpenAI o3) at 41%; best open-source model (Qwen2.5-VL-72B) at ~31% (Yang et al., 29 May 2025).
  • Failure Modes:
    • Overlap-matching and scene-reconstruction errors dominate (~35–40%), indicating poor cross-view correspondence establishment.
    • Systematic grounding failures (25%), transformation errors (20%), and spatial logic errors (15%) reveal deep-seated bottlenecks in current multimodal models.
    • Parameter scaling provides diminishing returns; targeted training on diverse, multi-view geometry tasks is more effective.
  • Comparative Analysis:
    • MMSI-Bench fills a diagnostic gap left by static or template-based benchmarks such as SAT, CLEVR, VSR, GeoQA, by requiring explicit multi-view integration and stepwise reasoning.
    • Existing multi-image or synthetic benchmarks display smaller human–AI error gaps and lack the fine-grained, human-verified reasoning annotations present in MMSI.
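A minimal sketch of this evaluation scheme: exact-match accuracy over multiple-choice answers, plus a tally of the annotated error category for each item the model misses. The field names and record structure are assumptions for illustration, not MMSI-Bench's actual data format.

```python
from collections import Counter

def score(items, predictions):
    """Exact-match accuracy plus error-category counts for misses.

    items: dicts with 'answer' (gold option letter) and 'error_type'
    (the category assigned to the model's failure during analysis);
    predictions: parallel list of predicted option letters.
    """
    correct = 0
    errors = Counter()
    for item, pred in zip(items, predictions):
        if pred == item["answer"]:
            correct += 1
        else:
            errors[item["error_type"]] += 1
    return correct / len(items), errors

# Hypothetical two-item run: one hit, one spatial-logic miss.
demo_items = [{"answer": "A", "error_type": "grounding"},
              {"answer": "B", "error_type": "spatial_logic"}]
acc, errs = score(demo_items, ["A", "C"])
print(acc, dict(errs))  # 0.5 {'spatial_logic': 1}
```

Aggregating `errors` across the benchmark yields the error-distribution breakdown reported in the failure-mode analysis.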

4. Extensions: Video, Simulation, and Specialized Benchmarks

The trajectory of spatial reasoning evaluation moves further into video and dynamic environments. Suites such as MMSI-Video-Bench extend these benchmarks with temporal construction, motion understanding, planning, prediction, and cross-video reasoning (Lin et al., 11 Dec 2025). Performance on these dynamic suites is lower still: Gemini 3 Pro peaks at ~38% accuracy against a human baseline of ~96%. Errors are distributed across detailed grounding, geometric reasoning, and identification across frames.

Synthetic benchmarks such as STARE (multi-step 2D/3D geometric transformations, integrated puzzles, real-world reasoning) challenge models on multi-step simulation, showing human–AI performance gaps of >40% in complex categories (Li et al., 5 Jun 2025). Specialized tests (e.g., DynaSolidGeo, CubeBench, SolidGeo) probe 3D mental model construction, dynamic instantiation, and technically rigorous spatial mathematics, further elaborating the scope of spatial reasoning assessment (Wu et al., 25 Oct 2025, Gao et al., 29 Dec 2025, Wang et al., 27 May 2025).

5. Evaluation Methodologies and Metrics

Both SAT and MMSI-Bench employ accuracy as the primary metric but with crucial protocol distinctions:

  • SAT: Evaluated as the fraction of correct answers, stratified across static and dynamic categories. Real-image dynamic test and long-video embodied tasks are also considered. Ablations demonstrate the synergy between static and dynamic training data.
  • MMSI-Bench: Uses exact-match accuracy but incorporates granular error analysis via stepwise annotated reasoning. This enables automated diagnosis and longitudinal tracking of error distributions.
  • Process-Aware Evaluation: Emerging in DynaSolidGeo and MMSI-Bench, the inclusion of expert-annotated solution chains and process-qualified accuracy uncovers whether models achieve correct results through logical reasoning or shallow pattern recognition.
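The process-qualified notion above can be expressed compactly: an item counts only if the final answer is correct and every annotated reasoning step is judged sound. The data layout below is a hypothetical sketch of that idea.

```python
def process_qualified_accuracy(results):
    """Fraction of items with a correct answer AND an all-correct chain.

    results: list of (answer_correct: bool, step_judgments: list[bool]),
    where step_judgments holds per-step correctness from expert-annotated
    solution chains.
    """
    qualified = sum(1 for ok, steps in results if ok and all(steps))
    return qualified / len(results)

demo = [(True, [True, True]),    # right answer, sound reasoning
        (True, [True, False]),   # right answer via a flawed chain: excluded
        (False, [True, True])]   # wrong answer: excluded
print(process_qualified_accuracy(demo))  # ≈ 0.333
```

The gap between plain accuracy and process-qualified accuracy is precisely the signal that separates logical reasoning from shallow pattern matching.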

6. Research Directions and Future Challenges

Benchmarks reveal recurring bottlenecks: difficulty with dynamic perspective-taking, flawed cross-view identity mapping, failure in integrating partial cues, and breakdown on reasoning chains. Advances under exploration include:

  • Diversity scaling in training: Data scaling to millions of instruction-tuned spatial QAs (e.g., SenseNova-SI-8M) produces measurable improvements in core perspective-taking and multi-image spatial integration (Cai et al., 17 Nov 2025).
  • Process supervision: Annotated reasoning chains facilitate stepwise supervision or contrastive loss optimization.
  • World model integration: Test-time scaling and visual imagination modules (e.g., via MindJourney or ViSA) aim to supplement static inputs with synthesized views, though current world models bottleneck due to texture and layout fidelity limits (Jha et al., 5 Dec 2025).
  • Differentiable spatial representations and drawing-based interaction: Explicit addition of drawing primitives or 3D mesh representations enables diagnosis and mitigation of failures in geometric visualization (Wu et al., 11 Jun 2025).
  • Extension to text-only and theoretical tasks: Benchmarks such as SiT-Bench test symbolic spatial reasoning without visual cues, probing the boundary between language and geometric computation (Guo et al., 7 Jan 2026).

Systematic evaluation on SAT, MMSI, and derivative benchmarks is thus crucial for the advancement of general-purpose multimodal agents that must reason and act reliably in spatially complex, real-world environments.
