SpatialTree: How Spatial Abilities Branch Out in MLLMs

Published 23 Dec 2025 in cs.CV | (2512.20617v1)

Abstract: Cognitive science suggests that spatial ability develops progressively-from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic-negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.

Abstract PDF Upgrade to Chat

Summary

The paper presents a cognitive-science-inspired hierarchical taxonomy that decomposes spatial intelligence into four levels: perception, mental mapping, simulation, and agentic competence.
It introduces SpatialTree-Bench, a comprehensive benchmark that aggregates existing datasets and novel queries to evaluate multi-level spatial abilities in MLLMs.
Results reveal significant cross-level transfer dynamics, with fine-tuning on foundational tasks markedly enhancing higher-level reasoning and agentic performance.

Hierarchical Taxonomy and Benchmarking of Spatial Intelligence in Multimodal LLMs

Introduction and Motivation

Spatial understanding is foundational to cognition and intelligent interaction with the physical world. In multimodal LLMs (MLLMs), spatial abilities underpin high-level agentic reasoning and embodied action, yet have historically been evaluated via narrow, task-specific probes lacking theoretical unification. "SpatialTree: How Spatial Abilities Branch Out in MLLMs" (2512.20617) presents a cognitive-science-inspired hierarchical taxonomy for spatial intelligence, decomposing abilities into four levels—perception, mental mapping, simulation, and agentic competence. The paper introduces a capability-centric benchmark, SpatialTree-Bench, aggregating and augmenting spatial tasks to systematically diagnose spatial ability emergence, interaction, and transfer in current MLLMs.

Figure 1: SpatialTree’s four-layer hierarchy, progressing from foundational multimodal capabilities (L0) through basic perception (L1) to agentic competence (L4).

Taxonomy and Abilities

SpatialTree formalizes spatial intelligence in MLLMs across four levels:

L1 Perception: Intuitive, pre-linguistic capacities for geometry (distance, size, shape), motion (egocentric/allocentric), orientation (gravity, pose), relation (structural topology, correspondence), and localization (detection, grounding).
L2 Mental Mapping: Linguistic grounding of spatial primitives; semantic scene understanding, affordance recognition, perspective taking, memory formation, and cognitive map construction.
L3 Simulation: Internal reasoning via causal inference and sequential planning; mental simulation aligns with chain-of-thought paradigms, permitting strategy design and logical spatial deduction.
L4 Agentic Competence: Execution of goal-directed actions; embodied interaction as sequential decision-making combining multi-modal perception, spatial memory, and action generation in diverse scenarios.

The taxonomy captures the developmental progression of spatial cognition, aligning computational evaluation with classical theories in psychology and AI. Each level is illustrated with task instantiations and detailed queries, showing increased complexity and dependency between layers.

Benchmark Construction and Evaluation Protocols

SpatialTree-Bench is constructed by reorganizing existing datasets and bridging capability gaps with SpatialPlus, an auto-curated dataset for underrepresented abilities, especially agentic competence (L4). Level-specific expert models annotate foundational capabilities, while higher-level tasks rely on multi-modal reasoning QAs and action sequence prompts.

Figure 2: Data engines for each SpatialTree level instantiate QAs for granular model evaluation.

Benchmark evaluation emphasizes multi-choice, numeric accuracy, and LLM-as-a-Judge protocols, tailored to the nature of the abilities probed. The primary metric aggregation employs hierarchical weighting derived from correlation analysis and cognitive hierarchy, prioritizing foundational perception.

Figure 3: Distribution of metrics—multiple-choice forms the bulk, complemented by numeric metrics and LLM-based evaluation for open-ended abilities.

Ability Dependencies and Transfer Dynamics

Extensive benchmarking of mainstream open-source and proprietary MLLMs reveals marked patterns:

Orthogonality in L1: Foundational skills (geometry, motion, orientation, etc.) are largely uncorrelated, suggesting independence in acquisition.
Strong Correlation in L3/L4: High-level reasoning and agentic competence exhibit notable cross-skill dependencies, highlighting the compositional nature of advanced spatial reasoning.
Transfer Dynamics from Fine-tuning: Targeted SFT on atomic L1 abilities (e.g., distance estimation) yields negligible or negative intra-level gain but produces significant improvements (+36.0% in reasoning, +27.1% in manipulation) in higher-level, out-of-distribution tasks (Figure 4).
Figure 4: Fine-tuning on distance QAs (top) transfers zero-shot to unseen reasoning and robotic manipulation (middle and bottom), substantiating cross-level synergy.

Combining multiple foundational abilities in SFT produces synergistic improvements exceeding the sum of individual effects, particularly for abilities like motion and relation that require integrative reasoning.

Reinforcement Learning and Hierarchy-Awareness

Applying reinforcement learning with verifiable rewards (RLVR) using group relative policy optimization demonstrates that uniform RL strategies are suboptimal: low-level skills (L1) suffer from over-reasoning, degrading performance, while high-level tasks benefit from explicit logical elaboration. A hierarchy-aware reward mechanism—suppressing reasoning for perception but amplifying it for complex planning—enables consistent performance gains across all levels.

Benchmark Aggregation and Metric Weighting

SpatialTree aggregates scores using a hierarchical weighting scheme:

Figure 5: Hierarchical weighting for metric aggregation prioritizes foundational perceptual (L1) abilities, recursively summing child node scores.

Pearson correlation analysis and empirical performance inform atomic skill weighting, yielding a principled, holistic spatial intelligence metric for model comparison.

Representative QA Examples and Agentic Task Curation

SpatialTree-Bench includes a diverse set of QAs for each ability, using expert models and manual annotation to ensure capability coverage. For agentic competence, tasks are defined via discrete action primitive mapping, facilitating embodied evaluation in navigation and manipulation. Prompt engineering and explicit low-level context (e.g., correspondence, size, depth) significantly improve MLLM planning accuracy.

Figure 6: Correspondence prompting enhances navigation accuracy, substantiating cross-level guidance.

Implications and Future Directions

This work establishes a rigorous, hierarchical framework for spatial intelligence assessment in MLLMs, revealing critical cross-level transfer and compositional synergy among abilities. The results advocate for capability-centric training protocols—prioritizing foundational perception for pre-training, exploiting compositional RL for post-training, and utilizing real-world interaction for agentic competence acquisition.

The demonstrated need for hierarchy-aware strategies in both fine-tuning and RL training poses clear mandates for future model architectures and curriculum design. Further, the benchmark highlights critical gaps in open-source models relative to proprietary MLLMs, suggesting avenues for research in generalizable spatial reasoning and embodied AI.

Conclusion

SpatialTree introduces a cognitive-science-grounded taxonomy and benchmark to diagnose and scale spatial abilities in MLLMs. Empirical analysis uncovers significant cross-level dependency and transfer, particularly from foundational perception to high-level agentic competence. The framework paves the way for principled, capability-centric scaling of spatial intelligence, with direct implications for embodied agents, robotics, and autonomous systems. Future developments should target curriculum alignment with taxonomy structure, compositional training pipelines, and broader integration of agentic evaluation in multimodal model development.

Markdown Report Issue

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Explain it Like I'm 14

What this paper is about (in plain language)

This paper is about teaching and testing AI models that can both see and read (called multimodal LLMs, or MLLMs) to be smart about space. “Space” here means things like how big objects are, how far away they are, where they are, how they move, and how to use that information to plan and act. The authors build a “SpatialTree,” a simple step-by-step ladder of skills that starts with seeing and ends with acting, and they create a big test to see how well different AIs do at each step.

Think of it like learning to drive: first you judge distances and sizes (perception), then you describe where things are (mapping), then you plan routes (simulation), and finally you actually drive and turn the wheel (acting).

What questions did the researchers ask?

They mainly asked:

Can we organize “spatial smarts” into a clear set of levels, from basic seeing to real-world acting?
How do these levels depend on each other? Do strong basics help with advanced skills?
What kinds of training help or hurt these skills? For example, does making a model “think out loud” help with everything, or only some parts?

How they studied it (the approach, explained simply)

First, they built a four-level “SpatialTree” of abilities. Each level builds on the one below it:

Level 1: Perception (L1)
- The model senses basic facts from images or videos: size, distance, shape, motion, orientation (which way is up), where objects are, and how they relate (inside/outside, left/right).
Level 2: Mental Mapping (L2)
- The model connects what it sees to language: describing scenes in words, understanding action words, taking other viewpoints, noticing what objects can be used for (affordances), and building a simple “mental map” to remember where things are.
Level 3: Mental Simulation (L3)
- The model imagines what will happen next and plans steps: if I push this, will it fall? How do I get from here to there? It’s like mentally running a mini-simulator.
Level 4: Agentic Competence (L4)
- The model turns plans into actions: navigating a character in a game, telling a robot arm how to move, or choosing where to go next in a video of the real world.

Then they built a big benchmark (a set of tests) covering 27 specific skills across these four levels. They pulled together and cleaned many existing datasets, added missing pieces, and created new tasks—especially for the acting level (L4). Tasks included multiple-choice questions, number estimates (like distance), and action choices for agents. They tested many popular models (both open-source and commercial).

Finally, they ran training experiments:

Supervised fine-tuning (SFT): teaching a model with examples of the “right” answers for specific skills like distance or size.
Reinforcement Learning (RL): rewarding the model when it gets tasks right. They tried “think a lot” training (encouraging step-by-step explanations) and a new “auto-think” approach that only encourages thinking when it’s useful.

They also checked how skills relate using simple statistics: if a model is good at skill A, is it also good at skill B?

The main findings (what they discovered and why it matters)

Here are the big takeaways:

Basic skills are separate, advanced skills depend on each other.
- L1 (perception) abilities like distance, size, and motion don’t strongly move together—being great at one doesn’t guarantee being great at another.
- L3 and L4 (reasoning and acting) are strongly linked—if you’re good at one complex skill, you’re often good at others. Advanced tasks rely on combining several basics.
Training one basic skill can hurt other basics but help advanced tasks.
- Teaching only “distance” or only “size” sometimes makes other L1 skills worse (negative transfer).
- But surprisingly, it can still make L3/L4 tasks better. In other words, stronger basics can boost higher-level performance even if they conflict with other basics.
- Training multiple basics together leads to synergy. Combining distance + size + correspondence helps more than training each alone, and even fixes some earlier drops.
“Thinking a lot” isn’t always good.
- For complex reasoning and planning, encouraging the model to think step-by-step helps.
- For fast, intuitive perception (like estimating distance), overthinking makes things worse. It’s like overanalyzing when you just need a quick glance.
A simple fix: auto-think.
- The authors propose an “auto-think” strategy: don’t force long explanations for quick perception tasks, but do encourage them for complex planning. With this, reinforcement learning improved results across all levels more consistently.
Model performance snapshot.
- The strongest commercial “thinking” models did best overall, but the best open-source models were competitive and improving.

Why this research matters (the bigger picture)

It gives a clear roadmap for building spatially smart AI.
- Train and test AI step by step, from seeing to acting, instead of mixing everything and hoping it works.
- Use multi-skill training at the basic level to avoid conflicts and unlock synergy.
- Control when the model should “think out loud” and when it should rely on fast, intuitive perception.
It helps make better real-world agents.
- Robots that can grasp objects, navigate rooms, or help people.
- Game agents that plan and act intelligently.
- AR/VR assistants that understand where things are and what you can do with them.
It offers a shared benchmark.
- Researchers can compare models fairly, find weak spots, and know what to improve next.

A simple way to picture it

Learning spatial intelligence is like learning to drive:

L1: You quickly judge distance, size, and motion around you.
L2: You describe routes and remember landmarks.
L3: You plan the best path and predict what other cars will do.
L4: You actually drive smoothly and safely. If you overthink when parallel parking, you might do worse. But for planning a long trip, careful thinking helps. The key is knowing when to think hard and when to act fast.

View Paper Prompt View All Prompts

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of what remains missing, uncertain, or left unexplored in the paper—phrased concretely so future work can act on it:

L4 agentic competence is evaluated offline via discretized key–mouse primitives and multi-step MCQs; there is no closed-loop execution in simulators or real environments with measurable task success, recovery, and safety. Develop standardized interactive environments with executable action APIs and run-time feedback to assess end-to-end performance.
The “auto-think” strategy is described qualitatively (suppressing unnecessary deliberation, length penalties) but lacks a formal decision policy or gating mechanism specifying when to think vs. act intuitively. Define and test adaptive gating (e.g., confidence-based, task-type classifiers, uncertainty thresholds).
Pearson correlation analysis is observational and confounded by model family and scale; no controlled experiments isolate causal dependencies among abilities. Perform randomized ablations (e.g., keep model constant while varying single ability data) and use causal inference or Shapley value analyses to identify true drivers of cross-level performance.
Single-ability SFT causes negative transfer within L1, but mitigation strategies are not explored. Investigate multi-task optimization methods (e.g., gradient surgery, orthogonal adapters, decoupled heads, loss-balancing, task-aware regularization) to reduce intra-level interference.
Multi-ability “synergy” findings are based on one specific L1 combination (Distance+Size+Correspondence) in one model/training recipe. Test broader ability sets, different compositions, and multiple model families to determine generality and robustness.
RL experiments are limited to Qwen2.5-VL-7B using GRPO and MCQ-based verifiable rewards; the approach’s generality across architectures, reward forms, and open-ended metrics is unknown. Benchmark across diverse models and RL algorithms (e.g., PPO, DPO variants, offline RL) and introduce rewards tied to continuous outcomes (e.g., spatial error, trajectory adherence).
RL improves some high-level tasks while degrading Memory and certain L1 skills; causes are not diagnosed. Analyze token budget allocation, reasoning length vs. perception accuracy trade-offs, and introduce memory-specific rewards or architectures (e.g., external memory modules) to prevent regressions.
L4 action space is discretized and embodiment-agnostic, potentially losing important control fidelity and subtleties of physics. Evaluate continuous-control tasks (robot manipulators, locomotion) and measure low-level command accuracy, stability, and sim-to-real transfer.
The benchmark heavily relies on multiple-choice formats (≈70.7%) and LLM-as-a-Judge, which may favor test-taking heuristics and introduce subjectivity. Incorporate more reference-based, open-ended metrics and calibrate LLM judging with gold standards, human adjudication, and reliability audits.
Distractor quality for MCQs (often LLM-generated/rephrased) is not validated for difficulty balance or bias. Systematically grade distractors (e.g., IRT-based difficulty modeling), ensure non-leading options, and release adverse sets for robustness testing.
Memory (cognitive maps and retrieval) evaluation lacks standardized, transparent metrics; details are sparse. Define and publish metrics for map accuracy (e.g., IoU/IoE in BEV), localization error, temporal recall precision/recall, and occlusion-aware retrieval.
Expert-model-derived labels (DepthAnything3, OrientAnything, trackers) can inject systematic noise or bias; no label-quality audits are provided. Quantify error rates, propagate uncertainty into training/evaluation, and benchmark against human annotations on stratified subsets.
Many L1/L2 training sets are indoor (SUNRGBD, Hypersim, Matterport3D); outdoor, cluttered, weathered, and rare-scene coverage is limited. Expand to varied domains (urban, natural, extreme lighting/weather, sensor noise) and evaluate domain generalization.
Orientation and gravity sensing are derived from vision-only cues; multimodal sensing (IMU, depth, LiDAR) is not considered. Test multimodal inputs to disentangle visual orientation from priors and quantify benefits of additional sensors.
Cross-lingual spatial semantics and perspective-taking are not evaluated; tasks appear English-only. Create multilingual benchmarks capturing linguistic variability in spatial prepositions, frames of reference (egocentric/allocentric), and cultural differences.
The taxonomy references L0 “foundational multimodal capabilities,” but these are not defined or evaluated. Specify L0 abilities (e.g., OCR, grounding, temporal coherence) and measure their influence on L1–L4 performance.
Coverage gaps remain (despite SpatialPlus): e.g., occlusion reasoning, topological invariants beyond inside/outside/overlap, 3D rotational symmetry, multi-agent relational reasoning, and long-horizon partial observability. Add tasks explicitly targeting these abilities.
Conversion of complex reasoning datasets to “hybrid multi-option + LLM-as-a-Judge” may alter task demands and evaluation validity. Compare original task formulations to converted ones, quantify any shifts in difficulty or solution strategies.
The action mapping strategy aggregates embodiment-specific actions into unified key–mouse sequences; the mapping’s fidelity and ambiguity are not quantified. Report mapping error rates, reversibility, and ambiguity metrics; evaluate if mapping hinders learning of embodiment-specific affordances.
Data contamination risks are acknowledged only for a subset (robot arm) and not systematically audited across all tasks. Implement and publish rigorous decontamination checks (hash-based, scene-level, instruction-level) against known pretraining corpora for each evaluated model.
Average score weights and task difficulty normalization are not fully described; rankings may be sensitive to weighting. Publish the exact weighting scheme, justify choices, and include sensitivity analyses showing rank stability under alternative weights.
No compute, token budget, or latency analysis is provided for “thinking vs. perceiving” trade-offs. Measure inference-time costs, throughput impacts, and efficiency gains of auto-think gating; report compute-performance Pareto fronts.
The ability selection for SFT and RL focus (e.g., choosing three L1 abilities) was manual. Use principled selection (mutual information, causal discovery, feature importance) to pick abilities with maximal expected cross-level transfer.
The benchmark mixes single-image, multi-image, and video, but point clouds/meshes are only tangentially referenced; direct 3D inputs are not systematically evaluated. Add native 3D tasks (point clouds, meshes) to test true 3D spatial reasoning and mapping.
Tool-use at inference (e.g., external depth/pose estimators) is not discussed; models may benefit from calling tools for L1 tasks. Define tool-use protocols and evaluate how external perception tools impact the hierarchy without conflating pure MLLM ability.
Safety, alignment, and ethical considerations for agentic tasks are unaddressed (e.g., unsafe action sequences, privacy). Include safe RL constraints, risk-aware planning metrics, and evaluation in safety-critical scenarios.
Human cognitive alignment is asserted (Piagetian progression) but not empirically validated. Compare MLLM performance profiles to human benchmarks across age groups and task types to substantiate developmental parallels.
Scaling laws and sample efficiency for spatial abilities are unexplored. Derive data scaling curves per ability and identify diminishing returns or optimal curricula for hierarchical training.
The proposed hierarchy-aware reward design improves results, but its components (reward coefficients, length penalties) lack ablations. Publish detailed ablations to attribute gains to specific reward terms and quantify sensitivity.
The paper’s conclusions are cut short; future work directions (e.g., curriculum design, unified training schedules, memory architectures) are not elaborated. Provide a concrete research roadmap specifying stage-wise training regimes, modular architectures, and evaluation expansions.

View Paper Prompt View All Prompts

Practical Applications

Immediate Applications

The following applications leverage the paper’s benchmark, training insights, and inference strategies and can be deployed with current MLLMs and tooling.

Industry (ML model development and evaluation)
- Use SpatialTree-Bench for capability-centric model selection, A/B testing, and continuous evaluation in CI pipelines.
- Sector: software, AI platforms
- Tools/workflows: integrate SpatialTree-Bench tasks and metrics; adopt the L1–L4 taxonomy in internal dashboards to diagnose weaknesses and track improvements
- Assumptions/dependencies: benchmark coverage matches target domains; licensing for datasets; LLM-as-a-Judge reliability for certain tasks
Robotics and automation (manipulation and navigation)
- Boost agentic performance by targeted SFT on L1 geometry (e.g., distance) and blended L1 skill training (distance+size+correspondence), then validate gains on L4 tasks.
- Sector: robotics, manufacturing, logistics
- Tools/workflows: SpatialEngine-generated QA for L1; blended SFT recipes; task-specific eval; domain adaptation for grippers/arms
- Assumptions/dependencies: sim-to-real transfer; sensor calibration; safety guardrails for execution
- Apply “auto-think” inference (suppress unnecessary reasoning) to reduce latency and improve precision in intuitive perception tasks (e.g., alignment, orientation, counting).
- Sector: robotics, warehouse automation
- Tools/workflows: gated Chain-of-Thought at inference; token-length penalties for perceptual tasks
- Assumptions/dependencies: accurate task-type detection; stable latency/runtime budgets
Vision-language GUI agents
- Execute screen-based action sequences by mapping vision observations to high-level key–mouse primitives for RPA-like workflows and game automation.
- Sector: software RPA, gaming
- Tools/workflows: action mapping to motion primitives; discrete multi-step policies; replay and safety filters
- Assumptions/dependencies: OS-level control permissions; robust visual grounding under UI variability
Video analytics and retrieval
- Build cognitive maps and spatial memory retrieval for multi-view/video content: index “where” and “when” events occurred; query by spatial relations.
- Sector: media analytics, sports, retail, security
- Tools/workflows: BEV map generation; spatial captioning; retrieval APIs
- Assumptions/dependencies: data rights/privacy; multi-camera synchronization; evaluation alignment across numeric and semantic metrics
Education and training
- Deploy spatial reasoning tutors and puzzle assistants using L2/L3 tasks; evaluate learners’ progression across the L1–L4 hierarchy.
- Sector: education, edtech
- Tools/workflows: curriculum aligned to SpatialTree; multi-format QAs (MCQ, numeric); progress dashboards
- Assumptions/dependencies: age-appropriate content; accessibility; guardrails for accuracy
Assistive technologies and daily life
- Smartphone apps for quick distance/size estimation, object finding, and navigation cues in indoor settings.
- Sector: consumer software, accessibility
- Tools/workflows: L1 geometry and localization; on-device inference with gated reasoning for speed
- Assumptions/dependencies: camera quality; edge compute constraints; user safety and privacy
Safety auditing and policy readiness
- Use the taxonomy to audit embodied AI systems and GUI agents for capability coverage and failure modes; adopt auto-think gating to mitigate overthinking-induced errors in perception.
- Sector: compliance, safety certification
- Tools/workflows: capability checklists; benchmark-driven reporting; inference policy configurations
- Assumptions/dependencies: acceptance by regulators; clear thresholds for pass/fail; reproducible test suites
Academic research
- Design hierarchical curricula and controlled experiments to study transfer: validate negative transfers within L1 and cross-level synergy; compare RL vs. SFT outcomes with auto-think.
- Sector: academia, research labs
- Tools/workflows: SpatialTree-Bench; targeted SFT data generation; GRPO-based RL with hierarchy-aware rewards
- Assumptions/dependencies: compute availability; data decontamination; standardized metrics

Long-Term Applications

These applications extend the framework toward robust embodied agents, standardized evaluation, and sector-specific deployments that require further research, scaling, or development.

Generalist spatial agents across embodiments
- Build a “Spatial Agent OS” that unifies perception, cognitive maps, simulation, and action execution for robots, drones, AR glasses, and GUI agents.
- Sector: robotics, AR/VR, smart devices
- Potential products: unified agent runtime with motion primitives; cross-embodiment policy libraries; memory and planning modules
- Assumptions/dependencies: reliable L4 execution; multi-sensor fusion; safety certifications; long-horizon memory
Hierarchy-aware RL frameworks
- Industrialize RL with verifiable rewards and auto-think controllers that dynamically gate reasoning tokens by capability level and task difficulty.
- Sector: AI platforms, MLOps
- Potential products: “Reasoning Governor” service; RLVR pipelines with level-specific reward shaping
- Assumptions/dependencies: robust difficulty classifiers; stable reward signals; generalization beyond training distributions
Standardization and policy
- Establish sector-wide standards for spatial intelligence evaluation and safety, borrowing the L1–L4 taxonomy to define certification tiers for embodied AI and GUI agents.
- Sector: policy, standards bodies (ISO/IEEE), public procurement
- Potential tools: certification benchmarks, conformance test suites, reporting templates
- Assumptions/dependencies: cross-stakeholder consensus; transparent benchmarks; independent testing bodies
Spatial memory cloud services
- Persistent cognitive maps for home robots and smart environments (e.g., shared memory of object locations and navigational affordances).
- Sector: smart home, IoT
- Potential products: privacy-preserving map storage, household multi-agent coordination
- Assumptions/dependencies: robust identity/correspondence across views; privacy and consent; seamless updates
Advanced healthcare and surgical assistance
- High-stakes spatial reasoning (L3–L4) for surgical robotics, AR-guided procedures, and rehabilitation planning.
- Sector: healthcare
- Potential products: surgical navigation assistants; patient-specific spatial planning
- Assumptions/dependencies: clinical validation; regulatory approvals; integration with medical imaging
Autonomous systems and logistics
- Use mental simulation and sequential planning for autonomous navigation in warehouses, factories, and last-mile delivery; cross-level training to improve route planning and manipulation.
- Sector: logistics, supply chain, mobility
- Potential products: warehouse route planners; multi-agent coordination; task-level controllers
- Assumptions/dependencies: high reliability; robust domain adaptation; operations safety
Disaster response and teleoperation
- Agentic competence for remote navigation, environment reconstruction, and manipulation via cognitive maps and simulation.
- Sector: public safety, defense
- Potential products: teleop assistants with spatial memory; autonomous scouting agents
- Assumptions/dependencies: bandwidth and latency constraints; extreme condition robustness; ethical governance
Educational assessment frameworks
- Longitudinal evaluation of spatial cognition using the hierarchy to track progression and tailor interventions; connect human assessments with model-based diagnostics.
- Sector: education policy, edtech research
- Potential tools: standardized spatial ability assessments; personalized learning pathways
- Assumptions/dependencies: fair, validated measures; inclusivity; data protection

Cross-cutting assumptions and dependencies

Data quality and coverage: Benchmarks and training data must match deployment domains; maintain rigorous data decontamination to avoid leakage.
Compute and latency: Auto-think gating depends on accurate task-type detection and token budgets; edge vs. cloud trade-offs for consumer and robotics use.
Reliability and safety: L4 agents require strong safety constraints, fail-safes, and regulatory compliance; GUI and physical agents need audit trails.
Generalization: Transfer effects observed (negative within L1, synergistic cross-level) may vary by domain and model family; continued validation is required.
Evaluation robustness: Numeric metrics and LLM-as-a-Judge protocols must be calibrated; align evaluation with end-user success criteria.

View Paper Prompt View All Prompts

Glossary

Affordance: The functional possibilities an object offers for interaction (e.g., grasping, sitting). "For other understanding tasks, such as affordance, we build on partially annotated datasets"
Allocentric: A world-centered frame of reference for perceiving external object motion and speed. "Allocentric (perceiving the movement and speed of external objects)"
Agentic Competence: The ability to translate plans into effective actions in dynamic environments. "Agentic Competence represents the culmination of spatial intelligence, bridging the gap between cognitive planning and practical execution."
Agentic decision-making: Decision-making that integrates perception and reasoning to act in environments. "combine perception and reasoning to support complex, agentic decision-making."
Auto-think strategy: A method that suppresses unnecessary reasoning to improve performance across tasks. "We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels."
BEV maps: Bird’s-eye-view maps, a top-down spatial representation derived from visual data. "to generate BEV maps from videos"
Causal Reasoning: Modeling cause–effect relationships in spatial interactions and dynamics. "Causal Reasoning allows MLLMs to model spatial interactions, physical dynamics, and entity relationships within a simulated mental space."
Chain-of-Thought (CoT): Explicit step-by-step reasoning generated by a LLM. "explicit Chain-of-Thought (CoT) reasoning"
Cognitive Map: An internal, global representation that integrates observations over time and views. "constructing a Cognitive Map, which synthesizes fragmentary observations (e.g., video frames or multi-view images) into a compact and unified global representation."
Correspondence: Recognizing the same object or landmark across different viewpoints or conditions. "Correspondence (recognizing the same object or landmark across different viewpoints or visual conditions)"
Data Decontamination: Ensuring no overlap between training and test data to avoid leakage. "Data Decontamination: The robotic arm data samples used for GRPO training are strictly separated from the SpatialTree-Bench testing data."
Egocentric: A self-centered frame of reference for sensing one’s own motion and heading direction. "Egocentric (sensing self-motion and heading direction)"
Group Relative Policy Optimization (GRPO): A reinforcement learning algorithm for optimizing policies with grouped comparisons. "we employ Group Relative Policy Optimization (GRPO)"
Grounding: Associating visual observations with spatial positions or coordinates. "Grounding (associating visual observations with spatial positions or coordinates)"
GUI Agents: Agents that interact with environments via graphical user interfaces using language actions. "like GUI Agents"
LLM-as-a-Judge: Using a LLM to evaluate or score outputs instead of fixed metrics. "converted to a hybrid multi-option + LLM-as-a-Judge format"
Mental Mapping: Aligning spatial perception with language and constructing language-structured memory. "Mental Mapping (L2) marks a shift toward alignment with language."
Mental Simulation: Internal reasoning that simulates spatial interactions before execution. "commonly referred to as mental simulation."
Multimodal LLMs (MLLMs): LLMs that process multiple input modalities such as text, images, and video. "In multimodal LLMs (MLLMs), these abilities form the cornerstone of Spatial Intelligence (SI)"
Pearson correlation coefficients: Statistical measures of linear correlation used to analyze ability dependencies. "using Pearson correlation coefficients computed from our benchmark scores."
Perspective Taking: Mentally adopting alternative viewpoints to interpret spatial scenes. "Perspective Taking—the ability to mentally align with alternative viewpoints—"
Reinforcement Learning with Verifiable Rewards (RLVR): RL approach where rewards are automatically validated against task criteria. "we further investigate the potential of Reinforcement Learning with Verifiable Rewards (RLVR)"
Sequential Planning: Designing step-by-step strategies and abstract routes to achieve goals. "Sequential Planning converts causal insights into coherent, goal-directed action plans expressed in language."
Spatial Captioning: Producing linguistic descriptions that capture spatial semantics of visual scenes. "through Spatial Captioning"
Spatial Engine: A data pipeline integrating expert models to generate annotations and task QAs. "we construct a Spatial Engine that integrates multiple expert models"
Topology: Basic spatial relations and configurations like inside, outside, and overlap. "Topology (perceiving basic spatial configurations such as inside, outside, or overlap)"
Vision-Language Action Models (VLAs): Models that decode vision-language inputs into low-level robotic control signals. "Unlike Vision-Language Action Models (VLAs) decording the low-level control signals in robotics"
Zero-shot: Generalizing to tasks without task-specific training or fine-tuning. "transfers in a zero-shot manner"

SpatialTree: How Spatial Abilities Branch Out in MLLMs

Summary

Hierarchical Taxonomy and Benchmarking of Spatial Intelligence in Multimodal LLMs

Introduction and Motivation

Taxonomy and Abilities

Benchmark Construction and Evaluation Protocols

Ability Dependencies and Transfer Dynamics

Reinforcement Learning and Hierarchy-Awareness

Benchmark Aggregation and Metric Weighting

Representative QA Examples and Agentic Task Curation

Implications and Future Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

What this paper is about (in plain language)

What questions did the researchers ask?

How they studied it (the approach, explained simply)

The main findings (what they discovered and why it matters)

Why this research matters (the bigger picture)

A simple way to picture it

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Practical Applications

Immediate Applications

Long-Term Applications

Cross-cutting assumptions and dependencies

Glossary

Open Problems

Continue Learning

Authors (8)

Collections

Tweets

YouTube

SpatialTree: How Spatial Abilities Branch Out in MLLMs

Summary

Hierarchical Taxonomy and Benchmarking of Spatial Intelligence in Multimodal LLMs

Introduction and Motivation

Taxonomy and Abilities

Benchmark Construction and Evaluation Protocols

Ability Dependencies and Transfer Dynamics

Reinforcement Learning and Hierarchy-Awareness

Benchmark Aggregation and Metric Weighting

Representative QA Examples and Agentic Task Curation

Implications and Future Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

What this paper is about (in plain language)

What questions did the researchers ask?

How they studied it (the approach, explained simply)

The main findings (what they discovered and why it matters)

Why this research matters (the bigger picture)

A simple way to picture it

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Practical Applications

Immediate Applications

Long-Term Applications

Cross-cutting assumptions and dependencies

Glossary

Open Problems

Continue Learning

Related Papers

Authors (8)

Collections

Tweets

YouTube