Necessity of a 3D spatial latent representation for mental rotation

Determine whether rotation actions in the proposed mechanistic model of mental rotation must operate on the 3D spatial latent representation produced by the Equivariant Neural Renderer, or whether applying them directly to the neuro-symbolic sequence representation generated by the Vision Symbolic Model is enough to achieve accurate similarity judgments. In other words, ascertain whether the 3D latent space is necessary for solving the Shepard–Metzler mental rotation task within this architecture.

Background

The paper introduces a three-module model for human mental rotation: an Equivariant Neural Renderer (EqNR) that yields a manipulable 3D spatial latent, a Vision Symbolic Model (VSM) that maps the spatial latent to a symbolic sequence under the Quadrant Hypothesis, and a decision MLP that selects rotation actions or similarity judgments based on pairs of symbolic descriptions.

While the authors argue for a hybrid spatial–symbolic account that matches behavioral signatures from their VR experiments, they note that, in principle, rotation actions could be implemented directly in symbolic space without using the 3D spatial latent. This raises the question of whether the spatial latent is strictly required for the task or whether a purely symbolic pathway could suffice, a question with implications for both cognitive theory and model design.
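The two candidate pathways can be contrasted with a toy sketch. It is not the paper's EqNR or VSM: it assumes, purely for illustration, that the "spatial latent" is an ordered chain of voxel coordinates and that the "symbolic sequence" is the list of unit steps between consecutive voxels. All names (`rotate_latent_z90`, `symbolize`, `rotate_symbols_z90`, the token vocabulary) are hypothetical. Under these assumptions, rotating the latent and then symbolizing gives the same sequence as symbolizing first and rotating the symbols.

```python
from typing import List, Tuple

Voxel = Tuple[int, int, int]

# Hypothetical symbol vocabulary: one token per unit step along an axis.
STEP_TO_TOKEN = {
    (1, 0, 0): "+x", (-1, 0, 0): "-x",
    (0, 1, 0): "+y", (0, -1, 0): "-y",
    (0, 0, 1): "+z", (0, 0, -1): "-z",
}

# A 90-degree rotation about z maps (x, y, z) to (-y, x, z),
# so each direction token is relabelled as follows.
ROTATE_Z90_TOKEN = {
    "+x": "+y", "+y": "-x", "-x": "-y", "-y": "+x",
    "+z": "+z", "-z": "-z",
}


def rotate_latent_z90(chain: List[Voxel]) -> List[Voxel]:
    """Spatial pathway (toy): rotate every voxel 90 degrees about z."""
    return [(-y, x, z) for (x, y, z) in chain]


def symbolize(chain: List[Voxel]) -> List[str]:
    """Toy stand-in for a symbolic encoder: steps between neighbours."""
    tokens = []
    for (x0, y0, z0), (x1, y1, z1) in zip(chain, chain[1:]):
        tokens.append(STEP_TO_TOKEN[(x1 - x0, y1 - y0, z1 - z0)])
    return tokens


def rotate_symbols_z90(tokens: List[str]) -> List[str]:
    """Symbolic pathway (toy): relabel tokens without touching 3D data."""
    return [ROTATE_Z90_TOKEN[t] for t in tokens]


# A small Shepard-Metzler-like chain of unit cubes (made-up example).
chain = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (2, 1, 1)]

via_latent = symbolize(rotate_latent_z90(chain))
via_symbols = rotate_symbols_z90(symbolize(chain))
print(via_latent == via_symbols)
```

The pathways agree here only because the toy encoding is exactly equivariant by construction. Whether a learned symbolic encoder admits an equally simple rotation rule is precisely what remains open.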

References

We note that in our model the rotation actions could also have been directly applied to the symbolic representations, and we do not know whether the 3D latent space is fully needed.

A Deep Learning Model of Mental Rotation Informed by Interactive VR Experiments (2512.13517 - Khazoum et al., 15 Dec 2025) in Section 6 Discussion (Symbolic or Spatial representations?)