Mechanistic explanation for square-geometry stabilization
Establish a mechanistic explanation for why square input geometry improves the language model’s utilization of spatial cues at the vision–language interface. In experiments, this improvement mitigates the localization collapse that occurs when detection-pretrained VMamba vision encoders are used within a LLaVA-style vision–language model.
References
We leave a deeper mechanistic analysis of why square geometry improves utilization to future work.
— Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
(2603.19209 - Kuo et al., 19 Mar 2026) in Subsection "Utilization Bottleneck Test," Section 4.5.3