Mechanistic explanation for square-geometry stabilization

Establish a mechanistic explanation for why square input geometry improves the language model’s utilization of spatial cues at the vision–language interface. In experiments, this change mitigates localization collapse when detection-pretrained VMamba vision encoders are used within a LLaVA-style vision–language model.

Background

The paper identifies a failure mode termed localization collapse in some high-resolution detection-adapted settings. Empirically, switching from a non-square input (e.g., 1333×800) to a square input (e.g., 512×512) eliminates collapse and improves both localization and VQA for detection-pretrained VMamba checkpoints.
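To make the geometry difference concrete, the sketch below computes the visual-token grid produced by the two input shapes. The 32× total downsampling factor is an assumption (typical for a four-stage hierarchical encoder such as VMamba), not a detail confirmed by the paper:

```python
# Hedged sketch: how input geometry shapes the visual-token grid.
# stride=32 is an assumed total downsampling factor, typical of
# four-stage hierarchical encoders like VMamba; the paper's exact
# stride may differ.
import math

def token_grid(height, width, stride=32):
    """Return the (rows, cols) token grid after stride-x downsampling,
    rounding each side up to the next multiple of the stride."""
    rows = math.ceil(height / stride)
    cols = math.ceil(width / stride)
    return rows, cols

for h, w in [(800, 1333), (512, 512)]:
    rows, cols = token_grid(h, w)
    print(f"{h}x{w} input -> {rows}x{cols} grid, {rows * cols} tokens")
```

Under this assumption, the non-square 1333×800 input yields a 25×42 grid of 1,050 tokens, while the square 512×512 input yields a uniform 16×16 grid of 256 tokens.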

While the authors hypothesize that non-square geometry may hinder the LLM’s ability to interpret spatial cues from visual tokens, the underlying mechanism remains to be clarified and is deferred for future analysis.
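One hedged way to illustrate this hypothesis: when the token grid is flattened row-major before being passed to the LLM, the 1D offset between vertically adjacent tokens equals the grid's column count, so it varies with aspect ratio. The grid widths below follow the 32× downsampling assumption above and are not confirmed by the paper:

```python
# Hedged illustration of the hypothesized mechanism: in a row-major
# flattening of the token grid, vertical neighbours are num_cols
# positions apart, so the spatial "stride" the LLM must learn depends
# on input aspect ratio. Column counts (42, 16) assume 32x downsampling
# of 1333-wide and 512-wide inputs; this is an assumption, not a paper
# detail.

def flat_index(row, col, num_cols):
    """1D position of token (row, col) in a row-major flattening."""
    return row * num_cols + col

for num_cols in (42, 16):
    offset = flat_index(1, 0, num_cols) - flat_index(0, 0, num_cols)
    print(f"{num_cols}-column grid: vertical neighbours are {offset} tokens apart")
```

Under a square input, this vertical offset is fixed and equal to the horizontal extent, which may make positional regularities easier for the LLM to exploit; this remains a conjecture consistent with, but not established by, the paper.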

References

From the cited paper: “We leave a deeper mechanistic analysis of why square geometry improves utilization to future work.”

Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders  (2603.19209 - Kuo et al., 19 Mar 2026) in Subsection "Utilization Bottleneck Test," Section 4.5.3