Generalization of Iso-Energy structural invariants beyond dual-encoder VLMs

Determine whether the structural invariants and alignment properties identified by the Iso-Energy–aligned sparse autoencoder in dual-encoder vision–language models also hold in models that employ cross-attention mechanisms or are trained with generative objectives.

Background

The paper introduces the Iso-Energy Assumption and an aligned sparse autoencoder to analyze the geometry of vision–language embeddings in dual-encoder models such as CLIP and SigLIP. The analysis reveals a decomposition into bimodal atoms that carry cross-modal alignment and unimodal atoms that capture modality-specific information, enabling interventions like closing the modality gap and improving semantic vector arithmetic.

All experiments and findings are reported for dual-encoder architectures. Extending these structural insights to other classes of vision–language models, namely those with cross-attention or trained with generative objectives, would test the generality of the Iso-Energy framework and clarify whether similar bimodal/unimodal decompositions and alignment behaviors arise in more integrated or generative multimodal systems.

References

Finally, our experiments are limited to dual-encoder vision–language models. Whether the same structural invariants and alignment properties hold in models with cross-attention mechanisms or generative training objectives remains an open question.

Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings (2602.06218 - Dhimoïla et al., 5 Feb 2026) in Conclusion, final paragraph