Convergence Mechanism and Limit of Unimodal Representations

Determine the mechanism by which representations learned by unimodal foundation models converge across architectures, objectives, and modalities, and identify the limiting representation(s) to which these learned representations ultimately converge.

Background

The paper (Lu et al., 2026) surveys evidence that unimodal foundation models, although trained on different data and with different objectives, learn representations that converge across modalities. The dynamics of that convergence and the identity of the limiting representation nonetheless remain unsettled.

To address this gap, the authors introduce the Indra Representation Hypothesis and a category-theoretic instantiation via the V-enriched Yoneda embedding, proposing a relational profile as a candidate convergent form. However, this theoretical proposal does not resolve the broader open question of the exact convergence mechanism and the definitive convergence target across unimodal models.
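
As a rough illustration of the relational-profile idea, the Python sketch below describes each sample by its distances to every sample in the set rather than by its raw coordinates. This mirrors how a Yoneda embedding characterizes an object through its relationships to all other objects. The use of cosine distance as the enriching quantity, the toy 2-D embeddings, and the helper names (`cosine_distance`, `relational_profiles`, `rotate_2d`) are assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v).

    Assumption: used here as a stand-in for the V-valued relationship
    between two samples; the paper's exact choice is not given in this section.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def relational_profiles(embeddings):
    """Row i lists the distances from sample i to every sample in the set."""
    return [[cosine_distance(x, y) for y in embeddings] for x in embeddings]


def rotate_2d(v, theta):
    """Rotate a 2-D vector by theta radians (an orthogonal change of coordinates)."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


# Hypothetical "model A": toy embeddings for four samples.
model_a = [[1.0, 0.0], [0.6, 0.8], [-0.5, 2.0], [3.0, 1.0]]
# Hypothetical "model B": the same geometry expressed in rotated coordinates.
model_b = [rotate_2d(v, 0.7) for v in model_a]

profiles_a = relational_profiles(model_a)
profiles_b = relational_profiles(model_b)

# Largest entry-wise disagreement between the two sets of relational profiles.
max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(profiles_a, profiles_b)
    for a, b in zip(row_a, row_b)
)
```

The two toy models disagree coordinate by coordinate, yet their relational profiles coincide up to floating-point error, because cosine distance is unchanged by rotation. The sketch only demonstrates invariance to one simple reparameterization; it does not address the convergence mechanism, which remains the open question.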

References

Despite this emerging consensus, it remains unclear how these representations converge and what they ultimately converge to.

The Indra Representation Hypothesis for Multimodal Alignment (2604.04496 - Lu et al., 6 Apr 2026) in Section 2.2 Representation Convergence