Efficient Training of Agglomerative Multi-Teacher VFMs

Determine whether agglomerative Vision Foundation Models, which are built via multi-teacher distillation, can be trained more efficiently within a standardized framework while preserving or improving representational quality.

Background

Agglomerative Vision Foundation Models combine complementary capabilities from multiple teacher models (e.g., self-supervised and vision–language encoders) via multi-teacher distillation. While promising, prior approaches have been computationally expensive and data-hungry, requiring complex handling of varying input resolutions and multiple loss functions.

This work proposes components such as token-balanced batching, hierarchical data curation (OpenLVD200M), and Asymmetric Relational Knowledge Distillation to improve efficiency. The broader question, whether efficient training in a standardized framework can maintain or improve representational quality, is explicitly raised as open, and motivates the study.
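To make the token-balanced batching idea concrete, here is a minimal sketch of one plausible realization: greedily packing variable-resolution samples (whose token counts differ) into batches under a fixed token budget. The function name, the `max_tokens` parameter, and the greedy packing strategy are illustrative assumptions, not the paper's actual algorithm.

```python
def token_balanced_batches(lengths, max_tokens):
    """Pack sample indices into batches so each batch's total token
    count stays under max_tokens (hypothetical budget parameter).

    Sorting by length first groups similarly sized samples together,
    which also reduces padding waste within a batch.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, current_tokens = [], [], 0
    for i in order:
        # Start a new batch when adding this sample would exceed the
        # budget; a single oversize sample still gets its own batch.
        if current and current_tokens + lengths[i] > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += lengths[i]
    if current:
        batches.append(current)
    return batches


# Example: ViT-style patch counts for images at different resolutions.
lengths = [196, 256, 576, 196, 1024, 256]
batches = token_balanced_batches(lengths, max_tokens=800)
```

The point of such a scheme is that compute per batch stays roughly constant even when input resolutions vary, instead of fixing the number of images per batch.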

References

A key open question is whether such models can be trained more efficiently in a standardized framework while preserving or even improving their representational quality.

AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model  (2512.20157 - Chaybouti et al., 23 Dec 2025) in Section 1 (Introduction)