Principled proxy model sizing for modern transformer architectures
Establish principled guidelines for choosing the architecture and parameter scale of proxy models relative to the large target transformer models used in language model pretraining, and extend the theoretical analysis of proxy-to-target transferability from random feature models to modern transformer architectures.
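To make the random-feature setting concrete, the following sketch (an illustrative assumption, not the paper's method) fits a narrow "proxy" and a wide "target" random feature model on the same synthetic data and checks how well the proxy's per-example losses rank-correlate with the target's. The widths, ridge penalty, and data-generating rule are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_features(X, width, rng):
    """Random ReLU features: phi(x) = max(0, W x) with Gaussian W."""
    W = rng.normal(size=(width, X.shape[1])) / np.sqrt(X.shape[1])
    return np.maximum(0.0, X @ W.T)

def per_example_loss(X, y, width, ridge, rng):
    """Fit ridge regression on random features; return squared errors."""
    Phi = random_features(X, width, rng)
    A = Phi.T @ Phi + ridge * np.eye(width)
    w = np.linalg.solve(A, Phi.T @ y)
    return (Phi @ w - y) ** 2

def spearman(a, b):
    """Spearman rank correlation via ranks + Pearson correlation."""
    ra, rb = a.argsort().argsort(), b.argsort().argsort()
    return np.corrcoef(ra, rb)[0, 1]

# Synthetic task: targets from a hidden linear rule plus noise.
n, d = 500, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# "Proxy" (narrow) vs "target" (wide) random feature models.
loss_proxy = per_example_loss(X, y, width=32, ridge=1e-2, rng=rng)
loss_target = per_example_loss(X, y, width=512, ridge=1e-2, rng=rng)

print(f"rank correlation of per-example losses: "
      f"{spearman(loss_proxy, loss_target):.3f}")
```

The open question posed above is, in effect, how large `width` must be (and what architecture the feature map should have) before rankings like this transfer reliably to a transformer-scale target.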
References
Our theoretical analysis in Section \ref{sec:method-theory} for random feature models provides initial insights into width requirements, but extending this understanding to modern transformer architectures and establishing principled guidelines for proxy model sizing remains an open challenge.
— Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice
(2512.24503 - Wang et al., 30 Dec 2025) in Appendix, Extended Background and Related Works, Remark (Architecture and scale considerations)