Principled proxy model sizing for modern transformer architectures

Establish principled guidelines for selecting the architecture and parameter scale of proxy models relative to large target transformer models in language model pretraining, and extend the theoretical analysis of proxy-to-target transferability from random feature models to modern transformer architectures.

Background

The paper studies how small proxy models can reliably guide data curation decisions for large-scale LLM pretraining and provides theoretical support via random feature models trained with tiny learning rates. While these results offer initial insights into width requirements, they do not address how to choose proxy architectures or sizes in modern transformer-based settings.
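To make the random-feature setting concrete, the sketch below is a toy illustration (not the paper's setup): a narrow "proxy" and a wide "target" random-feature regressor are each fit on two hypothetical data recipes, and their validation losses are compared. The widths, data sizes, tanh features, ridge readout, and the noise-based notion of recipe quality are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_val = 32, 512, 256            # input dim, train/val sizes (assumed)
width_proxy, width_target = 64, 1024        # proxy is much narrower than the target
w_star = rng.normal(size=d) / np.sqrt(d)    # shared ground-truth direction

def make_data(n, noise):
    """Toy regression task; `noise` stands in for the quality of a data recipe."""
    X = rng.normal(size=(n, d))
    y = X @ w_star + noise * rng.normal(size=n)
    return X, y

def random_feature_val_loss(width, X_tr, y_tr, X_va, y_va, ridge=1e-2):
    """Fit only a linear readout on top of a frozen random projection."""
    W = rng.normal(size=(d, width)) / np.sqrt(d)   # fixed random features
    phi = lambda X: np.tanh(X @ W)
    F_tr, F_va = phi(X_tr), phi(X_va)
    # Closed-form ridge readout, standing in for training the readout
    # to convergence with a tiny learning rate.
    beta = np.linalg.solve(F_tr.T @ F_tr + ridge * np.eye(width), F_tr.T @ y_tr)
    return float(np.mean((F_va @ beta - y_va) ** 2))

# Two hypothetical "data recipes" differing only in label noise.
X_va, y_va = make_data(n_val, noise=0.0)
for name, noise in {"clean": 0.1, "noisy": 0.5}.items():
    X_tr, y_tr = make_data(n_train, noise)
    p = random_feature_val_loss(width_proxy, X_tr, y_tr, X_va, y_va)
    t = random_feature_val_loss(width_target, X_tr, y_tr, X_va, y_va)
    print(f"{name:5s}  proxy loss={p:.3f}  target loss={t:.3f}")
# The theoretical question is when the proxy's ranking of recipes (by
# validation loss) agrees with the target's, and how wide the proxy must
# be for that agreement to be reliable.
```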

The authors explicitly note that extending the theoretical understanding from random feature models to transformers, and deriving principled sizing rules for proxy models, remain unresolved. This gap matters for practitioners who rely on small-scale experiments to select data recipes that will generalize to large, tuned models.
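The workflow at stake can be summarized as the sketch below, which is a schematic of common proxy-model practice rather than anything prescribed by the paper; the names `train_model`, `eval_loss`, and `proxy_params` are placeholders.

```python
from typing import Callable, Dict

def select_recipe(recipes: Dict[str, object],
                  train_model: Callable[[object, int], object],
                  eval_loss: Callable[[object], float],
                  proxy_params: int = 100_000_000) -> str:
    """Return the recipe whose small proxy run achieves the lowest validation loss."""
    scores = {}
    for name, recipe in recipes.items():
        proxy = train_model(recipe, proxy_params)   # one cheap proxy training run per recipe
        scores[name] = eval_loss(proxy)             # held-out validation loss
    return min(scores, key=scores.get)

# Open problem: how large, and how architecturally faithful to the target,
# must the proxy be for this argmin to match the choice a full-scale,
# well-tuned target model would make?
```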

References

Our theoretical analysis in Section \ref{sec:method-theory} for random feature models provides initial insights into width requirements, but extending this understanding to modern transformer architectures and establishing principled guidelines for proxy model sizing remains an open challenge.

Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice (2512.24503 - Wang et al., 30 Dec 2025), Appendix, Extended Background and Related Works, Remark (Architecture and scale considerations)