Per-module hyperparameter gains and transfer

Determine whether tuning hyperparameters on a per-module basis yields significant performance gains, and whether those per-module hyperparameters empirically transfer across model scales when using a parameterisation that enables hyperparameter transfer across width and depth (for example, μP, Depth-μP, or CompleteP) in transformer architectures.

Background

The paper discusses parameterisations such as μP, Depth-μP, and CompleteP that enable transferring optimal global hyperparameters across model sizes. Building on these, the authors ask whether more granular, per-module hyperparameter tuning can provide additional benefits, and whether such hyperparameters likewise transfer under the right parameterisation when model width and depth are scaled.

This question motivates their empirical investigation into per-module hyperparameter optimisation and transfer, aiming to handle the increased dimensionality and complexity of the hyperparameter landscape while leveraging principled scaling rules.
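To make the per-module framing concrete, the sketch below shows one way such a scheme could look: each module gets its own base learning rate tuned at a small proxy width, and a μP-style rule (hidden, matrix-like parameters use a learning rate scaling as 1/width; vector-like parameters such as embeddings keep a width-invariant rate) maps those rates to a larger model. The module names, parameter kinds, and numeric values are illustrative assumptions, not the paper's actual recipe.

```python
# Hypothetical sketch of per-module learning rates under a muP-style
# width-scaling rule. All module names and constants are illustrative.

def per_module_lrs(base_lrs, base_width, width):
    """Scale each module's base LR, tuned at base_width, to a new width.

    Hidden (matrix-like) parameters follow the muP heuristic LR ~ 1/fan_in,
    so their LR shrinks by base_width / width; vector-like parameters
    (embeddings, biases, norms) keep a width-invariant LR.
    """
    scaled = {}
    for module, (lr, kind) in base_lrs.items():
        if kind == "hidden":
            scaled[module] = lr * base_width / width  # shrinks with width
        else:
            scaled[module] = lr  # width-invariant
    return scaled

# Per-module base LRs found by tuning at a small proxy width of 256.
base = {
    "embed":    (0.01, "vector"),
    "attn_qkv": (0.02, "hidden"),
    "mlp":      (0.03, "hidden"),
}

# Transfer the tuned per-module LRs to a 4x wider model.
print(per_module_lrs(base, base_width=256, width=1024))
```

Whether rates transferred this way actually remain near-optimal per module, rather than only for a single global learning rate, is exactly the empirical question the paper investigates.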

References

However, two questions remain open: Can tuning HPs on a per-module basis give significant gains, and do the per-module HPs empirically transfer with the right parameterisation?

Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration (2512.22382 - Mlodozeniec et al., 26 Dec 2025) in Section 1 (Introduction)