Per-module hyperparameter gains and transfer
Determine whether tuning hyperparameters separately for each module yields significant performance gains, and whether those per-module hyperparameters empirically transfer across model scales in transformer architectures when using a parameterisation that enables hyperparameter transfer across width and depth (for example, μP, Depth-μP, or CompleteP).
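To make the question concrete, the following is a minimal sketch of what "per-module hyperparameters that transfer" could look like. It is not the method of the cited paper. All names and values (`BASE_WIDTH`, `BASE_LR`, `PER_MODULE_LR_MULT`, `HIDDEN_MODULES`, `transferred_lrs`) are hypothetical. The width rule is a simplified μP-style assumption: learning rates of hidden weight matrices shrink in proportion to base width over target width, while other modules keep their base values.

```python
# Hypothetical per-module learning-rate multipliers, imagined as tuned
# on a small proxy model of width BASE_WIDTH. Values are illustrative only.
BASE_WIDTH = 256
BASE_LR = 1e-3
PER_MODULE_LR_MULT = {
    "embedding": 4.0,
    "attention": 1.0,
    "mlp": 0.5,
    "unembedding": 2.0,
}

# Assumption: only these modules hold hidden weight matrices whose
# learning rate is rescaled with width (a simplified muP-style rule).
HIDDEN_MODULES = {"attention", "mlp"}


def transferred_lrs(width):
    """Return per-module learning rates for a model of the given width.

    The per-module multipliers are reused unchanged; only the
    width-dependent factor differs between model sizes.
    """
    ratio = BASE_WIDTH / width
    lrs = {}
    for module, mult in PER_MODULE_LR_MULT.items():
        scale = ratio if module in HIDDEN_MODULES else 1.0
        lrs[module] = BASE_LR * mult * scale
    return lrs


if __name__ == "__main__":
    for width in (256, 1024):
        print(width, transferred_lrs(width))
```

Under this sketch, the open question is whether multipliers like those in `PER_MODULE_LR_MULT` improve performance over a single global learning rate, and whether values tuned at the base width remain near-optimal at larger widths and depths.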
References
However, two questions remain open: Can tuning HPs on a per-module basis give significant gains, and do the per-module HPs empirically transfer with the right parameterisation?
— Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
(2512.22382 - Mlodozeniec et al., 26 Dec 2025) in Section 1 (Introduction)