Conjecture on the effects of the equal residual warmup schedule
Establish whether the equal residual warmup schedule in Progressive Residual Warmup, defined by α(l, t) = min(t/T, 1) for every Transformer layer l with warmup length T, primarily delays chaotic updates at initialization without enforcing any sequential dependency across layers. Further, determine whether scaling all residual branches simultaneously in this way amplifies representation and gradient noise during Transformer pretraining.
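To make the schedule concrete, the sketch below applies the equal schedule α(l, t) = min(t/T, 1) to every residual branch of a Transformer at the same time. Only the coefficient formula comes from the source; the PyTorch framing and the names ScaledResidualBlock, equal_warmup_alpha, and apply_equal_schedule are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ScaledResidualBlock(nn.Module):
    """Residual sub-block whose branch output is scaled by a warmup
    coefficient alpha (hypothetical wrapper, not the paper's code)."""

    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = 1.0  # overwritten each step by the warmup schedule

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x + alpha * f(x): alpha < 1 damps residual updates early in training
        return x + self.alpha * self.sublayer(x)


def equal_warmup_alpha(step: int, warmup_steps: int) -> float:
    """Equal schedule from the paper: alpha(l, t) = min(t / T, 1) for all layers l."""
    return min(step / warmup_steps, 1.0)


def apply_equal_schedule(blocks, step: int, warmup_steps: int) -> None:
    """Set the same coefficient on every layer at the same time; there is no
    sequential dependency across depth, which is what the conjecture questions."""
    alpha = equal_warmup_alpha(step, warmup_steps)
    for block in blocks:
        block.alpha = alpha


if __name__ == "__main__":
    # Toy stack of blocks wrapping linear sublayers (stand-ins for attention/MLP).
    blocks = nn.ModuleList(ScaledResidualBlock(nn.Linear(16, 16)) for _ in range(4))
    x = torch.randn(2, 16)
    for t in (0, 50, 100):  # warmup length T = 100
        apply_equal_schedule(blocks, step=t, warmup_steps=100)
        h = x
        for block in blocks:
            h = block(h)
        print(f"step {t}: alpha = {blocks[0].alpha}")
```

At t = 0 every block reduces to the identity map, so early updates are damped uniformly across depth rather than being released layer by layer.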
References
We conjecture that the "equal" schedule primarily delays chaotic updates at initialization, but fails to respect the sequential dependency across layers; as a result, simultaneously scaling up all residuals may amplify representation and gradient noise.
— Progressive Residual Warmup for Language Model Pretraining
(2603.05369 - Chen et al., 5 Mar 2026) in Warmup Schedule Ablation — Warmup length and interaction with layer order