Conjecture on the effects of the equal residual warmup schedule

Establish whether the equal residual warmup schedule in Progressive Residual Warmup, defined by α(l, t) = min(t/T, 1) for every Transformer layer l with warmup length T, merely delays chaotic updates at initialization without enforcing a sequential dependency across layers, and determine whether this simultaneous scaling of all residual branches amplifies representation and gradient noise during Transformer pretraining.

Background

To coordinate layerwise learning in Transformers, the paper proposes Progressive Residual Warmup (ProRes), which multiplies each layer’s residual by a time- and layer-dependent scalar α(l, t). Several schedules are considered, including a linear, layer-progressive schedule and an equal schedule that applies the same α(t) to all layers simultaneously.
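The two schedule families can be sketched as simple scalar functions of the training step. The equal schedule below follows the definition given above; the layer-progressive variant, which staggers each layer's warmup by its depth index, is an illustrative assumption rather than the paper's exact parameterization.

```python
def alpha_equal(l, t, T):
    """Equal schedule: every layer l shares the same min(t/T, 1),
    so all residual branches ramp up simultaneously."""
    return min(t / T, 1.0)


def alpha_progressive(l, t, T):
    """Hypothetical layer-progressive linear schedule (an assumed form):
    layer l begins its warmup only at step l*T, so deeper layers ramp
    later, respecting the sequential order of the stack."""
    start = l * T  # assumed stagger: layer l waits for layer l-1 to finish
    return min(max((t - start) / T, 0.0), 1.0)
```

Under the assumed progressive form, at step t = T/2 layer 0 is half warmed up while layer 1 has not started, whereas the equal schedule gives both layers the same value.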

Ablation experiments show that the equal schedule is sensitive to warmup length and can degrade performance or even cause divergence, motivating a conjectured mechanism. The authors hypothesize that the equal schedule may only postpone early instability without respecting the sequential dependency of stacked layers, potentially amplifying noise when all residuals are scaled up together.
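Applying the scalar to each residual branch amounts to the update x ← x + α(l, t)·f_l(x). The toy stack below is a minimal numerical sketch (plain NumPy linear branches, not the paper's Transformer blocks) illustrating that under the equal schedule the stack is the identity map at t = 0 and that every branch switches on at full strength at the same step t = T.

```python
import numpy as np

def forward(x, branches, t, T):
    """Residual stack with the equal warmup schedule: each layer's
    residual contribution is scaled by the same alpha(t) = min(t/T, 1)."""
    alpha = min(t / T, 1.0)
    for f in branches:
        x = x + alpha * f(x)  # all branches share one alpha at every step
    return x

rng = np.random.default_rng(0)
# Six toy linear "residual branches" with small random weights (assumed stand-ins).
branches = [lambda x, W=rng.standard_normal((4, 4)) / 4: x @ W for _ in range(6)]
x0 = rng.standard_normal(4)

# At t = 0 every residual is zeroed, so the stack reduces to the identity.
out_start = forward(x0, branches, t=0, T=100)
# At t >= T all six branches contribute at full strength simultaneously.
out_full = forward(x0, branches, t=100, T=100)
```

The simultaneous switch-on is the behavior the conjecture targets: no layer's output stabilizes before the layers above it start consuming it.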

References

We conjecture that the "equal" schedule primarily delays chaotic updates at initialization, but fails to respect the sequential dependency across layers; as a result, simultaneously scaling up all residuals may amplify representation and gradient noise.

Progressive Residual Warmup for Language Model Pretraining  (2603.05369 - Chen et al., 5 Mar 2026) in Warmup Schedule Ablation — Warmup length and interaction with layer order