Explain the depth-specific re-emergence of row-wise outliers near block 22 in K-projection gradients
Determine why the depth-dependent transition in gradient outlier patterns occurs specifically around block 22 in Llama3.2-3B: early layers show row-wise outliers in the key (K) projection gradients, middle layers exhibit the None pattern, and row-wise outliers then re-emerge near block 22. Provide a mechanistic explanation, relating transformer depth, attention dynamics, and gradient aggregation, that accounts for this block-22 transition point.
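Any empirical investigation of this question needs a per-block diagnostic that labels a gradient matrix as exhibiting a row-wise outlier pattern or no pattern. The paper's exact classification criterion is not reproduced here; the sketch below uses a hypothetical heuristic (a row's L2 norm dominating the median row norm by a threshold ratio) purely to illustrate how such a depth-wise scan could be set up.

```python
import numpy as np

def classify_outlier_pattern(grad, ratio_thresh=8.0):
    """Label a gradient matrix 'row-wise' if some row's L2 norm exceeds
    the median row norm by ratio_thresh, else 'none'.
    Heuristic sketch only; not the paper's criterion."""
    row_norms = np.linalg.norm(grad, axis=1)
    med = np.median(row_norms)
    if med == 0.0:
        return "none"
    return "row-wise" if row_norms.max() / med >= ratio_thresh else "none"

rng = np.random.default_rng(0)
flat = rng.normal(size=(64, 64))   # i.i.d. Gaussian: no dominant row
spiky = flat.copy()
spiky[5] *= 100.0                  # inject one dominant row

print(classify_outlier_pattern(flat))   # -> none
print(classify_outlier_pattern(spiky))  # -> row-wise
```

Running such a classifier over the K-projection gradients of every block would reproduce the depth profile in question (row-wise, then None, then row-wise again near block 22), giving a concrete starting point for testing mechanistic hypotheses.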
References
Why this transition occurs specifically around block 22 is left for future investigation.
— AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation
(2604.02525 - Kim et al., 2 Apr 2026) in Appendix, Section 'Outlier Patterns of Llama3.2-3B'