Explain the depth-specific re-emergence of row-wise outliers near block 22 in K-projection gradients

Determine why the depth-dependent transition in gradient outlier patterns occurs specifically around block 22 in Llama3.2-3B: why do row-wise outliers re-emerge in the key (K) projection gradients after the middle layers exhibit the None pattern, mirroring the row-wise outliers already seen in early layers? Provide a mechanistic explanation relating transformer depth, attention dynamics, and gradient aggregation that accounts for this block-22 transition point.

Background

The paper analyzes outlier patterns (Row-wise, Column-wise, None) across weights, activations, and gradients in LLMs and observes that patterns are stable over training but vary across layers. In Llama3.2-3B, a depth-dependent evolution is reported: early blocks show row-wise outliers in K, V, and O projection gradients; middle blocks transition to the None pattern; and late blocks see a re-emergence of row-wise outliers specifically in the K-projection gradient.

The authors hypothesize that attention near the output concentrates on a small set of anchor tokens, creating an asymmetry between key and query gradients: each key's gradient aggregates contributions from all queries, which can produce row-wise concentration on the anchor tokens, whereas each query's gradient depends only on that query's own attention distribution and so stays comparatively uniform across rows. However, the precise reason the re-emergence occurs around block 22 remains unexplained and is identified as a target for future investigation.
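The hypothesized K/Q asymmetry can be illustrated with a minimal numpy sketch (not from the paper; the anchor index, matrix sizes, and the synthetic upstream gradient `dS` are all illustrative assumptions). For scores S = Q Kᵀ, the chain rule gives dL/dk_j = Σ_i dS[i, j] · q_i, so a key attended to by every query accumulates gradient from all of them, while dL/dq_i = Σ_j dS[i, j] · k_j draws only on row i of dS:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 32  # tokens, head dimension (illustrative sizes)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

# Hypothetical upstream gradient on the score matrix S = Q @ K.T.
# Late-layer attention concentrating on a single "anchor" token is
# modeled by making one column of dS much larger than the rest.
anchor = 5
dS = 0.01 * rng.standard_normal((n, n))
dS[:, anchor] += 1.0  # every query pushes gradient into the anchor key

dK = dS.T @ Q  # dL/dk_j = sum_i dS[i, j] * q_i  (aggregates over queries)
dQ = dS @ K    # dL/dq_i = sum_j dS[i, j] * k_j  (per-query, no aggregation)

row_norms_K = np.linalg.norm(dK, axis=1)
row_norms_Q = np.linalg.norm(dQ, axis=1)

# Row-wise outlier ratio: largest row norm relative to the median row norm.
ratio_K = row_norms_K.max() / np.median(row_norms_K)
ratio_Q = row_norms_Q.max() / np.median(row_norms_Q)
print(f"K-gradient outlier ratio: {ratio_K:.1f}")  # large: one dominant row
print(f"Q-gradient outlier ratio: {ratio_Q:.1f}")  # near 1: rows comparable
```

Under this toy setup, the K gradient's largest row sits exactly at the anchor token and dwarfs the median row, while Q gradient rows stay comparable in magnitude, which is the row-wise vs. None distinction the paper observes between K and Q projection gradients.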

References

Why this transition occurs specifically around block 22 is left for future investigation.

AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation (2604.02525 - Kim et al., 2 Apr 2026) in Appendix, Section 'Outlier Patterns of Llama3.2-3B'