On Vanishing Variance in Transformer Length Generalization

Published 3 Apr 2025 in cs.LG and cs.AI | (2504.02827v1)

Abstract: It is a widely known issue that Transformers, when trained on shorter sequences, fail to generalize robustly to longer ones at test time. This raises the question of whether Transformer models are real reasoning engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a vanishing variance perspective on this issue. To the best of our knowledge, we are the first to demonstrate that even for today's frontier models, a longer sequence length results in a decrease in variance in the output of the multi-head attention modules. On the argmax retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention outputs leads to significantly better length generalization. Our analyses attribute this improvement to a reduction, though not a complete elimination, of the distribution shift caused by vanishing variance.


Summary

The paper’s main contribution is the formal proof that attention output variance decays with increasing sequence length, leading to significant distribution shifts.
It shows that applying layer normalization immediately after attention outputs effectively mitigates these shifts, stabilizing model performance on longer sequences.
Empirical results on algorithmic tasks confirm that normalization interventions enhance out-of-distribution accuracy, even when training occurs on much shorter sequences.

On Vanishing Variance and Transformer Length Generalization

Introduction

The paper "On Vanishing Variance in Transformer Length Generalization" (2504.02827) addresses a persistent limitation of Transformer architectures: they do not generalize robustly to sequences longer than those seen during training. Despite their central role in deep learning across domains, Transformers often fail at length generalization, which calls into question how far their apparent reasoning abilities extend. This work gives theoretical and empirical evidence that the variance of attention outputs vanishes as sequence length increases, and it analyzes how well normalization strategies mitigate the resulting distribution shift and generalization failure.

Vanishing Variance in Attention Outputs

The central theoretical contribution is a formal proof that, under reasonable assumptions (i.i.d. sequence elements and zero-mean value projections), for a fixed query and output dimension the variance across the attention outputs decays to zero as sequence length tends to infinity. This vanishing variance leads to a distribution shift for the attention outputs when Transformers are exposed to longer sequences at inference time—a scenario typical in algorithmic and language modeling settings.
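
As a rough intuition for the rate, consider a simplified setting that is not the paper's proof. Assume, for illustration only, that the attention weights $a_1, \dots, a_N$ (nonnegative, summing to one) are independent of the values, and that a fixed component $v_1, \dots, v_N$ of the value vectors is i.i.d. with zero mean and variance $\sigma_v^2$. The symbols $a_i$, $v_i$, $o$, and $\sigma_v^2$ are notation introduced here, not taken from the paper. The corresponding output component then satisfies

$$
o = \sum_{i=1}^{N} a_i v_i, \qquad
\mathbb{E}[o] = 0, \qquad
\operatorname{Var}(o) = \sigma_v^2 \, \mathbb{E}\!\left[\sum_{i=1}^{N} a_i^2\right].
$$

With uniform weights $a_i = 1/N$ this gives $\operatorname{Var}(o) = \sigma_v^2 / N$, i.e. a standard deviation proportional to $N^{-0.5}$. More generally, since $\sum_i a_i^2 \le \max_i a_i$, the variance shrinks whenever no single weight dominates.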

Empirical experiments confirm that this decay is not merely a theoretical artifact but persists in frontier models such as Llama-3.2-1B, even when assumptions like strict independence are relaxed (e.g., in real text with positional encodings).

Figure 1: Standard deviation of a fixed component in attention outputs from the first layer of Llama-3.2-1B as a function of sequence length, illustrating $\sigma \propto N^{-0.5}$ scaling in the i.i.d. token regime and a clear downward trend with position encodings and real sentences.
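
The trend in Figure 1 can be imitated with a small Monte Carlo sketch. This is a toy setup, not the paper's experiment and not Llama-3.2-1B: it assumes attention logits and value components are independent standard normal draws, and the function names and the `logit_scale` parameter are hypothetical.

```python
import math
import random


def attention_output_component(n, rng, logit_scale=1.0):
    """Return one scalar component of a softmax-attention output over n tokens.

    Assumption (illustrative only): logits and values are independent
    Gaussian draws, so the weights are independent of the values.
    """
    logits = [rng.gauss(0.0, logit_scale) for _ in range(n)]
    values = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total


def empirical_std(n, trials=1000, seed=0):
    """Estimate the standard deviation of the output component at length n."""
    rng = random.Random(seed)
    samples = [attention_output_component(n, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / (trials - 1)
    return math.sqrt(var)


if __name__ == "__main__":
    # Print the estimated standard deviation for increasing sequence lengths.
    for n in (16, 64, 256, 1024):
        print(n, round(empirical_std(n), 4))
```

Under these assumptions the printed standard deviation should shrink by roughly a factor of two each time the length is multiplied by four, which is the $N^{-0.5}$ behavior the caption describes for the i.i.d. regime.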

By analyzing the variance of attention outputs across several input regimes (i.i.d. tokens, random tokens with and without positional encoding, and natural text), the study shows a consistent decrease in output variance as sequence length grows, which the paper links to sharply reduced accuracy on long sequences.

Distribution Shift and Layer Normalization

This vanishing variance translates concretely into a significant distribution shift in the activations fed into downstream MLP layers. For models trained on short sequences, attention output features become sharply concentrated around their mean as input length increases, reducing their informativeness and hampering the inductive bias needed for extrapolation.

Figure 2: Histogram of five attention output feature components across batches, visualizing concentration and variance reduction as sequence length grows.

The global mean and global variance of the feature vectors also exhibit a pronounced drift and decay, respectively, as sequence length deviates from the model's training regime.

Figure 3: (Left) Drift in global mean normalized by training variance with increasing sequence length; (Right) decay in global variance, both showing the stabilizing effect of layer normalization.

The analysis demonstrates that standardization and, in particular, layer normalization applied immediately after the attention outputs stabilize not only the per-component statistics but also the global distributional characteristics of the attention output vectors. This reduces the distribution shift, though it does not eliminate it, and keeps the inputs fed to the MLP closer to the distribution seen during training.
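
The following is a minimal sketch of that idea, not the paper's implementation. It assumes uniform attention weights (so the attention output is the mean of the value vectors) and i.i.d. standard normal value components. The helper names are hypothetical; `gamma` and `beta` stand in for the learnable scale and shift, and leaving them unset corresponds to plain non-learnable standardization.

```python
import math
import random


def mean_pooled_attention_output(n, d, rng):
    """Attention output under uniform weights: the mean of n value vectors.

    Assumption (illustrative only): value components are i.i.d. N(0, 1).
    """
    out = [0.0] * d
    for _ in range(n):
        for j in range(d):
            out[j] += rng.gauss(0.0, 1.0)
    return [x / n for x in out]


def feature_variance(x):
    """Variance across the feature components of a single vector."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Standardize one feature vector, then apply an optional scale and shift."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (16, 256, 1024):
        raw = mean_pooled_attention_output(n, 64, rng)
        # Raw variance shrinks with n; normalized variance stays near 1.
        print(n, feature_variance(raw), feature_variance(layer_norm(raw)))
```

In this toy setting the raw feature variance falls roughly as $1/N$, while the normalized vector keeps a variance close to one at every length. Only the first two moments are matched, which is consistent with the shift being reduced but not removed.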

Experimental Validation and Ablation

On the $\operatorname{argmax}$ retrieval and dictionary lookup tasks, selected for their order invariance and algorithmic nature, layer normalization consistently improves out-of-distribution (OOD) generalization to sequence lengths up to $2^{14}$, even when the in-distribution training length is only $2^4$ or $2^8$.

Furthermore, the gains from normalization are largely independent of test-time adaptation or architectural changes to the softmax function, highlighting the orthogonality and general utility of this intervention.

Figure 4: Heatmap showing that layer normalization mitigates attention weight dispersion as sequence length increases, addressing the issue outlined by prior work.
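
The dispersion effect mentioned in the caption can be illustrated on its own with a short sketch. It does not reproduce Figure 4 or the effect of layer normalization. It only shows that when logits are bounded, the largest softmax weight must shrink as the sequence grows. The function name, the uniform logit distribution, and the `bound` parameter are assumptions made for illustration.

```python
import math
import random


def max_softmax_weight(n, rng, bound=2.0):
    """Largest softmax weight over n logits drawn uniformly from [-bound, bound]."""
    logits = [rng.uniform(-bound, bound) for _ in range(n)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    # The max logit contributes exp(0) = 1 to the numerator.
    return 1.0 / sum(exps)


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (16, 256, 4096):
        print(n, max_softmax_weight(n, rng))
```

Because every other logit is at most `2 * bound` below the maximum, the largest weight is at most `1 / (1 + (n - 1) * exp(-2 * bound))`, which tends to zero as `n` grows.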

Ablation studies reveal that even non-learnable standardization confers meaningful improvements over the baseline, though layer normalization (with learnable scale and shift) consistently yields the best performance in both tasks, as measured by accuracy and supported by strong statistical evidence (significant $p$-values reported).

Implications and Future Directions

The findings clarify why even cutting-edge Transformers exhibit brittleness in length generalization despite sophisticated architectures and positional encodings. Normalization applied immediately after attention outputs, although not a panacea, addresses a core limitation: the distributional collapse of attention outputs as sequence length increases.

In practical terms, the results suggest revisiting normalization placement within existing and future model designs, especially for extrapolative use cases such as algorithmic reasoning, mathematical problem solving, or code synthesis. On the theoretical side, they point to a need for architectures that are provably invariant, or at least robust, to input length.

Open research directions include extending the analysis to deeper Transformer stacks with multi-head attention, validating on realistic benchmarks such as CLRS, and developing new normalization or architectural mechanisms that fully eliminate vanishing variance or otherwise guarantee invariance to input length.

Conclusion

This paper rigorously identifies, theoretically analyzes, and empirically demonstrates the vanishing variance phenomenon in Transformer attention outputs as a root cause of poor sequence length generalization. Normalization applied directly after attention effectively mitigates—though does not eradicate—the associated distribution shift and improves OOD performance on algorithmic tasks, offering actionable insight for Transformer design and future research on robust extrapolation.