
The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

This presentation dissects two persistent but puzzling phenomena in Transformer language models: massive activations—extreme outliers in select token channels—and attention sinks—tokens that mysteriously hoard attention across layers. Through mechanistic analysis and targeted ablation studies, the authors reveal these aren't functional necessities but architectural artifacts of pre-norm design. The talk traces how specific feed-forward blocks amplify outliers, how normalization collapses spike tokens into fixed directions, and how attention heads exploit this geometry to construct stable sinks. The findings illuminate a path toward more efficient, interpretable architectures free of these training-driven workarounds.
Script
Language models harbor two stubborn mysteries: certain tokens develop massive activation outliers that break quantization, while other tokens become attention sinks that hoard focus across the entire network. For years, these seemed like inseparable emergent properties of scale. This paper proves they're actually independent architectural accidents.
Massive activations show up as dramatic channel spikes in intermediate layers, often reaching values orders of magnitude larger than typical activations. Attention sinks manifest as delimiter or first-position tokens receiving far more attention than their semantic role would justify. Both wreak havoc on inference efficiency and model compression.
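As a rough illustration, here is a minimal diagnostic sketch, assuming GPT-2 from the Hugging Face transformers library as a stand-in model (the paper studies other pre-norm Transformers). It reports, per layer, the ratio of the largest absolute hidden value to the median magnitude, and the average attention mass placed on the first token; massive activations surface as huge ratios, sinks as first-token mass far above uniform.

```python
# Minimal sketch, assuming GPT-2 as a stand-in pre-norm decoder-only LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any pre-norm decoder-only LM could be substituted
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, attn_implementation="eager"  # ensure attention weights are returned
)
model.eval()

inputs = tok("Summer is warm. Winter is cold.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True, output_attentions=True)

for layer, (h, attn) in enumerate(zip(out.hidden_states[1:], out.attentions)):
    h = h[0]  # (seq_len, hidden_dim)
    # 1) spike diagnostic: largest absolute activation vs. median magnitude
    spike_ratio = (h.abs().max() / h.abs().median()).item()
    # 2) sink diagnostic: attention mass on the first token, averaged over heads/queries
    sink_mass = attn[0, :, :, 0].mean().item()
    print(f"layer {layer:2d}  max/median |act| = {spike_ratio:8.1f}   "
          f"mean attention on first token = {sink_mass:.2f}")
```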
So where do these activation monsters actually come from?
The researchers discovered a three-act drama. Early feed-forward blocks amplify certain input directions by orders of magnitude, creating spike channels. These spikes ride the residual connection through the middle of the network unchanged. Then, near the output, other blocks inject precisely matching negative spikes that cancel them out. The culprit? SwiGLU blocks operating in a near-identity regime with rank-one directional gain for delimiter tokens.
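To make that amplification mechanism concrete, here is a toy sketch, not the paper's construction: a SwiGLU feed-forward block whose weights are tiny everywhere except for a planted rank-one path that reads a hypothetical "delimiter" direction u and writes a huge value into one output channel, so tokens aligned with u leave the block with a spike that then rides the residual stream.

```python
# Toy illustration (assumed numbers, not the paper's measured weights):
# a near-identity SwiGLU block with a single rank-one amplification path.
import torch

d_model, d_ff = 64, 256
torch.manual_seed(0)

u = torch.zeros(d_model); u[0] = 1.0          # hypothetical delimiter direction
W_gate = torch.randn(d_ff, d_model) * 0.02    # tiny weights: near-identity behavior
W_up   = torch.randn(d_ff, d_model) * 0.02
W_down = torch.randn(d_model, d_ff) * 0.02

# Plant the rank-one path: one FF neuron reads direction u and writes a huge
# value into a single output channel.
W_gate[0] = 50.0 * u
W_up[0]   = 50.0 * u
W_down[:, 0] = 0.0
W_down[7, 0] = 100.0                          # spike lands in channel 7

def swiglu_block(x):
    hidden = torch.nn.functional.silu(x @ W_gate.T) * (x @ W_up.T)
    return x + hidden @ W_down.T              # residual connection carries the spike onward

delimiter = u.clone()                         # token aligned with u
ordinary = torch.randn(d_model) * 0.1
ordinary[0] = 0.0                             # orthogonal to u for a clean contrast

print("ordinary token max |act| :", swiglu_block(ordinary).abs().max().item())
print("delimiter token max |act|:", swiglu_block(delimiter).abs().max().item())
```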
This visualization reveals the geometric heart of attention sinks. On the left, a sink head's query and key vectors show tight clustering: spike token keys occupy an isolated region, and queries reliably project near them, creating stable, large logit advantages. The non-sink head on the right shows no such structure. The key insight is that normalization after massive activations collapses spike tokens into nearly identical directions, giving attention heads a fixed target to exploit.
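A small numerical sketch with assumed toy magnitudes (not measured model statistics) shows the same geometry: once one channel carries a massive activation, RMS-style normalization is dominated by that channel, so every spike-carrying token lands on nearly the same unit direction, and a key aligned with that direction hands any query a large, stable logit advantage over ordinary tokens.

```python
# Toy geometry sketch: normalization collapses spike tokens onto one direction.
import torch

d_model = 64
torch.manual_seed(0)

def rms_norm(x):
    return x / x.pow(2).mean(dim=-1, keepdim=True).sqrt()

# Five different tokens, all carrying a ~1000x spike in channel 7
# on top of unrelated "semantic" content.
spike_tokens = torch.randn(5, d_model)
spike_tokens[:, 7] += 1000.0
normal_tokens = torch.randn(5, d_model)

normed_spikes = rms_norm(spike_tokens)
normed_normal = rms_norm(normal_tokens)

cos = torch.nn.functional.cosine_similarity
print("pairwise cos, spike tokens :", cos(normed_spikes[0], normed_spikes[1:], dim=-1))
print("pairwise cos, normal tokens:", cos(normed_normal[0], normed_normal[1:], dim=-1))

# A key direction aimed at the shared spike direction yields a much larger
# logit on spike tokens than on ordinary tokens.
key_dir = torch.zeros(d_model); key_dir[7] = 1.0
print("logit vs spike token :", (normed_spikes[0] @ key_dir).item())
print("logit vs normal token:", (normed_normal[0] @ key_dir).item())
```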
The ablation studies deliver the decisive blow: you can eliminate one phenomenon without touching the other. Adding post-block normalization kills massive activations entirely but leaves sinks intact. Introducing learnable gating in attention removes the need for sinks but doesn't affect spike formation. This proves they're architecturally linked but functionally independent, not two sides of the same coin.
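Both interventions are easy to sketch in simplified form; the exact parameterizations in the paper may differ. The snippet below keeps them separate, matching the point that each one can be applied on its own: a post-block normalization wrapper that bounds any spike channel, and an attention module with a learned sigmoid gate so a head can output roughly zero instead of parking its probability mass on a sink token.

```python
# Simplified sketches of the two independent interventions (assumed forms).
import torch
import torch.nn as nn

class PostBlockNorm(nn.Module):
    """Ablation 1: re-normalize the residual stream after each block's update,
    so no channel can retain an unboundedly large spike."""
    def __init__(self, block: nn.Module, d_model: int):
        super().__init__()
        self.block = block
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.block(x))

class GatedSelfAttention(nn.Module):
    """Ablation 2: a learned sigmoid gate on the attention output, letting a
    head emit ~zero instead of dumping attention on a sink token."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return torch.sigmoid(self.gate(x)) * out   # gate near 0 => head effectively off

x = torch.randn(1, 16, 64)                                  # (batch, seq_len, d_model)
print(GatedSelfAttention(d_model=64, n_heads=4)(x).shape)   # gating alone
print(PostBlockNorm(nn.Linear(64, 64), d_model=64)(x).shape)  # post-norm alone
```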
What looked like fundamental emergent properties turn out to be workarounds the model learned because the architecture left it no better option. Understanding this distinction opens the door to Transformers that are simultaneously more efficient, more interpretable, and freed from these training-driven crutches. Visit EmergentMind.com to explore this research further and create your own videos.