
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

Published 29 Apr 2025 in cs.LG | (2504.20966v3)

Abstract: We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick achieves 0% sink rate consistently. The softpick transformers produce hidden states with significantly lower kurtosis and create sparse attention maps. Quantized models using softpick outperform softmax on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Code: https://github.com/zaydzuhri/softpick-attention.

Summary

  • The paper introduces Softpick, a rectified softmax alternative that eliminates attention sink and massive activations by decoupling normalization from strict probabilistic constraints.
  • The paper demonstrates that Softpick produces genuinely sparse attention maps and significantly improves quantization robustness, especially at low-bit precisions.
  • The paper reveals that while Softpick matches softmax in medium-scale models, its performance at larger scales and long-context scenarios indicates a need for further tuning.

Softpick: Eliminating Attention Sink and Massive Activations with Rectified Softmax

Introduction and Motivation

Transformer architectures rely on the softmax function to normalize attention scores, which yields dense, sum-to-one attention distributions. However, recent analyses have illuminated two pathological behaviors intrinsically linked to softmax-based attention: attention sink—where heads disproportionately attend to semantically weak (often initial) tokens—and massive activations—rare but extreme hidden state outliers that degrade quantization robustness and low-precision training. "Softpick: No Attention Sink, No Massive Activations with Rectified Softmax" (2504.20966) targets both issues by proposing a simple, parameter-free rectified alternative to softmax, maintaining most of softmax's Jacobian structure but decoupling normalization from the strict probabilistic constraints.

The Softpick Function: Formulation and Rationale

Softpick is defined as:

$$\mathrm{Softpick}(\mathbf{x})_i = \frac{\mathrm{ReLU}\!\left(e^{x_i} - 1\right)}{\sum_{j=1}^{N} \left|e^{x_j} - 1\right| + \epsilon}$$

where $\epsilon$ is a small constant for numerical safety, and the formulation is numerically stabilized by shifting logits by their maximum, as in standard softmax implementations. The function introduces several key deviations from softmax:

  • Rectified Numerator: The ReLU allows sparse, exactly-zero attention scores for negative-logit positions (irrelevant tokens), in contrast to softmax's dense (positive, nonzero) scores.
  • Absolute-Value Denominator: This preserves nonzero gradient flow even for "off" positions and prevents vanishing derivatives for negative logits.
  • Non-sum-to-one Output: The normalization is no longer strictly probabilistic, eliminating the pressure that induces attention sink.

This formulation enables sparsity, richer head specialization, and removes the enforced coupling between attended and non-attended tokens' scores.
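As a concrete illustration, here is a minimal NumPy sketch of the function above. The max-shift stabilization mirrors standard softmax practice; the function name and the `eps` default are our own choices, not the paper's reference implementation:

```python
import numpy as np

def softpick(x, eps=1e-8):
    """Rectified, non-sum-to-one softmax replacement (illustrative sketch)."""
    # Shift logits by their max for numerical stability, as in softmax.
    m = np.max(x, axis=-1, keepdims=True)
    # Multiplying numerator and denominator by e^{-m} turns e^{x_i} - 1
    # into e^{x_i - m} - e^{-m}, leaving the ratio unchanged (up to eps).
    z = np.exp(x - m) - np.exp(-m)
    num = np.maximum(z, 0.0)                            # ReLU: exact zeros for x_i < 0
    den = np.abs(z).sum(axis=-1, keepdims=True) + eps   # absolute-value denominator
    return num / den
```

Any position with a non-positive logit gets weight exactly zero (since $e^{x_i} \le 1$), and because the denominator also counts the rectified-away mass, the surviving weights generally sum to less than one.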

Empirical Analysis: Training, Benchmarks, and Activation Statistics

Convergence and Training Dynamics

Both 340M and 1.8B parameter transformer models were trained from scratch using softmax and softpick for direct comparison. Training losses and gradient norms for the 340M case are nearly identical, indicating that softpick largely preserves softmax's favorable training geometry and optimizer compatibility.

Figure 1: Training loss and gradient norm during training of 340M models.

Notably, while gradient magnitudes for softpick are initially higher, they plateau smoothly and do not induce instability when using standard gradient clipping.

Downstream Task Performance

On general language modeling and reasoning tasks (ARC-e, Lambada, PIQA, SciQ, Wikitext-2), softpick and softmax achieve near-equal performance at 340M. At 1.8B, softpick underperforms on most metrics, suggesting scaling or hyperparameter effects that limit its effective capacity at larger scale (see Section: Scalability).

Quantization Robustness

Softpick enables dramatically more robust quantization, especially at 2–4 bits, across post-training quantization methods (BNB, GPTQ, HQQ).

Figure 2: Quantization results of softmax vs. softpick across model scales (2, 3, 4, & 8-bit precision).

For 340M models, softpick-quantized variants outperform softmax by up to 6 points in accuracy and avoid catastrophic loss of function at aggressive bit reductions. The improvement is less pronounced but still present at larger scales and lower bit depths, despite softpick's lower starting accuracy at 1.8B.
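The mechanism behind this robustness can be seen in a toy example: a single massive activation inflates the quantization scale, wasting precision on every other value. The absmax scheme below is purely illustrative and is not one of the paper's PTQ methods:

```python
import numpy as np

def absmax_quant_dequant(x, bits=8):
    # Symmetric absmax quantization with a single per-tensor scale.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.round(x / scale).clip(-qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
x_outlier = x.copy()
x_outlier[0] = 100.0                      # one "massive activation"

# Mean round-trip error on the non-outlier entries.
err_plain = np.abs(absmax_quant_dequant(x) - x).mean()
err_outlier = np.abs(absmax_quant_dequant(x_outlier)[1:] - x_outlier[1:]).mean()
# The outlier stretches the scale, so all other values lose precision.
```

By suppressing such outliers at the source, softpick keeps the scale tight and the per-value error low, without outlier-aware quantization machinery.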

Activation Distribution and Attention Sink Metrics

Softpick yields genuinely sparse attention maps, with >99% zeros in the 340M model and >95% in the 1.8B model, an orders-of-magnitude increase over softmax, which produces zeros only through numerical underflow.

Figure 3: (Top) Comparison between the attention maps when using softmax vs softpick and overall sink rate of the 340M models. (Bottom) Largest hidden state activation per layer of the 340M models.
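This sparsity contrast can be made concrete with a toy comparison. The Gaussian scores below are illustrative stand-ins for attention logits; the >99% figures reported above come from trained models, not from this sketch:

```python
import numpy as np

def zero_fraction(attn):
    # Fraction of attention weights that are exactly zero.
    return float((attn == 0.0).mean())

rng = np.random.default_rng(1)
scores = rng.normal(size=(8, 128, 128))                 # (heads, queries, keys)
m = scores.max(-1, keepdims=True)

# Softmax is dense: every weight is strictly positive (barring underflow).
sm = np.exp(scores - m)
sm /= sm.sum(-1, keepdims=True)

# Softpick-style rectified normalization zeroes every non-positive logit.
z = np.exp(scores - m) - np.exp(-m)
sp = np.maximum(z, 0.0) / (np.abs(z).sum(-1, keepdims=True) + 1e-8)
```

Here softmax produces no exact zeros at all, while the rectified map zeroes roughly the fraction of logits that are non-positive.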

Critically, sink rates (the proportion of heads assigning large scores to the BOS token) fall from >60% with softmax to exactly zero with softpick, eliminating one of the most persistent attention artifacts. Activation kurtosis in hidden states falls by roughly 100x, and maximum hidden activation values are reduced by an order of magnitude.

Figure 4: Attention maps of softmax and softpick 340M models on 2 different input texts. Two heads are visualized: Head 1 of Layer 11 and Head 2 of Layer 21.

Softpick attention heads are either sharply selective to relevant position patterns or completely dormant, enabling more interpretable and prunable head behavior.
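The sink-rate and dormant-head notions can be sketched as simple metrics over an attention tensor. The 0.3 threshold and these exact definitions are our illustrative choices; the paper's metric may be defined differently:

```python
import numpy as np

def sink_rate(attn, threshold=0.3):
    # attn: (heads, queries, keys). A head "sinks" when its average
    # weight on the first (BOS) key exceeds the threshold.
    bos_mass = attn[:, :, 0].mean(axis=1)       # mean BOS weight per head
    return float((bos_mass > threshold).mean())

def dormant_heads(attn):
    # Heads whose attention rows are entirely zero; this is possible
    # under softpick's rectified output but never under softmax.
    return float((attn.sum(axis=(1, 2)) == 0.0).mean())
```

For example, in a three-head toy tensor where one head puts all mass on the BOS key, one is uniform, and one is all zeros, `sink_rate` and `dormant_heads` each report 1/3.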

Theoretical Implications and Head Specialization

The softpick Jacobian retains the desirable training properties of softmax for positive-scoring positions, but its rectification and non-conservation of normalization permit heads to output null (zero) attention—with persistent gradient flow enabling dead heads to recover if needed. This mechanism aligns with new lines of research into active-dormant head specialization and pruning.

Softpick's compatibility with single-pass FlashAttention kernels further ensures its utility in high-efficiency large-scale environments.
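The single-pass compatibility rests on the fact that softpick's denominator, like softmax's, can be accumulated blockwise with the rescale-on-new-max trick of online softmax. The sketch below illustrates only the denominator accumulation under our own shifted parameterization, not the paper's FlashAttention kernel:

```python
import numpy as np

def softpick_denominator_streaming(x, block=4, eps=1e-8):
    # One pass over key blocks, maintaining a running max m and the
    # denominator d = sum_j |e^{x_j - m} - e^{-m}|. When m grows, every
    # accumulated term scales by e^{m_old - m_new}, so d is rescaled once.
    # (The numerator would use the same shift, leaving the ratio intact.)
    m = -np.inf
    d = 0.0
    for start in range(0, len(x), block):
        xb = x[start:start + block]
        m_new = max(m, xb.max())
        d = d * np.exp(m - m_new) + np.abs(np.exp(xb - m_new) - np.exp(-m_new)).sum()
        m = m_new
    return d + eps
```

Because each term of the absolute-value denominator rescales linearly under a max update, no second pass over the keys is needed.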

Scalability and Limitations

Despite its merits at moderate scale, two limitations arise:

  • Scaling to Large Models: At 1.8B, softpick's accuracy lags, driven primarily by head underutilization (a higher proportion of persistently dead heads) and weakened attention signal in long-context scenarios. Monitoring dead-head rates across training (Figure 5) suggests that larger models require either more training data or additional architectural tuning for softpick to realize its full capacity.
  • Long-Context Behavior: Softpick's sharp sparsity sometimes weakens key-value signals where only a handful of tokens are relevant, especially with increasing context length. Initial attempts at scaling the pre-activation or learning per-head normalizers have not overcome this.

    Figure 5: Percentage of dead heads under softpick for 340M and 1.8B parameter models across training steps.

Practical Implications and Future Research Directions

By eliminating activation outliers and enforcing sparsity at the attention level, softpick obviates the need for complex outlier-aware quantization and enables more aggressive, efficient hardware-friendly low-precision training. The sparsity acts as a lever for optimizing attention computations—potentially accelerating inference via sparse matrix multiplies and unlocking more assertive head/row pruning strategies.
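As an illustration of that lever, exact zeros let the attention-times-values product skip dropped entries entirely. The naive loop below is ours for clarity; real kernels would use blocked sparse formats:

```python
import numpy as np

def sparse_attn_apply(A, V):
    # A: (queries, keys) attention weights with many exact zeros.
    # V: (keys, d) value vectors. Only nonzero weights contribute,
    # so with >95% zeros most of the work disappears.
    out = np.zeros((A.shape[0], V.shape[1]), dtype=V.dtype)
    rows, cols = np.nonzero(A)
    for r, c in zip(rows, cols):
        out[r] += A[r, c] * V[c]
    return out
```

The result matches the dense product `A @ V` exactly; the saving comes from never touching the zeroed entries, which softmax's strictly positive maps cannot offer.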

Interpretability is also enhanced as legible, zero-sparse attention maps replace blurry, sink-dominated ones, facilitating model analysis and potentially improving safety audits. The softpick paradigm generalizes to vision and multimodal transformers, as massive activation artifacts are a cross-domain phenomenon.

Conclusion

Softpick provides a simple, parameter-free mechanism to eliminate two central softmax-related pathologies: attention sink and massive activations. While performance matches softmax in medium-scale settings and improves quantization robustness, scaling and long-context efficacy require further investigation. The promising properties of softpick—sparsity, interpretability, prunability, and quantization-friendliness—suggest rich opportunities for future architectural and theoretical advances in transformer-based models.
