
Query-Key Normalization in Transformers

Updated 27 November 2025
  • Query-Key Normalization is a family of techniques that normalize and scale query and key vectors to prevent gradient instabilities in attention mechanisms.
  • It modifies the attention computation by applying LayerNorm or $\ell_2$ normalization to query and key projections, yielding improved convergence and performance.
  • Empirical studies show that QK-Norm variants allow higher learning rates and reduce perplexity, enhancing results across translation, vision, and language tasks.

Query-Key Normalization (QK-Norm) encompasses a set of normalization techniques designed to improve the stability, expressivity, and controllability of attention-based neural architectures, with emphasis on transformers. QK-Norm methods intervene in the attention mechanism by explicitly normalizing, scaling, or otherwise regularizing the query and key vectors prior to the computation of attention logits. This reparameterization constrains the magnitude of attention scores, mitigates gradient instabilities, and yields significant empirical gains in LLMs, translation, vision, and linear-attention variants. The techniques appear in distinct mathematical formulations, all sharing the core principle of attenuating the effects of unconstrained growth in query/key activations and their associated dot products.

1. Mathematical Formulation and Variants

QK-Norm standardizes attention by inserting normalization operations on queries and keys immediately before or after linear projections, with several precise variants realized in the literature.

Let $X \in \mathbb{R}^{B \times T \times d}$ be the input for batch size $B$, sequence length $T$, and per-head dimension $d$. The canonical attention logits for queries $Q = XW^Q$ and keys $K = XW^K$ are

$$L_{ij} = \frac{Q_i \cdot K_j}{\sqrt{d}}$$

QK-Norm replaces $Q$ and $K$ by normalized versions $Q' = \mathrm{LN}(Q), \quad K' = \mathrm{LN}(K)$, where

$$\mathrm{LN}(v) = \gamma \odot \frac{v - \mu(v)}{\sqrt{\sigma^2(v) + \varepsilon}} + \beta$$

with learnable parameters $\gamma, \beta$ and a small $\varepsilon$ for stability. The normalized logits are then used in the softmax:

$$L'_{ij} = \frac{Q'_i \cdot K'_j}{\sqrt{d}}\,, \qquad A_{ij} = \frac{\exp(L'_{ij})}{\sum_{k=1}^{T} \exp(L'_{ik})}$$

Softmax capping (QK_norm_cap) applies a further nonlinearity to the logits prior to the softmax:

$$\mathrm{tcap}(x; c) = \tanh\!\left(\frac{x}{c}\right) \cdot c$$

QK-Norm admits structural variants:

  • QK-Norm: Pre- and post-QKV LayerNorm.
  • QKV-Norm: Single LayerNorm post-QKV linear, no pre-normalization.
  • QK_FC_Norm: QK-Norm plus normalization on Proj and FC2 layers.
  • QK-Norm+Softmax Cap (QK_norm_cap): QK-Norm with capped softmax logits (Rybakov et al., 2024).

Alternative formulations use $\ell_2$-normalization (Henry et al., 2020), RMS normalization (Anson et al., 26 Nov 2025), or strict projection to the hypersphere with per-dimension scaling (Loshchilov et al., 2024). In linear attention, norm-aware kernels decouple norm and direction so as to recover norm-dependent entropy reduction and expressive similarity functions (Meng et al., 26 Jun 2025).
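The formulation above can be sketched numerically. This is a minimal single-head, unbatched example in NumPy, with $\gamma = 1$ and $\beta = 0$ standing in for the learned LayerNorm parameters; it covers both plain QK-Norm and the tanh logit cap (QK_norm_cap).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm over the last (per-head feature) axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def qk_norm_attention(Q, K, V, cap=None):
    """Attention with LayerNorm applied to Q and K before the logits.

    Q, K, V: arrays of shape (T, d). If `cap` is given, the tanh cap
    tcap(x; c) = tanh(x / c) * c is applied to the logits (QK_norm_cap).
    """
    d = Q.shape[-1]
    gamma, beta = np.ones(d), np.zeros(d)  # learnable in a real model
    Qn = layer_norm(Q, gamma, beta)
    Kn = layer_norm(K, gamma, beta)
    logits = Qn @ Kn.T / np.sqrt(d)
    if cap is not None:
        logits = np.tanh(logits / cap) * cap   # bounds logits to (-cap, cap)
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    A = np.exp(logits)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V
```

Because Q and K are normalized before the dot product, the logits stay bounded even when the raw activations are large, which is the mechanism behind the stability results discussed below.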

2. Role in Transformer and Modern Architectures

QK-Norm directly adjusts the attention computation in pre-LN and post-LN transformer blocks. In pre-LN blocks, QK-Norm is typically introduced immediately after the Q/K projections, affecting only the Q and K channels prior to dot-product and softmax. It does not alter the value pathway, residual additions, or downstream projections (Rybakov et al., 2024). In hyperspherical or normalized transformer schemes (nGPT), normalization extends to all learnable weights, embeddings, and activations, ensuring every token update resides on or near the unit sphere (Loshchilov et al., 2024).

Linear-attention adaptations, such as NaLaFormer, address the loss of norm information in kernelized attention by explicitly introducing norm–direction separation and norm-preserving mappings in the attention mechanism. Each query and key is represented as $v = \|v\|\hat{v}$, with attention kernels $k(q, k) = f(\|q\|, \|k\|)\, g(\hat{q}, \hat{k})$ recovering norm-driven spikiness akin to softmax attention (Meng et al., 26 Jun 2025).
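The norm–direction factorization can be illustrated with a toy kernel; the specific choices of `f` and `g` below are illustrative assumptions, not NaLaFormer's actual kernels, but they show the structural split: `f` depends only on the norms (and grows with them), `g` only on the unit directions.

```python
import numpy as np

def norm_aware_kernel(q, k,
                      f=lambda nq, nk: np.exp(nq * nk / 10.0),
                      g=lambda qh, kh: np.maximum(qh @ kh, 0.0) ** 2):
    """Illustrative norm-aware similarity k(q, k) = f(||q||, ||k||) * g(q_hat, k_hat).

    f and g are toy stand-ins: f grows with the norms (norm-driven
    spikiness), g is a non-negative function of direction alone.
    """
    nq, nk = np.linalg.norm(q), np.linalg.norm(k)
    qh, kh = q / nq, k / nk
    return f(nq, nk) * g(qh, kh)
```

Scaling a query up increases its similarity to aligned keys, so the corresponding attention row becomes more peaked, which is exactly the entropy-reduction behavior that plain kernelized attention loses.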

3. Empirical Effects: Stability, Learning Rate, and Performance

QK-Norm offers significant improvements in training stability, particularly at high learning rates. By bounding the L2 or RMS norm of queries and keys, exponential growth of attention logits is arrested, mitigating divergence seen in unnormalized regimes. In controlled comparisons:

  • QK-Norm and its variants permit a $1.5\times$ higher stable learning rate relative to baseline LayerNorm architectures ($4\mathrm{e}{-2} \to 6\mathrm{e}{-2}$) (Rybakov et al., 2024).
  • Perplexity is consistently reduced; e.g., QK_norm_cap achieves a 3% lower PPL than the bf16 baseline.
  • QKV-Norm and QK_norm_cap confer nearly identical performance improvements, with the latter obtaining best-in-class results.
  • In low-resource translation (bilingual TED/IWSLT), QK-Norm boosts BLEU by an average of $+0.93$ compared to ScaleNorm/PreNorm baselines (Henry et al., 2020).
  • In vision and LLMs, norm-aware linear attention yields accuracy gains of up to $4.2\%$ (e.g., $+2.1\%$ on ImageNet-1K over PolaFormer) and reduced LLM perplexity (e.g., a $1.3$ PPL reduction on WikiText) (Meng et al., 26 Jun 2025).

A summary of empirical observations:

Variant         Max Stable LR   Perplexity   Key Outcome
bf16 baseline   6e-3            11.19        Fails at moderate LR
QK-Norm         4e-2            11.00        Stable, modest PPL gain
QKV-Norm        6e-2            10.85        Best LR, improved PPL
QK_norm_cap     6e-2            10.84        Best PPL, best stability

(Rybakov et al., 2024)

4. Limitations, Alternatives, and Scope of Applicability

QK-Norm requires full materialization (and normalization) of each query and key vector prior to the attention matrix calculation. This restriction makes QK-Norm inapplicable to approaches like Multi Latent Attention (MLA), where low-rank factorization is used and the full head-dimensional Q/K vectors are not formed at inference time (Anson et al., 26 Nov 2025). In such settings, alternative optimization-based schemes (e.g., QuacK: per-head learning rate scaling inversely proportional to complementary weight norms) are preferable. QuacK matches QK-Norm's stability in standard multi-head attention and outperforms other methods in MLA, with a computational cost approximately $10\%$ lower due to omitted normalization operations.
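The paper's exact update rule is not reproduced here; the following is a hypothetical sketch of the general idea behind such optimizer-level interventions — scaling each head's learning rate inversely with the norm of the complementary projection's weights, so that one side's growth damps the other's updates.

```python
import numpy as np

def quack_style_lr_scale(W_q_heads, W_k_heads, base_lr):
    """Sketch of per-head LR scaling in the spirit of QuacK (assumed form).

    W_q_heads, W_k_heads: lists of per-head weight matrices.
    Each head's query LR shrinks as its key weights grow, and vice versa,
    discouraging unbounded growth of the Q.K dot products.
    """
    lr_q = np.array([base_lr / max(np.linalg.norm(Wk), 1e-8)
                     for Wk in W_k_heads])
    lr_k = np.array([base_lr / max(np.linalg.norm(Wq), 1e-8)
                     for Wq in W_q_heads])
    return lr_q, lr_k
```

Unlike QK-Norm, this touches only the optimizer, so it works even when the full head-dimensional Q/K vectors are never materialized.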

A plausible implication is that QK-Norm is most efficient and effective in standard multi-head attention and any architecture where queries and keys are readily available for normalization. Its use is contraindicated in factorized, low-memory regimes.

5. Theoretical Motivation, Expressivity, and Attention Dynamics

QK-Norm methods act both as variance stabilizers and as regularizers ensuring that dot products cannot exploit arbitrary scaling or magnitude "hacking." By constraining queries and keys to norm-bounded (or unit-norm) sets, all dot products become cosine similarities, preventing "winner-take-all" collapse of the attention softmax and supporting more uniform and expressive attention distributions. In $\ell_2$-normalized approaches, the introduction of a learnable scaling parameter (e.g., $g$ or $\alpha$) allows the model to recover a complete range of attention sharpness, avoiding loss of representational power (Henry et al., 2020).
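A minimal sketch of the $\ell_2$-normalized logits with a learnable scale $g$, written as a plain NumPy function (here $g$ is passed in rather than learned):

```python
import numpy as np

def l2_norm_attention_logits(Q, K, g):
    """Logits from l2-normalized queries/keys with a learnable scale g.

    Every entry of Qh @ Kh.T is a cosine similarity in [-1, 1], so the
    logits are bounded by |g|; g restores control over softmax sharpness.
    """
    Qh = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kh = K / np.linalg.norm(K, axis=-1, keepdims=True)
    return g * (Qh @ Kh.T)
```

With small $g$ the attention distribution is near-uniform; with large $g$ it can approach one-hot, so the scale parameter spans the full range of sharpness that unnormalized dot products would otherwise provide.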

In norm-aware linear attention, the recovery of entropy control previously lost in conventional linear kernels is analytically attributed to explicit retention of norm-driven spikiness, ensuring that as q\|q\| grows, the corresponding row of attention becomes more peaked (Meng et al., 26 Jun 2025).

In normalized transformers (nGPT), normalization steps are interpreted as Riemannian "retractions"—each attention/MLP block proposes a parameter update, and the output is projected back to the hypersphere, supporting optimization with stable geometry and preventing ill-conditioning of weight matrices (Loshchilov et al., 2024).
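The retraction step can be written schematically as follows; this assumes a simple per-token interpolation with a scalar step size `alpha` (in nGPT the step sizes are learned per-dimension "eigen learning rates", which this sketch collapses into one constant).

```python
import numpy as np

def retract_to_sphere(h, update, alpha=0.1):
    """nGPT-style step (schematic): a block proposes `update`, the hidden
    state moves toward it, and the result is renormalized ("retracted")
    back to the unit sphere along the last axis.
    """
    h_new = h + alpha * (update - h)   # move toward the proposed point
    return h_new / np.linalg.norm(h_new, axis=-1, keepdims=True)
```

Because every token representation is renormalized after each block, the optimization trajectory stays on the hypersphere, which is the stable geometry the nGPT analysis relies on.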

6. Best Practices and Recommendations

Empirical results consistently favor QK-Norm or its variants for enhanced training stability and better downstream results. Recommended configurations are:

  • QKV-Norm: Place a single LayerNorm immediately after QKV, omitting pre-norm, for maximal simplicity and performance.
  • QK_norm_cap: Combine standard QK-Norm with softmax capping to address both input-magnitude and output-range instabilities.
  • For low-resource or translation tasks, retain PreNorm/LN on sublayers and embeddings; initialize scaling parameters according to training-sequence statistics (Henry et al., 2020).
  • In linear attention, use a norm–direction separable kernel (NaLaFormer-style) to avoid entropy suppression.
  • Where QK-Norm is inapplicable (MLA, factorized attention), employ optimizer-level interventions as in QuacK (Anson et al., 26 Nov 2025).

When implemented, QK-Norm and its successors improve convergence speed, permit significant increases in learning rate, and consistently deliver lower perplexity, higher BLEU, and reduced error metrics across diverse tasks and model scales.

7. Comparative Assessment and Evolution

QK-Norm stands out for its simplicity, empirical robustness, and negligible computational overhead in settings with standard attention. Competing methods such as $\sigma$Reparam, LayerScale, or direct softmax temperature manipulation either result in lower stable learning rates or fail to match the perplexity improvements of QK-Norm and its derivatives (Rybakov et al., 2024). The technique has progressively been extended from its original attention-centric setting into all layers of newer architectures (nGPT) and into kernelized linear attention with explicit norm-awareness (NaLaFormer), covering a broad array of modalities and computational regimes.

A plausible future direction is the generalization of norm-based normalization into increasingly factorized or quantized attention pathways, and the mathematical study of norm-control as a means of regularizing entire trainable networks.
