Horizontal Attention Gradient Dynamics
- Horizontal Attention Gradient is the derivative of loss with respect to attention scores, clarifying how error signals redistribute across tokens.
- The mechanism reallocates attention mass from tokens with negative advantage to those with positive advantage, resembling an EM-like specialization process.
- It enables practical interventions like attention-patching to steer model updates, enhancing interpretability and aligning gradient flow with Bayesian reasoning.
Horizontal attention gradients characterize the mechanism by which transformer attention heads redistribute error signals across positions (tokens) in sequence models during backward propagation. In contrast to vertical gradients, which aggregate error signals across queries for a fixed key or value, horizontal gradients explain how a given query updates its allocation of attention across all available keys and values. These gradients offer a principled account of how attention routing, content specialization, and internal geometry evolve during optimization, connecting the mathematical details of gradient flow to statistical procedures such as the expectation-maximization (EM) algorithm and Bayesian reasoning.
1. Definitions and Fundamental Quantities
Consider a single-head attention block with input sequence length $n$ and input vectors $x_1, \dots, x_n$. Projections yield query, key, and value vectors: $q_i = W_Q x_i$, $k_j = W_K x_j$, and $v_j = W_V x_j$. The attention scores are $s_{ij} = q_i^\top k_j / \sqrt{d}$, producing attention weights via softmax: $a_{ij} = \exp(s_{ij}) / \sum_l \exp(s_{il})$. Context vectors are formed as $c_i = \sum_j a_{ij} v_j$, which feed into output logits and probabilities.
With cross-entropy loss $\mathcal{L}$, define the upstream gradient at the context as $g_i = \partial \mathcal{L} / \partial c_i$. Compatibility scores between the descent direction $-g_i$ and each value vector are $u_{ij} = -g_i^\top v_j$; their expectation under the attention distribution is $\bar{u}_i = \sum_j a_{ij} u_{ij}$.
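These forward quantities can be assembled in a few lines. The following NumPy sketch uses toy dimensions and random tensors as stand-ins for the learned projections and the backpropagated gradient (all sizes and stand-ins are assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8  # sequence length and head dimension (toy sizes, assumed)

# Stand-ins for the projected sequences q_i, k_j, v_j (random here; in a
# real model these are W_Q x_i, W_K x_j, W_V x_j).
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

S = Q @ K.T / np.sqrt(d)               # scores s_ij = q_i . k_j / sqrt(d)
A = np.exp(S - S.max(1, keepdims=True))
A /= A.sum(1, keepdims=True)           # attention weights a_ij (row softmax)
C = A @ V                              # context vectors c_i = sum_j a_ij v_j

# Stand-in upstream gradient g_i = dL/dc_i (produced by backprop in practice).
G = rng.normal(size=(n, d))

U = -G @ V.T                           # compatibility u_ij = -g_i . v_j
u_bar = (A * U).sum(1, keepdims=True)  # expectation under attention: sum_j a_ij u_ij
```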
2. Horizontal Gradient Law: Derivation and Implications
The "horizontal attention gradient" refers specifically to $\partial \mathcal{L} / \partial s_{ij}$, the gradient of the loss with respect to the attention score $s_{ij}$ for fixed query $i$ and varying key $j$. By direct differentiation through the softmax, the gradient takes the form:
$$\frac{\partial \mathcal{L}}{\partial s_{ij}} = -\,a_{ij}\left(u_{ij} - \bar{u}_i\right).$$
This formula is succinctly termed the advantage-based routing law (Aggarwal et al., 27 Dec 2025). Here, $a_{ij}$ is the attention "responsibility" that query $i$ assigns to value $j$; the advantage $u_{ij} - \bar{u}_i$ quantifies how beneficial increasing $s_{ij}$ would be relative to the backward signal; and $\bar{u}_i$ sets a local baseline for comparison.
In gradient descent, attention mass is reallocated from positions with negative advantage (compatibility below the attention-weighted average) toward those with positive advantage (compatibility above it). This results in a dynamic feedback process wherein both score allocations and value vector contents evolve jointly.
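The routing law is straightforward to verify numerically. The sketch below is a toy check under an assumed linear surrogate loss $\mathcal{L} = \langle G, C \rangle$ (so that $\partial \mathcal{L}/\partial C$ equals a fixed $G$ by construction), comparing the closed form against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 6
S = rng.normal(size=(n, n))   # raw attention scores
V = rng.normal(size=(n, d))
G = rng.normal(size=(n, d))   # stand-in upstream gradient dL/dC

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def loss(S):
    # Surrogate loss with dL/dC = G by construction: L = <G, C>.
    return float(np.sum(G * (softmax(S) @ V)))

# Advantage-based routing law: dL/ds_ij = -a_ij (u_ij - ubar_i),
# with u_ij = -g_i . v_j (alignment with the descent direction).
A = softmax(S)
U = -G @ V.T
ubar = (A * U).sum(1, keepdims=True)
analytic = -A * (U - ubar)

# Central finite differences, entry by entry.
eps = 1e-5
numeric = np.zeros_like(S)
for i in range(n):
    for j in range(n):
        Sp, Sm = S.copy(), S.copy()
        Sp[i, j] += eps
        Sm[i, j] -= eps
        numeric[i, j] = (loss(Sp) - loss(Sm)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-6)
```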
3. Reversed Attention and Gradient Flow Across Tokens
The horizontal gradient flow is further illuminated by the notion of "Reversed Attention" (Katz et al., 2024). Starting from the head output $O = AV$ (row-wise, $o_i = \sum_j a_{ij} v_j$), and denoting the upstream gradient $G = \partial \mathcal{L} / \partial O$, one obtains the gradient with respect to the attention matrix as $\partial \mathcal{L} / \partial A = G V^\top$. Backpropagating through the softmax yields, on each row $i$,
$$\frac{\partial \mathcal{L}}{\partial s_{ij}} = a_{ij}\left( (G V^\top)_{ij} - \sum_l a_{il}\, (G V^\top)_{il} \right).$$
Forming the Reversed Attention matrix $\widetilde{A}$ with entries $\widetilde{A}_{ij} = \partial \mathcal{L} / \partial s_{ij}$, this matrix captures, for every query $i$, the "horizontal" reallocation of gradient signal across key positions $j$. $\widetilde{A}$ is lower-triangular for causal models and is typically sparse, pinpointing the most influential token pairs for gradient-based adjustment.
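The structural claim under a causal mask can be checked directly. The sketch below (random tensors as stand-ins for real activations and gradients) builds the Reversed Attention matrix and confirms it inherits the lower-triangular pattern:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 4
S = rng.normal(size=(n, n))
S[np.triu_indices(n, k=1)] = -np.inf   # causal mask: query i cannot attend to j > i
V = rng.normal(size=(n, d))
G = rng.normal(size=(n, d))            # stand-in upstream gradient dL/dO

A = np.exp(S - S.max(1, keepdims=True))
A /= A.sum(1, keepdims=True)           # forward attention (lower-triangular rows)

dA = G @ V.T                           # dL/dA = G V^T
RA = A * (dA - (A * dA).sum(1, keepdims=True))  # Reversed Attention: dL/dS rowwise

# Masked positions carry zero weight, hence zero reversed-attention signal.
assert np.allclose(np.triu(RA, k=1), 0.0)
```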
4. Specialization and Positive Feedback: EM Analogy
The coupled update laws for scores (horizontal) and values (vertical) induce a specialization effect, whereby queries increasingly focus attention on the values most aligned with their error signals, and those values adapt to serve the needs of their principal queries. The process exhibits a two-timescale structure analogous to EM in mixture models (Aggarwal et al., 27 Dec 2025):
- E-step: The attention weights $a_{ij}$ act as soft responsibilities, dynamically favoring columns $j$ whose compatibility with the current query $i$ is above the attention-weighted average.
- M-step: Value vectors $v_j$ aggregate upstream gradients $g_i$, weighted by the current attention assignments $a_{ij}$, shifting toward the "centers" of their assigned queries.
Empirically, attention weights stabilize rapidly (fast E-step), locking the attention pattern early in training, while values continue drifting (slow M-step), supporting ongoing error reduction and model calibration.
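The coupled dynamics can be simulated in a few lines. The toy below is a hypothetical setup (contexts regressed onto fixed random targets under squared error, not the paper's experiment): it alternates the horizontal score update and the vertical value update, and records each step's movement so the two traces can be compared:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, steps, lr = 4, 3, 200, 0.05
S = 0.1 * rng.normal(size=(n, n))   # scores start near uniform attention
V = rng.normal(size=(n, d))
T = rng.normal(size=(n, d))         # fixed targets the contexts should match

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

losses, score_step, value_step = [], [], []
for _ in range(steps):
    A = softmax(S)
    C = A @ V
    G = C - T                        # dL/dC for L = 0.5 ||C - T||^2
    losses.append(0.5 * float(np.sum(G * G)))

    dA = G @ V.T
    dS = A * (dA - (A * dA).sum(1, keepdims=True))  # horizontal (E-step-like)
    dV = A.T @ G                                    # vertical (M-step-like)

    S -= lr * dS
    V -= lr * dV
    score_step.append(float(np.abs(dS).sum()))      # per-step attention movement
    value_step.append(float(np.abs(dV).sum()))      # per-step value movement

assert losses[-1] < losses[0]        # joint descent reduces the loss
```

Plotting `score_step` against `value_step` gives a quick, informal view of whether the attention pattern settles before the values stop drifting in this toy.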
5. Updating Values and Controlling Attention Directly
The update rule for value vectors under SGD with learning rate $\eta$ is:
$$v_j \leftarrow v_j - \eta \sum_i a_{ij}\, g_i.$$
This aggregates contributions from all queries $i$ according to their current assignment to value $j$. In practice, attention-patching exploits the interpretable structure of Reversed Attention for direct intervention (Katz et al., 2024): replacing the forward attention matrix with one steered by the Reversed Attention matrix allows external control of attention flow at inference time, without updating weights.
Experimental evidence demonstrates that such attention-patching can match or outperform few-shot prompting on in-context learning tasks, and that the magnitude and sparsity of the Reversed Attention matrix allow prioritized, fast localization of important head-token interactions.
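One lightweight way to realize such steering, sketched below under the assumption that the patch can be applied at the score level (the concrete patching interface in Katz et al. may differ), is to take a single gradient-like step along the Reversed Attention inside the forward pass:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention_forward(S, V, patch=None, alpha=0.0):
    """Attention head with optional score-level patching (hypothetical interface).

    patch: a Reversed Attention matrix (dL/dS) from a reference pass;
    alpha: patch strength. alpha = 0 recovers the ordinary forward pass.
    """
    Sp = S if patch is None else S - alpha * patch  # one descent step on scores
    return softmax(Sp) @ V

rng = np.random.default_rng(3)
n, d = 5, 4
S, V = rng.normal(size=(n, n)), rng.normal(size=(n, d))
G = rng.normal(size=(n, d))                     # stand-in upstream gradient dL/dO

A = softmax(S)
dA = G @ V.T
RA = A * (dA - (A * dA).sum(1, keepdims=True))  # Reversed Attention, reference pass

out_plain = attention_forward(S, V)
out_patched = attention_forward(S, V, patch=RA, alpha=0.5)
```

Because the patch acts on the scores before the softmax, the steered attention rows remain valid probability distributions, and no model weights are modified.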
6. Geometry, Bayesian Manifolds, and Statistical Interpretation
Horizontal attention gradients play a key role in sculpting the internal geometry of transformer representations. They drive keys and queries to span nearly orthogonal axes in key space, sharpening the competition among value vectors and yielding low-dimensional subspaces where residual errors are minimized. These subspaces correspond to posterior-entropy manifolds as predicted by Bayesian inference models (Aggarwal et al., 27 Dec 2025). Once the attention assignment stabilizes, further changes are restricted to this subspace, cementing the connection between gradient optimization and statistical reasoning.
A plausible implication is that horizontal attention gradients provide the mechanism that unifies optimization (via gradient flow), emergent geometry (via manifold formation), and functional capacity (in-context probabilistic reasoning) in transformer architectures.
7. Interpretability, Practical Analysis, and Future Directions
Horizontal attention gradients offer an interpretable, fine-grained lens on transformer dynamics. The sparsity of Reversed Attention highlights the key token pairs responsible for model updates, supporting efficient head ranking and intervention. The ability to inject Reversed Attention into forward passes enables targeted behavior modification without retraining or parameter adjustment.
Ongoing research continues to elucidate the hierarchy and interaction among horizontal and vertical gradients, the role of coupled specialization dynamics in complex sequence learning, and the correspondence between empirical manifolds and Bayesian posteriors. These results offer foundational insights for both theoretical analysis and practical engineering of interpretable, controllable attention mechanisms.