
Human-Interpretable QK Subspaces

Updated 9 February 2026
  • Human-interpretable QK subspaces are low-dimensional subspaces of a transformer's query-key (QK) space that capture specific semantic and structural features.
  • They are derived using contrastive covariance and SVD, which isolate key attention patterns by decomposing bilinear interactions into interpretable axes.
  • Empirical evaluations show these subspaces explain substantial attention mass in models, enabling clearer attribution of token interactions and causal analysis.

Human-interpretable QK subspaces are low-dimensional subspaces within a transformer's attention head query-key (QK) space that correspond to specific, human-meaningful semantic or structural features. The systematic extraction and analysis of these subspaces enable precise attribution of attention to interpretable interactions between queries and keys. This approach, introduced via the contrastive covariance framework, decomposes high-dimensional QK bilinear forms into a concise set of directions each associated with features such as category membership or token binding, significantly advancing the interpretability of transformer attention mechanisms (Lee et al., 4 Feb 2026).

1. The Query-Key Space in Transformers

In transformer architectures, attention heads compute affinity scores between token representations based on query ($Q$) and key ($K$) vectors. For a sequence of $n$ tokens, $Q \in \mathbb{R}^{n \times d}$ and $K \in \mathbb{R}^{n \times d}$, with $d$ denoting the head dimension. The unnormalized attention logit from token $i$ to $j$ is

$$\ell_{ij} = Q_i^\top K_j$$

and the scaled attention score is

$$a_{ij} = \frac{\ell_{ij}}{\sqrt{d}}.$$

The QK space is thus a bilinear joint embedding space in which semantic or structural relationships are encoded as inner products between $Q$ and $K$. Traditional interpretability efforts have struggled to isolate which features in this space drive particular attention patterns; the identification of human-interpretable QK subspaces directly addresses this challenge.
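As a concrete check of these definitions, a minimal NumPy sketch of the logit and scaled-score computation (toy shapes, random data; all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 16                  # toy sequence length and head dimension
Q = rng.normal(size=(n, d))   # query vectors, one row per token
K = rng.normal(size=(n, d))   # key vectors, one row per token

logits = Q @ K.T              # unnormalized logits: logits[i, j] = Q_i . K_j
scores = logits / np.sqrt(d)  # scaled attention scores a_ij

# softmax over keys turns scores into the usual attention weights
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
```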

2. Contrastive Covariance: Isolating Feature Interactions

The contrastive covariance approach isolates the contribution of specific features to QK interactions by constructing positive and negative (query, key) pairs based on a binary feature criterion $F(i, j) \in \{+1, -1\}$ that encodes the presence or absence of a shared feature between token pairs. For example, in a toy model $F(i,j)=+1$ may indicate matched latent variables; in a categorical filter head, $F(i,j)=+1$ signals that token $j$ belongs to the query category in prompt $i$.

The central mathematical objects are the positive covariance

$$\Sigma^+ = \mathrm{Cov}_{(i,j) \in P}[Q_i, K_j] = \mathbb{E}_{(i,j)\in P}\big[(Q_i-\mu_Q)(K_j-\mu_K)^\top\big]$$

and negative covariance

$$\Sigma^- = \mathrm{Cov}_{(i,j) \in N}[Q_i, K_j] = \mathbb{E}_{(i,j)\in N}\big[(Q_i-\mu_Q)(K_j-\mu_K)^\top\big]$$

where $\mu_Q$ and $\mu_K$ are the global means over all (query, key) pairs. The contrastive covariance matrix is the difference

$$\Delta\Sigma = \Sigma^+ - \Sigma^-$$

which algebraically isolates the bilinear interaction in QK space due to the feature of interest (Lee et al., 4 Feb 2026).
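Under the definitions above, a minimal NumPy sketch of the two covariances and their difference (toy shapes; the function name and the dense label matrix `F` are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def contrastive_covariance(Q, K, F):
    """Q: (n, d) queries, K: (m, d) keys, F: (n, m) labels in {+1, -1}.

    Returns Sigma_plus - Sigma_minus, where each Sigma is the mean of
    (Q_i - mu_Q)(K_j - mu_K)^T over the labeled pairs and the means
    mu_Q, mu_K are taken globally over all pairs.
    """
    Qc = Q - Q.mean(axis=0)
    Kc = K - K.mean(axis=0)
    pos = (F == +1).astype(float)
    neg = (F == -1).astype(float)
    sigma_plus = np.einsum('id,jk,ij->dk', Qc, Kc, pos) / pos.sum()
    sigma_minus = np.einsum('id,jk,ij->dk', Qc, Kc, neg) / neg.sum()
    return sigma_plus - sigma_minus

# toy usage: random data and random labels
rng = np.random.default_rng(1)
Q = rng.normal(size=(8, 4))
K = rng.normal(size=(10, 4))
F = np.where(rng.random((8, 10)) < 0.5, 1, -1)
delta_sigma = contrastive_covariance(Q, K, F)
```

Note that flipping every label exchanges the roles of $\Sigma^+$ and $\Sigma^-$, so the result simply changes sign, which gives a cheap sanity check.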

3. Low-Rank Decomposition and Subspace Construction

A singular value decomposition (SVD) of the contrastive covariance matrix yields

$$\Delta\Sigma = U\Lambda V^\top$$

where $U$ and $V$ contain the left (query) and right (key) singular vectors, and $\Lambda$ the singular values. Selecting the top $r$ directions according to the criterion

$$\frac{\sum_{k=1}^r \lambda_k^2}{\|\Delta\Sigma\|_F^2} \geq \tau$$

(e.g. $\tau = 0.99$) produces a low-dimensional subspace spanned by $\{u_1,\ldots,u_r\}$ in query space and $\{v_1,\ldots,v_r\}$ in key space. This subspace captures nearly all variance in QK interactions induced by the chosen feature.
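The rank-selection rule can be sketched as follows (names illustrative; the toy matrix is constructed to have known singular values 3 and 2):

```python
import numpy as np

def top_subspace(delta_sigma, tau=0.99):
    """SVD of delta_sigma; keep the smallest rank r whose squared singular
    values cover a fraction tau of ||delta_sigma||_F^2 = sum_k lambda_k^2."""
    U, lam, Vt = np.linalg.svd(delta_sigma)
    energy = np.cumsum(lam**2) / np.sum(lam**2)
    r = int(np.searchsorted(energy, tau)) + 1
    return U[:, :r], lam[:r], Vt[:r].T   # query axes, weights, key axes

# toy usage: an exactly rank-2 matrix with singular values 3 and 2
d = 8
e = np.eye(d)
A = 3.0 * np.outer(e[0], e[1]) + 2.0 * np.outer(e[1], e[0])
U_r, lam_r, V_r = top_subspace(A)
```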

The attention logit can then be decomposed as

$$\ell_{ij} \approx \sum_{k=1}^r \lambda_k (Q_i \cdot u_k)(v_k \cdot K_j)$$

with an analogous decomposition for the scaled score $a_{ij}$. Each term attributes attention to a specific interpretable axis.

4. Attribution of Attention and Empirical Evaluation

Per-component attention can be defined for each $k$ as

$$a_{ij}^{(k)} = \frac{1}{\sqrt{d}} \lambda_k (Q_i \cdot u_k)(K_j \cdot v_k)$$

and normalized as

$$\rho_{ij}^{(k)} = \frac{a_{ij}^{(k)}}{\sum_{\ell=1}^r a_{ij}^{(\ell)}}$$

to yield the fraction of attention mass attributable to each feature subspace.

Empirical evaluation in both toy models and LLMs demonstrates that these subspaces correspond to semantically meaningful structures and explain a substantial portion of the attention. In Llama 3.1-8B-Instruct, filter heads with categorical subspaces (rank 5) explain on average 68% of attention mass on category tokens (compared to a 10% baseline), and binding heads with order-ID and lexical-ID subspaces (ranks $\approx 3$ and $\approx 9$, respectively) explain 42% and 51%, with combined subspaces accounting for 85% (Table 1).

Head type       Subspace      Rank   % explained
Filter Head     categorical    5     68%
Binding Head    order-ID       3     42%
Binding Head    lexical-ID     9     51%
Binding Head    combined      12     85%

Causal interventions involving coordinate swaps in the recovered subspaces shift attention mass strongly (almost entirely in toy heads, substantially in real heads), confirming the causal relevance of these low-rank directions (Lee et al., 4 Feb 2026).
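A minimal version of such a coordinate-swap intervention can be sketched as follows, assuming an orthonormal key basis `V_r` (illustrative; the paper's exact intervention protocol may differ): swap two keys' coordinates inside $\mathrm{span}(V_r)$ while leaving their orthogonal complements fixed.

```python
import numpy as np

def swap_subspace_coords(K, V_r, j1, j2):
    """Swap the components of keys j1 and j2 lying in span(V_r),
    leaving their orthogonal complements untouched."""
    P = V_r @ V_r.T               # projector onto the key subspace
    K_new = K.copy()
    K_new[j1] = K[j1] - K[j1] @ P + K[j2] @ P
    K_new[j2] = K[j2] - K[j2] @ P + K[j1] @ P
    return K_new

# toy usage: swap keys 0 and 3 inside a rank-2 subspace
rng = np.random.default_rng(4)
d, r = 16, 2
K = rng.normal(size=(6, d))
V_r, _ = np.linalg.qr(rng.normal(size=(d, r)))   # orthonormal key axes
K_swapped = swap_subspace_coords(K, V_r, 0, 3)
```

If the subspace truly drives the head's attention, recomputing scores with `K_swapped` should move attention mass from one key to the other.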

5. Example Applications: Semantic Categories and Binding

The methodology applies across both toy analytic settings and real LLMs. In a categorical filter head (e.g., "find the fruits" within item lists), the contrastive covariance isolates subspaces for each semantic category. Projecting queries and keys onto these bases and visualizing via PCA or UMAP recovers tight clusters corresponding to human-understandable categories (fruits, animals, vehicles, etc.).

For question-answer binding tasks ("The hat is in box O. … Which box is the jam in?"), two independent mechanisms are uncovered:

  • An order-ID subspace (rank $\approx$ 2–3) reflects group-index-based entity binding.
  • A lexical subspace (rank $\approx$ 9–10) captures lexical identity via counterfactual prompt manipulations.

Attention logit attributions $\ell_j = \ell_j^{(\mathrm{order})} + \ell_j^{(\mathrm{lex})} + \ell_j^{(\mathrm{resid})}$ further clarify how subspace components drive model predictions (Lee et al., 4 Feb 2026).

6. Algorithmic Workflow and Implementation Details

The extraction and analysis of human-interpretable QK subspaces proceed via the following workflow, referred to as Algorithm ContrastiveQKSubspaces:

  1. Input: pre-computed $Q$, $K$ matrices; a binary feature criterion $F(i, j)$; rank cutoff $\tau$.
  2. Initialize $\Sigma^+$, $\Sigma^-$, and their respective pair counts.
  3. Compute global means $\mu_Q$, $\mu_K$.
  4. Accumulate $(Q_i - \mu_Q)(K_j - \mu_K)^\top$ for positive and negative pairs separately.
  5. Mean-normalize and compute $\Delta\Sigma = \Sigma^+ - \Sigma^-$.
  6. Conduct SVD; condition if necessary.
  7. Select $r$ to meet the explained-variance threshold.
  8. Return $U_{:r}$, $\Lambda_{:r}$, $V_{:r}$.

The accumulation steps may be batched for memory efficiency. Mean-centering and small ridge additions are recommended for numerical stability when $\Delta\Sigma$ is nearly rank-deficient. Choosing $\tau = 0.99$, or an elbow in the singular-value spectrum, is advised (Lee et al., 4 Feb 2026).
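The batched accumulation and ridge conditioning can be sketched as follows, assuming the global means were computed in a prior pass and batching is done over query rows (the function and all names are illustrative, not the paper's code):

```python
import numpy as np

def accumulate_delta_sigma(pair_batches, mu_q, mu_k, ridge=1e-6):
    """Stream (Q_batch, K_batch, F_batch) triples and accumulate Sigma_plus
    and Sigma_minus without materializing all pairs at once (steps 2, 4-6)."""
    d = mu_q.shape[0]
    sums = {+1: np.zeros((d, d)), -1: np.zeros((d, d))}
    counts = {+1: 0, -1: 0}
    for Qb, Kb, Fb in pair_batches:
        Qc, Kc = Qb - mu_q, Kb - mu_k
        for sign in (+1, -1):
            mask = (Fb == sign).astype(float)
            sums[sign] += np.einsum('id,jk,ij->dk', Qc, Kc, mask)
            counts[sign] += int(mask.sum())
    delta = sums[+1] / counts[+1] - sums[-1] / counts[-1]
    return delta + ridge * np.eye(d)   # small ridge for conditioning

# toy usage: two query batches reproduce the single-pass result
rng = np.random.default_rng(5)
Q = rng.normal(size=(8, 4))
K = rng.normal(size=(8, 4))
F = np.where(rng.random((8, 8)) < 0.5, 1, -1)
mu_q, mu_k = Q.mean(axis=0), K.mean(axis=0)
batched = accumulate_delta_sigma([(Q[:4], K, F[:4]), (Q[4:], K, F[4:])],
                                 mu_q, mu_k)
full = accumulate_delta_sigma([(Q, K, F)], mu_q, mu_k)
```

Because the per-pair outer products are summed before normalization, splitting the pairs across batches leaves the result unchanged up to floating-point rounding.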

7. Interpretability Impact and Research Significance

By isolating low-dimensional, human-interpretable subspaces in QK space, this methodology transforms the opaque dot-product attention mechanism into an interpretable sum where individual terms correspond to meaningful, quantifiable features. This approach enables:

  • Feature discovery without the necessity of training external probes or autoencoders.
  • Quantification of feature representation ranks and directions via singular values.
  • Visualization of structure in the latent space through projections.
  • Causal testing of feature roles via targeted interventions.
  • Attribution of token-level model decisions to identified feature components.

These capabilities collectively enhance mechanistic transparency in transformer models and provide a robust foundation for further empirical and theoretical exploration of neural attention systems (Lee et al., 4 Feb 2026).
