Human-Interpretable QK Subspaces
- Human-interpretable QK subspaces are low-dimensional subspaces within a transformer's query-key space that capture specific semantic and structural features.
- They are derived using contrastive covariance and SVD, which isolate key attention patterns by decomposing bilinear interactions into interpretable axes.
- Empirical evaluations show these subspaces explain substantial attention mass in models, enabling clearer attribution of token interactions and causal analysis.
Human-interpretable QK subspaces are low-dimensional subspaces within a transformer's attention head query-key (QK) space that correspond to specific, human-meaningful semantic or structural features. The systematic extraction and analysis of these subspaces enable precise attribution of attention to interpretable interactions between queries and keys. This approach, introduced via the contrastive covariance framework, decomposes high-dimensional QK bilinear forms into a concise set of directions each associated with features such as category membership or token binding, significantly advancing the interpretability of transformer attention mechanisms (Lee et al., 4 Feb 2026).
1. The Query-Key Space in Transformers
In transformer architectures, attention heads compute affinity scores between token representations based on query ($q$) and key ($k$) vectors. For a sequence of $n$ tokens, $q_i \in \mathbb{R}^{d_h}$ and $k_j \in \mathbb{R}^{d_h}$, with $d_h$ denoting the head dimension. The unnormalized attention logit from token $i$ to token $j$ is

$$\ell_{ij} = q_i^\top k_j,$$

and the scaled attention score is

$$s_{ij} = \frac{q_i^\top k_j}{\sqrt{d_h}}.$$
The QK space is thus a bilinear joint embedding space in which semantic or structural relationships are encoded as inner products between $q_i$ and $k_j$. Traditional interpretability efforts have struggled to isolate which features in this space drive particular attention patterns; the identification of human-interpretable QK subspaces directly addresses this interpretability challenge.
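The quantities above can be written out directly. A minimal NumPy sketch (array shapes and variable names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_h = 6, 8                      # sequence length, head dimension
Q = rng.normal(size=(n, d_h))      # rows are query vectors q_i
K = rng.normal(size=(n, d_h))      # rows are key vectors k_j

# Unnormalized attention logits: logits[i, j] = q_i . k_j
logits = Q @ K.T

# Scaled attention scores: scores[i, j] = q_i . k_j / sqrt(d_h)
scores = logits / np.sqrt(d_h)

# Row-wise softmax turns scores into an attention distribution over keys
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)
```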
2. Contrastive Covariance: Isolating Feature Interactions
The contrastive covariance approach isolates the contribution of specific features to QK interactions by constructing positive and negative (query, key) pairs based on a binary feature criterion $c(i,j) \in \{0, 1\}$ that encodes the presence or absence of a shared feature between token pairs. For example, in a toy model $c(i,j) = 1$ may indicate matched latent variables; in a categorical filter head, $c(i,j) = 1$ signals that key token $j$ belongs to the category queried in the prompt.
The central mathematical objects are the positive covariance

$$C^{+} = \mathbb{E}_{c(i,j)=1}\!\left[(q_i - \bar{q})(k_j - \bar{k})^\top\right]$$

and negative covariance

$$C^{-} = \mathbb{E}_{c(i,j)=0}\!\left[(q_i - \bar{q})(k_j - \bar{k})^\top\right],$$

where $\bar{q}$ and $\bar{k}$ are the global means over all (query, key) pairs. The contrastive covariance matrix is the difference

$$\Delta C = C^{+} - C^{-},$$

which algebraically isolates the bilinear interaction in QK space due to the feature of interest (Lee et al., 4 Feb 2026).
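A minimal sketch of this construction, assuming the pair criterion is given as a boolean matrix and all pairs fit in memory (the function name and interface are illustrative):

```python
import numpy as np

def contrastive_covariance(Q, K, is_positive):
    """Contrastive covariance: C+ minus C- over (query, key) pairs.

    Q, K: (n, d_h) arrays of query and key vectors.
    is_positive: (n, n) boolean array; True where pair (i, j) shares the feature.
    Both classes of pairs must be nonempty.
    """
    # Global means over all tokens, used to center both classes identically
    q_bar, k_bar = Q.mean(axis=0), K.mean(axis=0)
    Qc, Kc = Q - q_bar, K - k_bar

    pos = np.argwhere(is_positive)   # indices (i, j) of positive pairs
    neg = np.argwhere(~is_positive)  # indices (i, j) of negative pairs

    # Mean outer product (q_i - q_bar)(k_j - k_bar)^T within each class
    C_pos = Qc[pos[:, 0]].T @ Kc[pos[:, 1]] / len(pos)
    C_neg = Qc[neg[:, 0]].T @ Kc[neg[:, 1]] / len(neg)
    return C_pos - C_neg
```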
3. Low-Rank Decomposition and Subspace Construction
A singular value decomposition (SVD) of the contrastive covariance matrix yields

$$\Delta C = U \Sigma V^\top,$$

where $U$ and $V$ contain the left (query) and right (key) singular vectors, and $\Sigma$ the singular values $\sigma_1 \ge \sigma_2 \ge \cdots$. Selecting the top $r$ directions according to the explained-variance criterion

$$\frac{\sum_{m=1}^{r} \sigma_m^2}{\sum_{m} \sigma_m^2} \ge \tau,$$

for a chosen threshold $\tau$, produces a low-dimensional subspace spanned by $U_r = [u_1, \dots, u_r]$ in query space and $V_r = [v_1, \dots, v_r]$ in key space. This subspace captures nearly all variance in QK interactions induced by the chosen feature.

The attention logit can then be decomposed as

$$\ell_{ij} = \sum_{m=1}^{r} (q_i^\top u_m)(v_m^\top k_j) + \ell_{ij}^{\perp},$$

where $\ell_{ij}^{\perp}$ collects the contribution outside the recovered subspace, with an analogous decomposition for the scaled score $s_{ij}$. Each term attributes attention to a specific interpretable axis.
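A sketch of the rank selection and per-direction logit split, assuming a contrastive covariance matrix is already available (the threshold default and function names are illustrative; the residual term is defined so that the components sum exactly to the full logit):

```python
import numpy as np

def qk_subspace(delta_C, tau=0.9):
    """SVD of the contrastive covariance; keep the top-r singular directions.

    tau is an explained-variance threshold (a user choice; 0.9 is illustrative).
    Returns (U_r, sigma_r, V_r) with orthonormal columns in U_r and V_r.
    """
    U, S, Vt = np.linalg.svd(delta_C)
    frac = np.cumsum(S**2) / np.sum(S**2)        # cumulative explained variance
    r = int(np.searchsorted(frac, tau)) + 1      # smallest r meeting tau
    return U[:, :r], S[:r], Vt[:r].T

def decompose_logit(q, k, U_r, V_r):
    """Split q . k into per-direction terms (q.u_m)(v_m.k) plus a residual."""
    comps = (q @ U_r) * (k @ V_r)                # term m = (q.u_m)(v_m.k)
    residual = float(q @ k) - comps.sum()        # contribution outside the subspace
    return comps, residual
```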
4. Attribution of Attention and Empirical Evaluation
Per-component attention can be defined for each $m = 1, \dots, r$ as

$$\ell_{ij}^{(m)} = (q_i^\top u_m)(v_m^\top k_j)$$

and normalized as

$$\hat{\ell}_{ij}^{(m)} = \frac{\ell_{ij}^{(m)}}{\ell_{ij}}$$

to yield the fraction of the attention logit attributable to each component; summing over the components of a feature subspace gives the fraction of attention mass attributable to that subspace.
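One simple diagnostic along these lines can be computed in a few lines; note this is just the logit-level normalization sketched here, and the paper's reported attention-mass percentages may use a softmax-based metric over target tokens instead:

```python
import numpy as np

def logit_fraction(Q, K, U_r, V_r):
    """Fraction of each attention logit carried by the recovered subspace.

    Q, K: (n, d_h) query/key matrices; U_r, V_r: (d_h, r) subspace bases.
    Caution: the ratio is unstable where the full logit is near zero.
    """
    full = Q @ K.T                          # full logits q_i . k_j
    sub = (Q @ U_r) @ (V_r.T @ K.T)         # subspace part of each logit
    return sub / full
```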
Empirical evaluation in both toy models and LLMs demonstrates that these subspaces correspond to semantically meaningful structures and explain a substantial portion of the attention. In Llama 3.1-8B-Instruct, filter heads with categorical subspaces (rank 5) explain on average 68% of attention mass on category tokens (compared to a 10% baseline), and binding heads with order-ID and lexical-ID subspaces (ranks 3 and 9, respectively) explain 42% and 51%, with combined subspaces accounting for 85% (Table 1).
| Head type | Subspace | Rank | % explained |
|---|---|---|---|
| Filter Head | categorical | 5 | 68% |
| Binding Head | order-ID | 3 | 42% |
| Binding Head | lexical-ID | 9 | 51% |
| Binding Head | combined | 12 | 85% |
Causal interventions involving coordinate swaps in the recovered subspaces shift attention mass strongly (almost entirely in toy heads, substantially in real heads), confirming the causal relevance of these low-rank directions (Lee et al., 4 Feb 2026).
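One way such a coordinate-swap intervention on the key side can be implemented, assuming an orthonormal key basis $V_r$ (the function name is hypothetical): the subspace components of two keys are exchanged while everything orthogonal to the subspace is left untouched.

```python
import numpy as np

def swap_key_subspace(K, V_r, a, b):
    """Swap keys a and b's coordinates inside the recovered key subspace only.

    K: (n, d_h) key matrix; V_r: (d_h, r) orthonormal basis of the subspace.
    If attention follows the swap, the subspace is causally responsible.
    """
    K = K.copy()
    P = V_r @ V_r.T                 # orthogonal projector onto the key subspace
    ca, cb = P @ K[a], P @ K[b]     # in-subspace components of the two keys
    K[a] = K[a] - ca + cb           # key a gets b's subspace coordinates
    K[b] = K[b] - cb + ca           # and vice versa
    return K
```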
5. Example Applications: Semantic Categories and Binding
The methodology applies across both toy analytic settings and real LLMs. In a categorical filter head (e.g., "find the fruits" within item lists), the contrastive covariance isolates subspaces for each semantic category. Projecting queries and keys onto these bases and visualizing via PCA or UMAP recovers tight clusters corresponding to human-understandable categories (fruits, animals, vehicles, etc.).
For question-answer binding tasks ("The hat is in box O. … Which box is the jam in?"), two independent mechanisms are uncovered:
- An order-ID subspace (rank 2–3) reflects group-index-based entity binding.
- A lexical subspace (rank 9–10) captures lexical identity via counterfactual prompt manipulations.
Attention logit attributions further clarify how subspace components drive model predictions (Lee et al., 4 Feb 2026).
6. Algorithmic Workflow and Implementation Details
The extraction and analysis of human-interpretable QK subspaces proceed via the following workflow, referred to as Algorithm ContrastiveQKSubspaces:
- Input: pre-computed $Q$ and $K$ matrices; a binary feature criterion $c(i,j)$; an explained-variance threshold $\tau$.
- Initialize $C^{+} \leftarrow 0$, $C^{-} \leftarrow 0$, and respective pair counts.
- Compute global means $\bar{q}$, $\bar{k}$.
- Accumulate $(q_i - \bar{q})(k_j - \bar{k})^\top$ for positive and negative pairs separately.
- Mean-normalize the accumulators and compute $\Delta C = C^{+} - C^{-}$.
- Conduct SVD of $\Delta C$; condition if necessary.
- Select $r$ to meet the explained-variance threshold.
- Return $U_r$, $\Sigma_r$, $V_r$.
Accumulation steps may be batched for memory efficiency. Mean-centering and a small ridge addition are recommended for numerical stability when $\Delta C$ is nearly rank-deficient. Choosing $r$ via a fixed explained-variance threshold $\tau$ or via an elbow in the singular-value spectrum is advised (Lee et al., 4 Feb 2026).
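The workflow above can be sketched end to end. This single-pass, in-memory version omits the batched accumulation (which is a straightforward extension); the function name and the `tau`/`ridge` defaults are illustrative choices, not values from the paper:

```python
import numpy as np

def contrastive_qk_subspaces(Q, K, mask, tau=0.9, ridge=0.0):
    """End-to-end sketch of the ContrastiveQKSubspaces workflow.

    Q, K: (n, d_h) query/key matrices; mask: (n, n) boolean pair criterion.
    Returns (U_r, sigma_r, V_r) for the top-r contrastive directions.
    """
    q_bar, k_bar = Q.mean(axis=0), K.mean(axis=0)   # global means
    Qc, Kc = Q - q_bar, K - k_bar                   # mean-center
    pos, neg = np.argwhere(mask), np.argwhere(~mask)
    C_pos = Qc[pos[:, 0]].T @ Kc[pos[:, 1]] / len(pos)
    C_neg = Qc[neg[:, 0]].T @ Kc[neg[:, 1]] / len(neg)
    dC = C_pos - C_neg
    if ridge:                                       # optional conditioning
        dC = dC + ridge * np.eye(dC.shape[0])
    U, S, Vt = np.linalg.svd(dC)
    frac = np.cumsum(S**2) / np.sum(S**2)
    r = int(np.searchsorted(frac, tau)) + 1         # explained-variance cutoff
    return U[:, :r], S[:r], Vt[:r].T
```

On toy data with a planted shared feature (a common direction carried by matched query-key pairs), the leading recovered directions align with the planted axes, mirroring the toy-model experiments described above.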
7. Interpretability Impact and Research Significance
By isolating low-dimensional, human-interpretable subspaces in QK space, this methodology transforms the opaque dot-product attention mechanism into an interpretable sum where individual terms correspond to meaningful, quantifiable features. This approach enables:
- Feature discovery without the necessity of training external probes or autoencoders.
- Quantification of feature representation ranks and directions via singular values.
- Visualization of structure in the latent space through projections.
- Causal testing of feature roles via targeted interventions.
- Attribution of token-level model decisions to identified feature components.
These capabilities collectively enhance mechanistic transparency in transformer models and provide a robust foundation for further empirical and theoretical exploration of neural attention systems (Lee et al., 4 Feb 2026).