Human-Interpretable QK Subspaces
- Human-interpretable QK subspaces are low-dimensional subspaces within a transformer's query-key space that capture specific semantic and structural features.
- They are derived using contrastive covariance and SVD, which isolate key attention patterns by decomposing bilinear interactions into interpretable axes.
- Empirical evaluations show these subspaces explain substantial attention mass in models, enabling clearer attribution of token interactions and causal analysis.
Human-interpretable QK subspaces are low-dimensional subspaces within a transformer's attention head query-key (QK) space that correspond to specific, human-meaningful semantic or structural features. The systematic extraction and analysis of these subspaces enable precise attribution of attention to interpretable interactions between queries and keys. This approach, introduced via the contrastive covariance framework, decomposes high-dimensional QK bilinear forms into a concise set of directions each associated with features such as category membership or token binding, significantly advancing the interpretability of transformer attention mechanisms (Lee et al., 4 Feb 2026).
1. The Query-Key Space in Transformers
In transformer architectures, attention heads compute affinity scores between token representations based on query ($q$) and key ($k$) vectors. For a sequence of $n$ tokens, $q_i \in \mathbb{R}^{d_h}$ and $k_j \in \mathbb{R}^{d_h}$, with $d_h$ denoting the head dimension. The unnormalized attention logit from token $i$ to token $j$ is

$$\ell_{ij} = q_i^\top k_j,$$

and the scaled attention score is

$$s_{ij} = \frac{q_i^\top k_j}{\sqrt{d_h}}.$$
The QK space is thus a bilinear joint embedding space in which semantic or structural relationships are encoded as inner products between $q_i$ and $k_j$. Traditional interpretability efforts have struggled to isolate which features in this space drive particular attention patterns; the identification of human-interpretable QK subspaces directly addresses this interpretability challenge.
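The quantities above can be written out directly. A minimal NumPy sketch (array shapes and variable names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_h = 6, 8                      # sequence length, head dimension
Q = rng.normal(size=(n, d_h))      # rows are query vectors q_i
K = rng.normal(size=(n, d_h))      # rows are key vectors k_j

# Unnormalized attention logits: logits[i, j] = q_i . k_j
logits = Q @ K.T

# Scaled attention scores: scores[i, j] = q_i . k_j / sqrt(d_h)
scores = logits / np.sqrt(d_h)

# Row-wise softmax turns scores into an attention distribution over keys
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)
```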
2. Contrastive Covariance: Isolating Feature Interactions
The contrastive covariance approach isolates the contribution of specific features to QK interactions by constructing positive and negative (query, key) pairs based on a binary feature criterion $c(i,j) \in \{0, 1\}$ that encodes the presence or absence of a shared feature between token pairs. For example, in a toy model $c(i,j) = 1$ may indicate matched latent variables; in a categorical filter head, $c(i,j) = 1$ signals that key token $j$ belongs to the category queried in the prompt.
The central mathematical objects are the positive covariance

$$C^{+} = \mathbb{E}_{c(i,j)=1}\!\left[(q_i - \bar{q})(k_j - \bar{k})^\top\right]$$

and negative covariance

$$C^{-} = \mathbb{E}_{c(i,j)=0}\!\left[(q_i - \bar{q})(k_j - \bar{k})^\top\right],$$

where $\bar{q}$ and $\bar{k}$ are the global means over all (query, key) pairs. The contrastive covariance matrix is the difference

$$\Delta C = C^{+} - C^{-},$$

which algebraically isolates the bilinear interaction in QK space due to the feature of interest (Lee et al., 4 Feb 2026).
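A minimal sketch of this construction, assuming the pair criterion is given as a boolean matrix and all pairs fit in memory (the function name and interface are illustrative):

```python
import numpy as np

def contrastive_covariance(Q, K, is_positive):
    """Contrastive covariance: C+ minus C- over (query, key) pairs.

    Q, K: (n, d_h) arrays of query and key vectors.
    is_positive: (n, n) boolean array; True where pair (i, j) shares the feature.
    Both classes of pairs must be nonempty.
    """
    # Global means over all tokens, used to center both classes identically
    q_bar, k_bar = Q.mean(axis=0), K.mean(axis=0)
    Qc, Kc = Q - q_bar, K - k_bar

    pos = np.argwhere(is_positive)   # indices (i, j) of positive pairs
    neg = np.argwhere(~is_positive)  # indices (i, j) of negative pairs

    # Mean outer product (q_i - q_bar)(k_j - k_bar)^T within each class
    C_pos = Qc[pos[:, 0]].T @ Kc[pos[:, 1]] / len(pos)
    C_neg = Qc[neg[:, 0]].T @ Kc[neg[:, 1]] / len(neg)
    return C_pos - C_neg
```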
3. Low-Rank Decomposition and Subspace Construction
A singular value decomposition (SVD) of the contrastive covariance matrix yields

$$\Delta C = U \Sigma V^\top,$$

where $U$ and $V$ contain the left (query) and right (key) singular vectors, and $\Sigma$ the singular values $\sigma_1 \ge \sigma_2 \ge \cdots$. Selecting the top $r$ directions according to the explained-variance criterion

$$\frac{\sum_{m=1}^{r} \sigma_m^2}{\sum_{m} \sigma_m^2} \ge \tau,$$

for a chosen threshold $\tau$, produces a low-dimensional subspace spanned by $U_r = [u_1, \dots, u_r]$ in query space and $V_r = [v_1, \dots, v_r]$ in key space. This subspace captures nearly all variance in QK interactions induced by the chosen feature.

The attention logit can then be decomposed as

$$\ell_{ij} = \sum_{m=1}^{r} (q_i^\top u_m)(v_m^\top k_j) + \ell_{ij}^{\perp},$$

where $\ell_{ij}^{\perp}$ collects the contribution outside the recovered subspace, with an analogous decomposition for the scaled score $s_{ij}$. Each term attributes attention to a specific interpretable axis.
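A sketch of the rank selection and per-direction logit split, assuming a contrastive covariance matrix is already available (the threshold default and function names are illustrative; the residual term is defined so that the components sum exactly to the full logit):

```python
import numpy as np

def qk_subspace(delta_C, tau=0.9):
    """SVD of the contrastive covariance; keep the top-r singular directions.

    tau is an explained-variance threshold (a user choice; 0.9 is illustrative).
    Returns (U_r, sigma_r, V_r) with orthonormal columns in U_r and V_r.
    """
    U, S, Vt = np.linalg.svd(delta_C)
    frac = np.cumsum(S**2) / np.sum(S**2)        # cumulative explained variance
    r = int(np.searchsorted(frac, tau)) + 1      # smallest r meeting tau
    return U[:, :r], S[:r], Vt[:r].T

def decompose_logit(q, k, U_r, V_r):
    """Split q . k into per-direction terms (q.u_m)(v_m.k) plus a residual."""
    comps = (q @ U_r) * (k @ V_r)                # term m = (q.u_m)(v_m.k)
    residual = float(q @ k) - comps.sum()        # contribution outside the subspace
    return comps, residual
```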
4. Attribution of Attention and Empirical Evaluation
Per-component attention can be defined for each $m = 1, \dots, r$ as

$$\ell_{ij}^{(m)} = (q_i^\top u_m)(v_m^\top k_j)$$

and normalized as

$$\hat{\ell}_{ij}^{(m)} = \frac{\ell_{ij}^{(m)}}{\ell_{ij}}$$

to yield the fraction of the attention logit attributable to each component; summing over the components of a feature subspace gives the fraction of attention mass attributable to that subspace.
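One simple diagnostic along these lines can be computed in a few lines; note this is just the logit-level normalization sketched here, and the paper's reported attention-mass percentages may use a softmax-based metric over target tokens instead:

```python
import numpy as np

def logit_fraction(Q, K, U_r, V_r):
    """Fraction of each attention logit carried by the recovered subspace.

    Q, K: (n, d_h) query/key matrices; U_r, V_r: (d_h, r) subspace bases.
    Caution: the ratio is unstable where the full logit is near zero.
    """
    full = Q @ K.T                          # full logits q_i . k_j
    sub = (Q @ U_r) @ (V_r.T @ K.T)         # subspace part of each logit
    return sub / full
```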
Empirical evaluation in both toy models and LLMs demonstrates that these subspaces correspond to semantically meaningful structures and explain a substantial portion of the attention. In Llama 3.1-8B-Instruct, filter heads with categorical subspaces (rank 5) explain on average 68% of attention mass on category tokens (compared to a 10% baseline), and binding heads with order-ID and lexical-ID subspaces (ranks 3 and 9, respectively) explain 42% and 51%, with combined subspaces accounting for 85% (Table 1).
| Head type | Subspace | Rank | % explained |
|---|---|---|---|
| Filter Head | categorical | 5 | 68% |
| Binding Head | order-ID | 3 | 42% |
| Binding Head | lexical-ID | 9 | 51% |
| Binding Head | combined | 12 | 85% |
Causal interventions involving coordinate swaps in the recovered subspaces shift attention mass strongly (almost entirely in toy heads, substantially in real heads), confirming the causal relevance of these low-rank directions (Lee et al., 4 Feb 2026).
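One way such a coordinate-swap intervention on the key side can be implemented, assuming an orthonormal key basis $V_r$ (the function name is hypothetical): the subspace components of two keys are exchanged while everything orthogonal to the subspace is left untouched.

```python
import numpy as np

def swap_key_subspace(K, V_r, a, b):
    """Swap keys a and b's coordinates inside the recovered key subspace only.

    K: (n, d_h) key matrix; V_r: (d_h, r) orthonormal basis of the subspace.
    If attention follows the swap, the subspace is causally responsible.
    """
    K = K.copy()
    P = V_r @ V_r.T                 # orthogonal projector onto the key subspace
    ca, cb = P @ K[a], P @ K[b]     # in-subspace components of the two keys
    K[a] = K[a] - ca + cb           # key a gets b's subspace coordinates
    K[b] = K[b] - cb + ca           # and vice versa
    return K
```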
5. Example Applications: Semantic Categories and Binding
The methodology applies across both toy analytic settings and real LLMs. In a categorical filter head (e.g., "find the fruits" within item lists), the contrastive covariance isolates subspaces for each semantic category. Projecting queries and keys onto these bases and visualizing via PCA or UMAP recovers tight clusters corresponding to human-understandable categories (fruits, animals, vehicles, etc.).
For question-answer binding tasks ("The hat is in box O. … Which box is the jam in?"), two independent mechanisms are uncovered:
- An order-ID subspace (rank 2–3) reflects group-index-based entity binding.
- A lexical subspace (rank 9–10) captures lexical identity via counterfactual prompt manipulations.
Attention logit attributions further clarify how subspace components drive model predictions (Lee et al., 4 Feb 2026).
6. Algorithmic Workflow and Implementation Details
The extraction and analysis of human-interpretable QK subspaces proceed via the following workflow, referred to as Algorithm ContrastiveQKSubspaces:
- Input: pre-computed $Q$ and $K$ matrices; a binary feature criterion $c(i,j)$; an explained-variance threshold $\tau$.
- Initialize $C^{+} \leftarrow 0$, $C^{-} \leftarrow 0$, and respective pair counts.
- Compute global means $\bar{q}$, $\bar{k}$.
- Accumulate $(q_i - \bar{q})(k_j - \bar{k})^\top$ for positive and negative pairs separately.
- Mean-normalize the accumulators and compute $\Delta C = C^{+} - C^{-}$.
- Conduct SVD of $\Delta C$; condition if necessary.
- Select $r$ to meet the explained-variance threshold.
- Return $U_r$, $\Sigma_r$, $V_r$.
Accumulation steps may be batched for memory efficiency. Mean-centering and a small ridge addition are recommended for numerical stability when $\Delta C$ is nearly rank-deficient. Choosing $r$ via a fixed explained-variance threshold $\tau$ or via an elbow in the singular-value spectrum is advised (Lee et al., 4 Feb 2026).
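The workflow above can be sketched end to end. This single-pass, in-memory version omits the batched accumulation (which is a straightforward extension); the function name and the `tau`/`ridge` defaults are illustrative choices, not values from the paper:

```python
import numpy as np

def contrastive_qk_subspaces(Q, K, mask, tau=0.9, ridge=0.0):
    """End-to-end sketch of the ContrastiveQKSubspaces workflow.

    Q, K: (n, d_h) query/key matrices; mask: (n, n) boolean pair criterion.
    Returns (U_r, sigma_r, V_r) for the top-r contrastive directions.
    """
    q_bar, k_bar = Q.mean(axis=0), K.mean(axis=0)   # global means
    Qc, Kc = Q - q_bar, K - k_bar                   # mean-center
    pos, neg = np.argwhere(mask), np.argwhere(~mask)
    C_pos = Qc[pos[:, 0]].T @ Kc[pos[:, 1]] / len(pos)
    C_neg = Qc[neg[:, 0]].T @ Kc[neg[:, 1]] / len(neg)
    dC = C_pos - C_neg
    if ridge:                                       # optional conditioning
        dC = dC + ridge * np.eye(dC.shape[0])
    U, S, Vt = np.linalg.svd(dC)
    frac = np.cumsum(S**2) / np.sum(S**2)
    r = int(np.searchsorted(frac, tau)) + 1         # explained-variance cutoff
    return U[:, :r], S[:r], Vt[:r].T
```

On toy data with a planted shared feature (a common direction carried by matched query-key pairs), the leading recovered directions align with the planted axes, mirroring the toy-model experiments described above.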
7. Interpretability Impact and Research Significance
By isolating low-dimensional, human-interpretable subspaces in QK space, this methodology transforms the opaque dot-product attention mechanism into an interpretable sum where individual terms correspond to meaningful, quantifiable features. This approach enables:
- Feature discovery without the necessity of training external probes or autoencoders.
- Quantification of feature representation ranks and directions via singular values.
- Visualization of structure in the latent space through projections.
- Causal testing of feature roles via targeted interventions.
- Attribution of token-level model decisions to identified feature components.
These capabilities collectively enhance mechanistic transparency in transformer models and provide a robust foundation for further empirical and theoretical exploration of neural attention systems (Lee et al., 4 Feb 2026).