
TensorLens: End-to-End Transformer Analysis

Updated 1 February 2026
  • TensorLens is a framework that represents a transformer as an input-dependent linear operator using a high-order attention–interaction tensor to encapsulate all computational components.
  • It reformulates multi-head attention, layer normalization, feed-forward networks, and residuals into a unified linear Jacobian through precise vectorization techniques.
  • Empirical tests demonstrate that TensorLens outperforms traditional methods in visualization, probing, and manipulation of transformer behaviors, supporting tasks like model distillation and ablation.

TensorLens is a theoretical and practical framework for end-to-end transformer analysis, centered on the construction of a high-order attention–interaction tensor that encodes a full transformer as a single, input-dependent linear operator. This tensorial representation captures all computational components of a transformer block—including multi-head attention, feed-forward networks (FFN), layer normalizations, and residual connections—in a unified, expressive formalism. TensorLens provides both the mathematical apparatus and empirical tools for interpretability, visualization, manipulation, and probing of transformer architectures, overcoming limitations of previous attention-aggregation methodologies (Atad et al., 25 Jan 2026).

1. Mathematical Formulation of the High-Order Attention–Interaction Tensor

TensorLens formulates a vanilla $N$-layer transformer, at a fixed input $X$, as an input-conditioned linear operator $T \in \mathbb{R}^{L\times D\times L\times D}$, where $L$ is the sequence length and $D$ is the hidden dimension, so that $F(X) = T(X)$ and

$$\mathrm{vec}[F(X)] = T_{\mathrm{mat}} \cdot \mathrm{vec}[X],$$

with $T_{\mathrm{mat}} \in \mathbb{R}^{(LD)\times(LD)}$ a matrix unfolding of $T$. The model-wide tensor $T$ is the ordered product of per-layer block tensors $T^n$, $T = \prod_{n=1}^N T^n$, and thereby

$$\mathrm{vec}[F(X)] = \Big(\prod_{n=1}^N T^n\Big) \cdot \mathrm{vec}[X].$$

Each $T^n$ encapsulates attention, both residual and non-residual pathways, layer normalizations, and FFN operations as a single linear transformation: $T^n = L_2^n \cdot (M^n + I) \cdot L_1^n \cdot (A^n + I)$, where $A$ is the multi-head attention tensor, $L_1$ and $L_2$ are the two layer-normalization tensors, $M$ is the FFN linearization tensor, and $I$ is the identity. The Kronecker products and diagonalizations required to form these sub-tensors are derived explicitly for each operation, and the entire construction is local, i.e., functionally dependent on the specific input instance $X$ through statistics (e.g., LayerNorm means and variances, activation slopes) observed on $X$.
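The block composition above can be sketched numerically. This is a toy illustration only: the four operators are random stand-in matrices of the right shapes, not the derived Kronecker/diagonal structures from the paper, so it demonstrates nothing beyond how the unfoldings compose.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 2, 3
dim = L * D
I = np.eye(dim)

# Random stand-ins for each block's unfolded (LD x LD) linearized operator.
# In TensorLens these come from the model's weights and statistics frozen at X;
# here they are arbitrary matrices, purely to illustrate composition.
A = 0.1 * rng.standard_normal((dim, dim))    # attention linearization A^n
L1 = np.diag(rng.random(dim) + 0.5)          # first LayerNorm L_1^n
M = 0.1 * rng.standard_normal((dim, dim))    # FFN linearization M^n
L2 = np.diag(rng.random(dim) + 0.5)          # second LayerNorm L_2^n

# One block: T^n = L_2^n (M^n + I) L_1^n (A^n + I)
T_block = L2 @ (M + I) @ L1 @ (A + I)

# Stacking N blocks is an ordered matrix product on the unfoldings.
N = 3
T_total = np.linalg.multi_dot([T_block] * N)

# Applying the blocks one at a time agrees with the single product operator.
x = rng.standard_normal(dim)
y = x.copy()
for _ in range(N):
    y = T_block @ y
assert np.allclose(T_total @ x, y)
```

The ordered product on the unfoldings is exactly the statement $\mathrm{vec}[F(X)] = (\prod_n T^n)\,\mathrm{vec}[X]$ restricted to this toy setting.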

2. Stepwise Derivation and Structural Intuition

The derivation exploits standard vectorization identities to recast all individual sub-layer computations into the form $\mathrm{vec}[f(X)] = T_f \cdot \mathrm{vec}[X]$, specifically:

  • Self-attention: Combines token–token interactions (length–length) and feature–feature correlations (dimension–dimension) as a sum of Kronecker products $\sum_h (W_{v,h} W_{o,h})^\top \otimes A_h$, where $A_h$ is the per-head attention matrix and $W_{v,h}, W_{o,h}$ are the value/output projections.
  • Layer normalization and FFN: Conditioned on the fixed input $X$ (so $\mu$, $\sigma$, $\phi'$ are frozen), both LayerNorm and the FFN become data-dependent diagonal linear operators.
  • Residuals: Incorporated via an additive identity; vectorized as $(I + G) \cdot \mathrm{vec}[X]$.
  • Compositionality: Stacking all blocks yields an ordered product of the blockwise $T^n$.
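The LayerNorm freezing in the second bullet can be made concrete with a minimal single-vector sketch (my own illustration, not code from the paper): once $\mu$ and $\sigma$ are fixed at a reference input, LayerNorm is affine, and its linear part is the diagonal operator $\mathrm{diag}(\gamma/\sigma)$.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
x = rng.standard_normal(D)                       # reference input
gamma, beta = rng.standard_normal(D), rng.random(D)

# Freeze the normalization statistics at the reference input x.
mu, sigma = x.mean(), x.std()

def ln_frozen(x_):
    # LayerNorm with mu and sigma held fixed: an affine map in x_
    return gamma * (x_ - mu) / sigma + beta

# The linear part of the frozen map is the diagonal operator diag(gamma / sigma).
T_ln = np.diag(gamma / sigma)

# Subtracting the constant offset recovers the diagonal linear action.
x2 = rng.standard_normal(D)
assert np.allclose(ln_frozen(x2) - ln_frozen(np.zeros(D)), T_ln @ x2)
```

The FFN case is analogous: freezing the activation slopes $\phi'$ at $X$ turns the nonlinearity into an elementwise (diagonal) scaling between the two linear projections.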

This linearization is locally faithful: by definition, $T$ is the exact Jacobian of the transformer's forward function, patched such that all nonlinearities (softmax, LayerNorm, activation slopes) are "frozen" at the values computed on $X$.

3. Computational Construction and Example Pseudocode

To compute $T_X$ for a given input $X$, TensorLens uses automatic differentiation. The approach fixes the nonlinearities (i.e., computes and freezes the softmax weights, norms, and activation slopes at $X$) and computes the output's total Jacobian with respect to the input:

import torch

def compute_tensor(model, X):
    # 1) run a forward pass to cache all A_h, σ, φ′, etc. at X
    _ = model.forward_with_cache(X)
    # 2) define a linearized forward that reuses the cached A_h, φ′, σ
    def lin_model(X_):
        return model.linearized_forward(X_)  # returns (L, D)
    # 3) compute the Jacobian dY/dX of the frozen model: shape (L, D, L, D)
    T = torch.autograd.functional.jacobian(lin_model, X, create_graph=False)
    return T
A simple worked example (e.g., $L=2$, $D=2$, $N=1$, $H=1$) is constructed by directly plugging in $2\times 2$ weight matrices and applying the four Kronecker and diagonalization operations.

4. Comparison to Previous Attention Aggregation Methodologies

TensorLens differs fundamentally from earlier aggregation schemes. The following table summarizes its relation to major prior approaches:

| Method | Included Components | Notable Omissions |
|---|---|---|
| Attn (head-averaging) | $A_h$ per layer | Projections, residuals, FFN, LN |
| Rollout [Abnar & Zuidema] | Chained averages | As above |
| Value-weighted [Kobayashi] | Incorporates $W_v W_o$ | LayerNorm, residuals, FFN |
| W.AttnResLN [Kobayashi '21] | Residuals, first LN | FFN, second LN |
| GlobEnc [Modarressi '22] | Two LNs added | FFN |
| TensorLens | All linear ops, input/output embeddings, activations | None |

Only TensorLens:

  • Is fully principled, incorporating all linear operations, both LayerNorms, both FFN projections, activation slopes, residual adds, and embeddings.
  • Is exact (first-order) at a given XX, as it is the literal Jacobian of the model’s patched forward function (local error bounded by Proposition 1).
  • Offers flexible axis collapses to derive generalized or specialized $L \times L$ attention maps that subsume previous variants.

5. Empirical Evaluation and Applications

Extensive empirical tests demonstrate that TensorLens provides superior fidelity and interpretability compared to previous aggregation schemes.

  • Perturbation tests: On DeiT-Base/Small (ImageNet), TensorLens-based maps ("Tensor,Norm" and "Tensor,In+Out") achieve AUC $> 0.66/0.82$ (versus $< 0.60$ for any non-tensor baseline). For BERT-family and Gemma3 models on IMDB, TensorLens achieves AUC $> 0.10/0.16$ ($< 0.09$ for non-tensor baselines). On decoder-only LLMs (Pythia-1B, Pico-570M, Phi-1.5 on WikiText-103), TensorLens ranks top-1 or top-2 by HS-MSE and AOPC metrics.
  • Relation decoding: Averaging per-example tensors $\bar{T}_r = \frac{1}{m}\sum_{i=1}^m T_{X_i}$ yields a relation-specific linear map, matching or exceeding the Linear Relation Extraction (LRE) baseline on Pythia-1B, which only considers $\partial\,\mathrm{logits}/\partial\,\mathrm{embedding}$.
  • Interpretability and visualization: By collapsing $T$ to $L \times L$ attention maps (via norms, in+out embeddings, or per-class output projections), TensorLens recovers or extends attribution maps for input-token importance. Examples include $T_{\mathrm{norm}}[i,j] = \|T[i,:,j,:]\|_2$, $T_{\mathrm{io}}[i,j] = X^N[i,:]\, T[i,:,j,:]\, (X^0[j,:])^\top$, and $T_{\mathrm{cls}}[c, i \gets j] = E_{\mathrm{out}}[:,c]^\top\, T[i,:,j,:]\, X^0[j,:]^\top$.
  • Manipulation and distillation: $T$ (as the local linearization of $F$ at $X$) is directly usable for linear distillation (cf. "LoLCats" by Zhang et al. '24). Model interventions can be effected by masking subtensors within $T$, with immediate re-evaluation of the collapsed attention maps.
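The norm and in+out collapses above reduce to a few array operations. A minimal sketch, using a random stand-in for a computed $T_X$ (shapes only; not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 3, 4
T = rng.standard_normal((L, D, L, D))          # stand-in for a computed tensor T_X
XN, X0 = rng.standard_normal((L, D)), rng.standard_normal((L, D))  # final / input embeddings

# T_norm[i, j] = || T[i, :, j, :] ||_2  (Frobenius norm over the two feature axes)
T_norm = np.linalg.norm(T, axis=(1, 3))

# T_io[i, j] = X^N[i, :] @ T[i, :, j, :] @ X^0[j, :]^T
T_io = np.einsum('id,idje,je->ij', XN, T, X0)

assert T_norm.shape == (L, L) and T_io.shape == (L, L)
# spot-check one entry of the in+out collapse
assert np.isclose(T_io[0, 1], XN[0] @ T[0, :, 1, :] @ X0[1])
```

Masking a subtensor of `T` (e.g., zeroing the slice for one head's contribution) and recomputing these collapses is the intervention pattern described above.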

A memory-efficient Jacobian-slice implementation, along with full code and worked examples, is available at https://github.com/idoatad/TensorLens.

6. Theoretical Guarantees and Scope

TensorLens is theoretically grounded, representing the first complete, input-dependent, high-order tensor formalization of a transformer's global linear behavior. It encapsulates all prior "extended attention" proposals as strict special cases, achievable via particular axis reductions or omission of components. Proposition 1 in the source material provides local error bounds for the Jacobian approximation. The framework operates directly with input and output embeddings, and includes the capacity to trace, ablate, or visualize the influence of any model subcomponent within $T$ at the granularity of tokens, neurons, or projection subspaces.

A plausible implication is that TensorLens may serve as a foundational analytic tool for the next generation of mechanistic interpretability and model-editing methodologies, providing fine-grained, exact, and extensible representations of transformer computations.

Reference: [TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors, (Atad et al., 25 Jan 2026)]
