AtMan: Perturbation-Based Explainability for Transformers
- The paper presents a method that directly perturbs raw attention scores post-QK^T to quantify token relevance without requiring backward passes.
- AtMan operates solely in the forward pass, enabling scalable and modality-agnostic explainability across text, vision, and multi-modal transformers.
- Empirical results show competitive mAP and mAR metrics with minimal additional memory cost compared to gradient-based methods.
AtMan (Attention Manipulation) is a perturbation-based explainability method for generative transformer models that provides input-level relevance maps by directly manipulating internal attention scores. Designed to overcome the memory and computational barriers inherent in backpropagation-based explanation techniques, AtMan measures the impact of attenuating or amplifying attention for each input position on the model’s output distribution. Operating exclusively within the forward pass and leveraging parallelizable search over the embedding space, AtMan incurs almost no additional memory overhead beyond a standard forward computation and is agnostic to input modality, supporting text, vision, and multi-modal generative transformers (Deiseroth et al., 2023).
1. Motivation and Limitations of Existing Explanation Methods
Contemporary generative transformer architectures (e.g., GPT, MAGMA, BLIP) function as autoregressive models that map input sequences, potentially multi-modal, to output token distributions. Their substantial parameter counts and complex input representations make these models opaque. State-of-the-art explainers such as Grad-CAM, Input×Gradient, Integrated Gradients, and the attention-and-gradient approaches of Chefer et al. all depend on backpropagation, resulting in GPU memory consumption roughly double that of a forward inference (both activations and gradients must be retained). This resource intensity renders explanation impractical for large models (≥13B parameters) or for sequences longer than 1,000 tokens, especially in production settings.
Gradient-free perturbation methods, such as LIME and SHAP, circumvent the need for backpropagation but require a prohibitively large number of forward passes—often tens or hundreds of thousands per example—to achieve statistically stable relevance estimates. This makes their application infeasible for large-scale or long-context transformers.
2. Core Principle and Algorithmic Structure
AtMan avoids both backpropagation and high-volume token masking by intervening directly in the model’s self-attention computations. The method operates as follows:
- Direct attention-score manipulation: Instead of modifying input tokens or calculating gradients, AtMan perturbs the columns of the raw attention score matrices after the computation but before masking and softmax, for each layer and each attention head.
- Suppression/amplification: For a given token index $k$, the $k$-th column of every raw attention score matrix is scaled by a suppression factor $f$, where $0 < f < 1$. This attenuates the influence of the probed token (or, via a related procedure, of multiple semantically similar tokens).
- Forward-only evaluation: The modified attention scores are propagated through the network in a forward pass, and the shift in model output cross-entropy is measured.
- Relevance computation: The difference in cross-entropy between the original and perturbed predictions quantifies the relevance of the probed token.
No backward computation is used at any stage. All calculations remain within the forward computational graph, and overhead is limited to efficient, on-the-fly attention scaling and the calculation of an input similarity matrix.
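The forward-only probing loop described above can be sketched on a toy single-head attention layer. This is a minimal illustration, not the authors' implementation: `toy_forward`, the single-layer model, and all shapes are assumptions chosen for brevity.

```python
# Minimal sketch of AtMan's forward-only probing loop on a toy model.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def toy_forward(X, Wq, Wk, Wv, Wo, target, suppress=None, f=0.9):
    """One attention layer plus a linear head; returns cross-entropy for `target`.

    If `suppress` is a token index, the corresponding column of the raw
    attention scores H = QK^T / sqrt(d) is scaled by f *before* softmax,
    which is the intervention AtMan applies in every layer and head.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    H = Q @ K.T / np.sqrt(K.shape[-1])   # raw scores, pre-softmax
    if suppress is not None:
        H[:, suppress] *= f              # attenuate the probed token's column
    A = softmax(H, axis=-1)
    logits = (A @ V)[-1] @ Wo            # last position predicts the next token
    return -np.log(softmax(logits)[target])

rng = np.random.default_rng(0)
n, d, vocab = 5, 8, 11
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wo = rng.normal(size=(d, vocab))
target = 3

base = toy_forward(X, Wq, Wk, Wv, Wo, target)
relevance = np.array([
    toy_forward(X, Wq, Wk, Wv, Wo, target, suppress=k) - base
    for k in range(n)
])  # one extra forward pass per probed token, no gradients anywhere
```

A positive entry in `relevance` means suppressing that token increased the loss on the target, i.e., the token supported the prediction.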
3. Mathematical Formulation
Let $x = (x_1, \dots, x_n)$ denote the input sequence of embeddings (text, image patches, or mixed-modal), with unmodified model parameters $\theta$ and target output token $y$. The cross-entropy loss is

$$\mathcal{L}(x) = -\log p_\theta(y \mid x).$$

Within each transformer layer $l$ and head $h$, raw attention scores are computed as

$$H_{l,h} = \frac{Q_{l,h} K_{l,h}^{\top}}{\sqrt{d_h}} \in \mathbb{R}^{n \times n},$$

where $d_h$ is the head dimension, $n_h$ is the number of heads, and $n$ is the sequence length. After application of the mask $M$ and softmax:

$$A_{l,h} = \operatorname{softmax}\!\left(H_{l,h} + M\right).$$

For single-token suppression, the $k$-th column of $H_{l,h}$ in every layer $l$ and head $h$ is scaled by $f$:

$$\tilde{H}_{l,h,\,\cdot\,,k} = f \cdot H_{l,h,\,\cdot\,,k}, \qquad 0 < f < 1.$$

A new forward pass yields the perturbed loss $\mathcal{L}_k(x)$, and the influence of token $k$ is

$$r_k = \mathcal{L}_k(x) - \mathcal{L}(x).$$

The relevance vector over input positions is $r = (r_1, \dots, r_n)$.
For correlated-token suppression (e.g., in multi-modal/vision domains where concepts span several tokens), cosine similarity between normalized input embeddings is used:

$$\operatorname{sim}(i, j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}.$$

Tokens $j$ with $\operatorname{sim}(k, j) > \tau$ (threshold $\tau$) receive correlated suppression, with per-column multiplier $f$ for these tokens and $1$ for all others.
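The thresholded multiplier rule can be sketched as follows. This is an illustrative reading of the correlated-suppression step, not the paper's code; `correlated_multipliers` and the exact rule are assumptions.

```python
# Sketch of correlated-token suppression: cosine similarity between input
# embeddings selects which columns are suppressed together with token k.
import numpy as np

def correlated_multipliers(X, k, f=0.9, tau=0.7):
    """Per-column scale factors when probing token k.

    Columns whose embedding has cosine similarity > tau with token k share
    the suppression factor f; all other columns are left unchanged (1.0).
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn[k]                      # cosine similarity to token k
    return np.where(sim > tau, f, 1.0)

X = np.array([[1.0, 0.0],
              [0.99, 0.1],   # nearly parallel to token 0 -> co-suppressed
              [0.0, 1.0]])   # orthogonal to token 0 -> untouched
mult = correlated_multipliers(X, k=0)
```

Here `mult` evaluates to `[0.9, 0.9, 1.0]`: probing token 0 also suppresses its near-duplicate, while the orthogonal token keeps full attention.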
4. Computational Efficiency and Deployment Considerations
AtMan’s principal efficiency derives from eliminating backward passes. Where gradient-based methods require all activation tensors for gradient retention—doubling effective memory usage—AtMan operates within the memory profile of a vanilla forward evaluation for the model, incurring negligible additional cost aside from storage of the similarity matrix.
Naively, AtMan requires $n$ forward passes to generate a complete relevance vector for a sequence of length $n$. However, these passes are fully parallelizable in batches or can be distributed across multiple GPUs or pipeline stages. Additional optimization via chunked or hierarchical probing further mitigates runtime demands.
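One way to vectorize the $n$ probes is to stack per-column scaling masks and broadcast them over a shared score matrix, so all perturbed variants run as a single batch. The helper below is an assumption about how such batching could look, applied to a stand-in score matrix rather than a real model.

```python
# Sketch of batching the n suppression probes into one forward call.
import numpy as np

def batched_suppression_masks(n, f=0.9):
    """Return an (n, n, n) stack: masks[k] scales column k of an (n, n)
    raw-attention-score matrix by f and leaves all other columns at 1."""
    masks = np.ones((n, n, n))
    for k in range(n):
        masks[k, :, k] = f
    return masks

H = np.arange(16.0).reshape(4, 4)   # stand-in for raw scores QK^T / sqrt(d)
masks = batched_suppression_masks(4)
H_batched = masks * H               # broadcast: 4 perturbed score maps at once
# each H_batched[k] feeds one probe; the batch dimension runs in parallel
```

Each `H_batched[k]` is the score matrix with column `k` attenuated, so the $n$ probes reduce to one batched forward pass per chunk of tokens.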
| Method | Memory Overhead | Forward Passes Required | Scalability (large models) |
|---|---|---|---|
| Gradient-based (e.g., IG) | ~2× forward pass | 1 + (backward) | Limited (OOM ≥ 6B+ params) |
| LIME/SHAP (perturbation) | ~1× but 10k–100k passes | 10k–100k | Infeasible |
| AtMan | ~1× forward pass | n (batched/parallelizable) | Scales to 30B+ |
This approach allows AtMan to operate with the largest commercially viable transformer models (e.g., MAGMA-13B, MAGMA-30B, BLIP) where gradient-based and classical perturbation methods exceed available memory or computational budgets.
5. Empirical Results
AtMan has been empirically validated on both text-only and multi-modal tasks:
- Text QA (SQuAD, GPT-J 6B)
- Evaluation via mean Average Precision (mAP) and mean Average Recall (mAR) for predicting gold-standard answer spans.
- AtMan: mAP = 73.7%, mAR = 93.4%
- Chefer et al. (best gradient): mAP = 72.7%, mAR = 96.6%
- Integrated Gradients/Input×Grad: mAP ≈ 50%, mAR ≈ 90%
- OpenImages Weak Segmentation (MAGMA-6B)
- Prompt: "<Image> This is a picture of ..." plus target class, evaluation vs. gold pixel-mask.
- AtMan: mAP ≈ 65.5%, mAR ≈ 15.7%
- Chefer et al.: mAP ≈ 58.3%, mAR ≈ 11.7%
- GradCAM/IG: mAP < 60%, mAR < 12%
- AtMan operates with MAGMA-13B/30B and BLIP models where gradient explainers run out of memory.
These results indicate parity or superiority over state-of-the-art gradient explainers in both qualitative and quantitative metrics, with markedly reduced resource constraints.
6. Modality-Agnosticism, Limitations, and Prospects
AtMan's reliance solely on attention-score manipulation after the $QK^{\top}$ computation enables seamless application to text-only, vision-only, or multi-modal generative transformers, including encoder-decoders and decoder-with-adapter architectures. No architectural changes, backward passes, or parameter adjustments are imposed, ensuring broad compatibility with production deployments.
Nevertheless, AtMan’s O(n) pass requirement for naive per-token attribution may introduce latency on very long input sequences. Batch or chunked processing, as well as focused adaptive search, can alleviate this. The method requires tuning two hyperparameters: the suppression factor $f$ (default 0.9) and the cosine-similarity threshold $\tau$ (default 0.7); these are robust across tasks but may occasionally warrant task-specific adjustment. By construction, AtMan does not provide per-attention-head relevance explanations, as all manipulation is aggregated across heads.
Future work includes focus-adaptive search for impact-prioritized probing, alternative similarity metrics beyond plain input embedding cosine similarity (e.g., BERT embeddings for text), layer-wise and head-wise manipulations for granular explanatory localization, and joint explanation-guided fine-tuning for improved interpretability and model performance.
7. Summary and Significance
AtMan constitutes a forward-pass–only perturbation mechanism targeting the information flow within attention modules to yield competitive token- or patch-level relevance explanations at a fraction of the memory and computational cost associated with gradient-based and classical perturbation methods. Its extensibility to large-model and multi-modal deployments, combined with competitive or superior empirical metrics, situates AtMan as a practical and efficient tool for transformer interpretability at scale (Deiseroth et al., 2023).