CLAP: Causal Layer Attribution for Neural Models
- The paper introduces CLAP, a method using intervention-based activation patching to quantify layer contributions to a model’s factual output.
- It employs a three-stage protocol—activation caching, logit-difference metric computation, and activation patching—to yield both qualitative insights and quantitative recovery metrics.
- Results reveal distinct knowledge localization, with output layers fully restoring performance and intermediate layers showing partial, distributed effects.
Causal Layer Attribution via Activation Patching (CLAP) is a mechanistic interpretability and model-editing technique that quantifies the contribution of individual neural network layers (or edges in computational graphs) to a model’s factual output preferences. Grounded in the framework of causal interventions, CLAP assesses which layers are causally responsible for producing correct versus incorrect outputs by systematically patching activations from a reference (clean) run into a corrupted (distractor) context. This approach produces both qualitative mechanistic insight—differentiating localized and distributed knowledge—and quantitative metrics useful for targeted model editing and circuit discovery (Bahador, 3 Apr 2025, Syed et al., 2023).
1. Formal Foundations and Causal Framework
CLAP is formulated within the Structural Causal Model (SCM) paradigm, treating hidden activations at each layer as “treatable” variables amenable to surgical do-interventions. For an LLM with transformer layers $l = 1, \dots, L$, input $x$, and output vocabulary $V$, the hidden states propagate as

$$h^{(0)} = \operatorname{Embed}(x), \qquad h^{(l)} = f^{(l)}\big(h^{(l-1)}\big), \quad l = 1, \dots, L.$$

Intervening at layer $l$ by replacing $h^{(l)}$ with a cached value $\tilde{h}^{(l)}$ is equivalent to the SCM operation $\operatorname{do}\big(H^{(l)} = \tilde{h}^{(l)}\big)$, directly testing the causal effect of this layer’s activations on the logit output and, by extension, on downstream preference for correct versus incorrect answers (Bahador, 3 Apr 2025).
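The do-intervention can be illustrated on a toy layered model, a hypothetical stand-in for a transformer; the layer functions and inputs below are illustrative assumptions, not the paper's model:

```python
# Toy illustration of a do-intervention on a layered model.
# Each "layer" computes h_l = f_l(h_{l-1}); patching at layer l
# overrides its activation with a cached value before continuing.

def forward(x, layers, patch=None):
    """Run the layer stack; optionally do(H^(l) = h_patch) at one layer."""
    h = x
    for l, f in enumerate(layers):
        h = f(h)
        if patch is not None and patch[0] == l:
            h = patch[1]  # surgical do-intervention: override this layer's output
    return h

# Illustrative layer functions (assumptions for the sketch).
layers = [lambda h: 2 * h, lambda h: h + 3]

clean_h0 = forward(5, layers[:1])                      # cache clean layer-0 activation: 10
corrupted_out = forward(1, layers)                     # corrupted run: (1*2)+3 = 5
patched_out = forward(1, layers, patch=(0, clean_h0))  # do(H^(0)=10): 10+3 = 13
clean_out = forward(5, layers)                         # clean run: (5*2)+3 = 13
```

Here patching the clean layer-0 activation into the corrupted run fully restores the clean output, the idealized case of a causally sufficient layer.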
2. CLAP Experimental Protocol and Mathematical Metrics
The CLAP workflow comprises three principal stages:
- Activation Caching: Run the model on both contexts and store the layerwise activations:
  - $h^{(l)}_{\text{clean}}$ (clean prompt, correct answer)
  - $h^{(l)}_{\text{corr}}$ (same prompt, incorrect answer)
- Logit-Difference Metric Computation: The model's preference is defined as the expected difference in logits between the token set $Y^{+}$ (correct answer) and $Y^{-}$ (incorrect answer):

$$\Delta = \mathbb{E}\big[\operatorname{logit}(Y^{+}) - \operatorname{logit}(Y^{-})\big],$$

giving $\Delta_{\text{clean}}$ and $\Delta_{\text{corr}}$ for the clean and corrupted runs.
- Activation Patching (Causal Intervention): For each layer $l$, perform:
  - A forward pass on the corrupted input with $h^{(l)}$ set to $h^{(l)}_{\text{clean}}$ (i.e., a $\operatorname{do}$-intervention).
  - Evaluation of the recovered logit difference $\Delta_{\text{patched}}(l)$.
  - Computation of the fractional recovery:

$$R(l) = \frac{\Delta_{\text{patched}}(l) - \Delta_{\text{corr}}}{\Delta_{\text{clean}} - \Delta_{\text{corr}}}$$
Recovery values range from $0$ (no effect) to $1$ (full restoration of preference).
Statistical validation proceeds via paired t-tests comparing the pre- and post-patch logit differences $\Delta$, with p-values demonstrating layer-specific significance (Bahador, 3 Apr 2025).
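Once the logit differences are cached, the protocol reduces to simple bookkeeping. A minimal sketch follows; the layer names and numeric values are illustrative assumptions chosen only to mirror the shape of the reported results, not measurements:

```python
def fractional_recovery(delta_clean, delta_corr, delta_patched):
    """R(l) = (Δ_patched(l) − Δ_corr) / (Δ_clean − Δ_corr).

    0 means patching layer l had no effect on the preference;
    1 means it fully restored the clean preference.
    """
    return (delta_patched - delta_corr) / (delta_clean - delta_corr)

# Illustrative logit differences, logit(correct) − logit(incorrect).
delta_clean, delta_corr = 4.0, -1.0

# Per-layer patched logit differences (hypothetical values).
patched = {"ffn_1": 1.8, "conv1d_mid": -0.32, "out_proj": 4.0}

recovery = {name: fractional_recovery(delta_clean, delta_corr, d)
            for name, d in patched.items()}
# e.g. out_proj: (4.0 − (−1.0)) / (4.0 − (−1.0)) = 1.0, full restoration
```

A patched layer whose $\Delta_{\text{patched}}$ equals $\Delta_{\text{clean}}$ scores exactly 1, which is the signature of fully localized knowledge discussed below.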
3. Layerwise and Edgewise Attribution: Practical Variants
CLAP applies at multiple granularities:
- Layer-wise CLAP localizes contributions of transformer blocks and specific submodules (feedforward, output projection, convolutional).
- Edge Attribution Patching (EAP): An “edgewise” generalization that uses a linear Taylor expansion to scalably attribute importance to each computational edge. EAP executes only two forward passes (clean, corrupted) and one backward pass (covering all edges at once), leveraging the first-order approximation

$$\hat{\Delta}_e \approx \big(z_e^{\text{corr}} - z_e^{\text{clean}}\big)^{\top} \nabla_{z_e} \mathcal{L}\big(z^{\text{clean}}\big)$$

for each edge $e$ with activation $z_e$ and task metric $\mathcal{L}$. The absolute attribution scores $|\hat{\Delta}_e|$ support importance ranking and subgraph pruning. EAP achieves circuit discovery comparable or superior to prior methods (e.g., ACDC) at orders-of-magnitude lower compute cost, with typical AUC values of 0.95 vs. ACDC’s 0.87 on IOI and 0.89 vs. 0.83 on “Greater-Than” (Syed et al., 2023).
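The first-order approximation can be checked numerically on a toy differentiable metric. Everything below (the quadratic metric, weights, and activations) is an illustrative stand-in, not the papers' model; the point is that one gradient at the clean activations predicts every edge's effect without a patched run per edge:

```python
# EAP sketch: approximate the metric change from corrupting each edge
# with one gradient evaluation instead of |E| patched forward passes.
# Toy metric L(z) = sum(w_i * z_i^2) with a hand-coded gradient.

w = [0.5, -1.0, 2.0]           # per-edge weights (assumed)
z_clean = [1.0, 2.0, 0.5]      # cached clean activations
z_corr = [0.0, 1.5, 1.0]       # cached corrupted activations

def metric(z):
    return sum(wi * zi ** 2 for wi, zi in zip(w, z))

def grad(z):                   # dL/dz_i = 2 * w_i * z_i
    return [2 * wi * zi for wi, zi in zip(w, z)]

g = grad(z_clean)              # the single "backward pass"

# EAP attribution per edge: (z_corr − z_clean) · ∇L at the clean point.
eap_scores = [(zc - zk) * gi for zc, zk, gi in zip(z_corr, z_clean, g)]

# Exact per-edge effect: patch one corrupted value into the clean run.
def exact_effect(i):
    z = list(z_clean)
    z[i] = z_corr[i]
    return metric(z) - metric(z_clean)

exact = [exact_effect(i) for i in range(len(w))]
# Rankings by |score| tend to agree; magnitudes differ by the curvature
# term that the linear approximation drops.
```

On this toy metric the EAP scores are $(-1.0,\ 2.0,\ 1.0)$ versus exact effects $(-0.5,\ 1.75,\ 1.5)$: the top-ranked edge matches, while the magnitudes drift, foreshadowing the overestimation issue noted in the limitations below.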
4. Empirical Findings and Quantitative Results
Application of CLAP to a 12-layer GPT-2 fine-tuned on 9,958 PubMed abstracts (epilepsy, EEG, seizure) highlights a functional dissociation between localized and distributed factual representations:
| Layer / Component | Fractional Recovery (%) | Statistical Significance | Interpretation |
|---|---|---|---|
| First feedforward sublayer | ≈ 56% | p < 0.001 | Substantial but incomplete; associative knowledge |
| Final output projection | 100% | p < 0.0001 | Fully localized, definitional knowledge |
| Intermediate Conv1D | 13.6% | p = 0.008 | Minor, low-level feature mixing |
Intermediate and convolutional layers yield only partial recovery, whereas patching the final projection achieves full restoration of correct-answer preference (Bahador, 3 Apr 2025).
5. Mechanistic Interpretations and Knowledge Localization
The results differentiate factual knowledge organization in autoregressive transformers:
- Definitional/single-hop knowledge (e.g., explicit acronym expansions) is highly localized in the final output projection weights. Full (100%) recovery via patching indicates such knowledge is “packed” into a single layer.
- Associative/multi-hop reasoning (e.g., indirect relationships) is distributed, requiring integration across several intermediate layers; no single patch suffices for complete recovery (e.g., only 56% at the first feedforward).
- Low-level transformations (Conv1D, attention mixing) provide marginal contribution (13.6%), indicating limited involvement in explicit high-level retrieval.
A plausible implication is that model-editing interventions must be adaptive: direct factual updates (definitions) target output weights, while editing associative facts requires coordinated changes across a distributed pathway (Bahador, 3 Apr 2025).
6. Computational Efficiency and Practical Advantages
Relative to exhaustive layer/edge-level activation patching (as exemplified by ACDC), CLAP, especially in its EAP form, offers major efficiency gains. EAP assigns edgewise importance using only two forward passes and a single backward pass per example, as opposed to one patched forward pass per edge per example, an orders-of-magnitude reduction in computation on GPT-2-small (Syed et al., 2023). In practice, EAP outperforms or matches the ground-truth circuit recovery of previous methods while maintaining computational tractability for large models.
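The cost gap is simple pass-counting; a sketch under illustrative sizes (the edge and example counts below are assumptions, not figures from the papers):

```python
# Compare pass counts: exhaustive per-edge patching vs. EAP's fixed
# cost of two forward passes plus one backward pass per example.

n_edges = 30_000    # assumed number of computational-graph edges
n_examples = 100    # assumed number of evaluation examples

exhaustive_passes = n_edges * n_examples   # one patched forward per edge per example
eap_passes = (2 + 1) * n_examples          # 2 forward + 1 backward per example

speedup = exhaustive_passes / eap_passes   # grows linearly with the edge count
```

Because the exhaustive cost scales with the number of edges while EAP's does not, the speedup factor grows with model size, which is what makes edgewise attribution tractable for large models.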
7. Limitations and Recommendations
Identified limitations include:
- The linear approximation can fail for non-linear metric responses, especially at embedding edges or with highly non-linear metrics (e.g., KL-divergence near a zero baseline).
- Occasional systematic overestimation of effect sizes by EAP (empirically, best-fit slope ≈ 0.5), suggesting benefit from hybrid pipelines (EAP for initial pruning, followed by exact patching).
- Task-adaptive methodology is essential: the localization versus distribution of knowledge is task-specific, requiring interpreters to tailor probe and edit strategies accordingly (Syed et al., 2023, Bahador, 3 Apr 2025).
These findings reconcile previous divergent observations regarding parameter localization in neural models and establish a rigorous, efficient foundation for mechanistic interpretability and model editing grounded in causal intervention principles.