
CLAP: Causal Layer Attribution for Neural Models

Updated 9 February 2026
  • The paper introduces CLAP, a method using intervention-based activation patching to quantify layer contributions to a model’s factual output.
  • It employs a three-stage protocol—activation caching, logit-difference metric computation, and activation patching—to yield both qualitative insights and quantitative recovery metrics.
  • Results reveal distinct knowledge localization, with output layers fully restoring performance and intermediate layers showing partial, distributed effects.

Causal Layer Attribution via Activation Patching (CLAP) is a mechanistic interpretability and model-editing technique that quantifies the contribution of individual neural network layers (or edges in computational graphs) to a model’s factual output preferences. Grounded in the framework of causal interventions, CLAP assesses which layers are causally responsible for producing correct versus incorrect outputs by systematically patching activations from a reference (clean) run into a corrupted (distractor) context. This approach produces both qualitative mechanistic insight—differentiating localized and distributed knowledge—and quantitative metrics useful for targeted model editing and circuit discovery (Bahador, 3 Apr 2025, Syed et al., 2023).

1. Formal Foundations and Causal Framework

CLAP is formulated within the Structural Causal Model (SCM) paradigm, treating hidden activations at each layer as “treatable” variables amenable to surgical do-interventions. For an LLM $f_\theta:\mathbb{Z}^n\to\mathbb{R}^{|V|}$ with $L$ transformer layers, input $x$, and output vocabulary $V$, the hidden states propagate as

$$h^{(0)} = \text{Embed}(x),\qquad h^{(l)} = \text{Block}^{(l)}(h^{(l-1)})\ \text{for } l=1,\dots,L,\qquad \text{logits} = W_{\text{out}}\, h^{(L)}.$$

Intervening at layer $l$ via $h^{(l)}\leftarrow h^{(l)}_{\text{clean}}$ is equivalent to the SCM operation $\text{do}(h^{(l)}\leftarrow h^{(l)}_{\text{clean}})$, directly testing the causal effect of this layer’s activations on the logit output and, by extension, on downstream preference for correct versus incorrect answers (Bahador, 3 Apr 2025).
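
The do-intervention can be sketched numerically. Below is a minimal numpy toy with hypothetical fixed weights and two tanh blocks standing in for transformer layers (not the paper's GPT-2): overwriting the final hidden state of a corrupted run with its clean-run value forces the clean logits.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical fixed weights for a 2-block toy model (illustrative only).
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))
W_out = rng.normal(size=(5, 4))

def forward(h0, patch_layer=None, patch_value=None):
    """Propagate h0 through both blocks, optionally applying do(h^(l) <- patch_value)."""
    h = np.tanh(W1 @ h0)
    if patch_layer == 1:
        h = patch_value
    h = np.tanh(W2 @ h)
    if patch_layer == 2:
        h = patch_value
    return W_out @ h  # logits

x_clean = rng.normal(size=4)
x_corr = rng.normal(size=4)

# Cache the clean activations (the reference run).
h1_clean = np.tanh(W1 @ x_clean)
h2_clean = np.tanh(W2 @ h1_clean)

# Intervene: run the corrupted input, but overwrite layer 2 with the clean activation.
logits_patched = forward(x_corr, patch_layer=2, patch_value=h2_clean)

# Patching the last pre-readout state reproduces the clean run's logits exactly.
assert np.allclose(logits_patched, forward(x_clean))
```

In a real transformer the same overwrite is done with a forward hook on the chosen block; the toy only illustrates that a do-intervention is literally an assignment to one intermediate variable during the forward pass.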

2. CLAP Experimental Protocol and Mathematical Metrics

The CLAP workflow comprises three principal stages:

  1. Activation Caching: Given a prompt pair, generate and store:
    • $a_l^{\text{clean}} := h^{(l)}(x_{\text{clean}})$ (clean prompt, correct answer)
    • $a_l^{\text{corr}} := h^{(l)}(x_{\text{corr}})$ (corrupted prompt, incorrect answer)
  2. Logit-Difference Metric Computation: For an input $x$, the model's preference is the difference between the mean logit over the correct-answer token set $t_c$ and the mean logit over the incorrect-answer set $t_w$:

$$\Delta(x) = \frac{1}{|t_c|}\sum_{i\in t_c} f_\theta(x)[i] - \frac{1}{|t_w|}\sum_{j\in t_w} f_\theta(x)[j],\qquad \Delta_{\text{clean}} = \Delta(x_{\text{clean}}),\quad \Delta_{\text{corr}} = \Delta(x_{\text{corr}})$$

  3. Activation Patching (Causal Intervention): For each layer $l$, perform:

    • Forward pass on $x_{\text{corr}}$ with $h^{(l)}$ set to $a_l^{\text{clean}}$ (i.e., a $\text{do}$-intervention).
    • Evaluate the recovered logit difference:

    $$\Delta_l^{\text{patch}} = \frac{1}{|t_c|}\sum_{i\in t_c} f_\theta\big(x_{\text{corr}}; \text{do}(h^{(l)}\leftarrow a_l^{\text{clean}})\big)[i] - \frac{1}{|t_w|}\sum_{j\in t_w} f_\theta\big(x_{\text{corr}}; \text{do}(h^{(l)}\leftarrow a_l^{\text{clean}})\big)[j]$$

  • Compute fractional recovery:

    $$\text{Recovery}_l = \frac{\Delta_l^{\text{patch}}-\Delta_{\text{corr}}}{\Delta_{\text{clean}}-\Delta_{\text{corr}}}$$

    Recovery values range from $0$ (no effect; the corrupted preference is unchanged) to $1$ (full restoration of the clean preference).
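
The three stages can be run end to end on a small toy. The sketch below uses hypothetical weights and token ids (not the paper's fine-tuned GPT-2); `mlp1`/`mlp2` stand in for feedforward sublayers writing into a residual stream, and `h2` for the final residual state feeding the output projection.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical toy model with a residual stream (illustrative only).
W1 = rng.normal(size=(6, 6))
W2 = rng.normal(size=(6, 6))
W_out = rng.normal(size=(8, 6))
t_c, t_w = [0, 1], [5]  # hypothetical correct / incorrect answer token ids

def run(x, patch=None):
    """Forward pass; patch=(site, value) applies do(site <- value)."""
    m1 = np.tanh(W1 @ x)
    if patch and patch[0] == "mlp1":
        m1 = patch[1]
    h1 = x + m1                       # residual stream after block 1
    m2 = np.tanh(W2 @ h1)
    if patch and patch[0] == "mlp2":
        m2 = patch[1]
    h2 = h1 + m2                      # final residual stream
    if patch and patch[0] == "h2":
        h2 = patch[1]
    return W_out @ h2, {"mlp1": m1, "mlp2": m2, "h2": h2}

def logit_diff(logits):
    """Delta(x): mean correct-token logit minus mean wrong-token logit."""
    return logits[t_c].mean() - logits[t_w].mean()

x_clean, x_corr = rng.normal(size=6), rng.normal(size=6)

# Stage 1: activation caching on the clean (reference) run.
logits_clean, acts_clean = run(x_clean)
logits_corr, _ = run(x_corr)

# Stage 2: logit-difference metric on both baseline runs.
d_clean, d_corr = logit_diff(logits_clean), logit_diff(logits_corr)

# Stage 3: patch each site into the corrupted run; compute fractional recovery.
recovery = {}
for site in ("mlp1", "mlp2", "h2"):
    logits_patched, _ = run(x_corr, patch=(site, acts_clean[site]))
    recovery[site] = (logit_diff(logits_patched) - d_corr) / (d_clean - d_corr)

# Patching the final residual state restores the clean preference completely;
# patching one sublayer leaves the corrupted input's residual contribution intact,
# so it generally yields only partial recovery.
assert np.isclose(recovery["h2"], 1.0)
```

The residual connections are what make sublayer patches partial: the corrupted input still reaches the readout through the skip path, mirroring the partial-versus-full recovery pattern reported for intermediate versus output layers.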

Statistical validation proceeds via paired t-tests comparing pre- and post-patch $\Delta$ values, with p-values demonstrating layer-specific significance (Bahador, 3 Apr 2025).

3. Layerwise and Edgewise Attribution: Practical Variants

CLAP applies at multiple granularities:

  • Layer-wise CLAP localizes contributions of transformer blocks and specific submodules (feedforward, output projection, convolutional).
  • Edge Attribution Patching (EAP): An “edgewise” generalization that uses a linear Taylor expansion to attribute importance to each computational edge at scale. EAP executes only two forward passes (clean, corrupted) and one backward pass (covering all edges), leveraging the first-order approximation:

$$\Delta_e L \equiv (e_{\text{corr}}-e_{\text{clean}})^\top \nabla_e L\big|_{e=e_{\text{clean}}}$$

This produces absolute attribution scores $|\Delta_e L|$ for importance ranking and subgraph pruning. EAP achieves circuit-discovery performance comparable or superior to prior methods (e.g., ACDC) at orders-of-magnitude lower compute cost, with typical AUC values (IOI: 0.95 vs. ACDC’s 0.87; “Greater-Than”: 0.89 vs. 0.83) (Syed et al., 2023).
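The first-order estimate can be checked on a toy. The sketch below uses a hypothetical separable metric $L(e) = w^\top\tanh(e)$ over edge activations as a stand-in for the task metric, with the gradient written analytically rather than via autograd; none of these names come from the papers.

```python
import numpy as np

rng = np.random.default_rng(2)
n_edges = 8
w = rng.normal(size=n_edges)        # hypothetical readout weights
e_clean = rng.normal(size=n_edges)  # edge activations, clean run
e_corr = rng.normal(size=n_edges)   # edge activations, corrupted run

def metric(e):
    """Hypothetical scalar task metric over edge activations."""
    return w @ np.tanh(e)

# The single "backward pass": gradient of the metric at the clean activations.
grad = w * (1.0 - np.tanh(e_clean) ** 2)

# EAP scores for ALL edges at once from two activation sets and one gradient.
eap = (e_corr - e_clean) * grad

def exact_effect(e_alt):
    """Ground truth: one patched evaluation per edge (the O(M) cost EAP avoids)."""
    return np.array([
        metric(np.where(np.arange(n_edges) == i, e_alt, e_clean)) - metric(e_clean)
        for i in range(n_edges)
    ])

# Sanity check: for a small corruption the linear estimate matches the exact
# patch effect edge by edge (the approximation degrades as the shift grows).
eps = 1e-4
e_small = e_clean + eps * (e_corr - e_clean)
assert np.allclose(eps * eap, exact_effect(e_small), atol=1e-6)
```

The degradation at large shifts is exactly the failure mode discussed under Limitations below: the Taylor expansion is only locally faithful, so large or highly non-linear metric responses can be mis-scored.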

4. Empirical Findings and Quantitative Results

Application of CLAP to a 12-layer GPT-2 fine-tuned on 9,958 PubMed abstracts (epilepsy, EEG, seizure) highlights a functional dissociation between localized and distributed factual representations:

| Layer / Component | Fractional Recovery | Statistical Significance | Interpretation |
|---|---|---|---|
| First feedforward sublayer | ≈ 56% | p < 0.001 | Substantial but incomplete (associative) |
| Final output projection $W_{\text{out}}$ | 100% | p < 0.0001 | Fully localized, definitional knowledge |
| Intermediate Conv1D | 13.6% | p = 0.008 | Minor, low-level feature mixing |

Intermediate and convolutional layers yield only partial recovery, whereas patching the final projection achieves full restoration of correct-answer preference (Bahador, 3 Apr 2025).

5. Mechanistic Interpretations and Knowledge Localization

The results differentiate factual knowledge organization in autoregressive transformers:

  • Definitional/single-hop knowledge (e.g., explicit acronym expansions) is highly localized in the output weights $W_{\text{out}}$. Full (100%) recovery via patching indicates such knowledge is “packed” into a single layer.
  • Associative/multi-hop reasoning (e.g., indirect relationships) is distributed, requiring integration across several intermediate layers; no single patch suffices for complete recovery (e.g., only 56% at the first feedforward).
  • Low-level transformations (Conv1D, attention mixing) provide marginal contribution (13.6%), indicating limited involvement in explicit high-level retrieval.

A plausible implication is that model-editing interventions must be adaptive: direct factual updates (definitions) target output weights, while editing associative facts requires coordinated changes across a distributed pathway (Bahador, 3 Apr 2025).

6. Computational Efficiency and Practical Advantages

Relative to exhaustive layer/edge-level activation patching (as exemplified by ACDC), CLAP, especially in its EAP form, offers major efficiency gains. EAP assigns edgewise importance using only two forward passes and a single backward pass per example, as opposed to $O(MN)$ forward passes for $M$ edges and $N$ examples, achieving a $10^3$–$10^4\times$ reduction in computation on GPT-2-small (Syed et al., 2023). In practice, EAP outperforms or matches the ground-truth circuit recovery of previous methods while maintaining computational tractability for large models.
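The pass-count arithmetic behind that reduction is easy to make concrete. The numbers below are illustrative assumptions (not the papers' exact edge or example counts), with one backward pass counted as roughly two forward passes.

```python
# Illustrative values only; M and N are hypothetical, not the papers' counts.
M, N = 30_000, 100   # candidate edges, evaluation examples

# Exhaustive activation patching: one patched forward pass per edge per example.
cost_patching = M * N

# EAP: two forward passes plus one backward (counted as ~2 forwards) per
# example, scoring all M edges simultaneously.
cost_eap = N * (2 + 2)

speedup = cost_patching / cost_eap
print(f"~{speedup:,.0f}x fewer forward-pass equivalents")  # prints ~7,500x
```

Under these assumptions the saving scales linearly with the number of candidate edges $M$, which is why the gap widens for larger models.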

7. Limitations and Recommendations

Identified limitations include:

  • Poor linear approximation of non-linear metric responses, especially for embedding edges or highly non-linear metrics (e.g., KL-divergence at zero baseline).
  • Occasional systematic overestimation of effect sizes by EAP (empirically, best-fit slope ≈ 0.5), suggesting benefit from hybrid pipelines (EAP for initial pruning, followed by exact patching).
  • Task-adaptive methodology is essential: the localization versus distribution of knowledge is task-specific, requiring interpreters to tailor probe and edit strategies accordingly (Syed et al., 2023, Bahador, 3 Apr 2025).

These findings reconcile previous divergent observations regarding parameter localization in neural models and establish a rigorous, efficient foundation for mechanistic interpretability and model editing grounded in causal intervention principles.
