CLAP: Causal Layer Attribution for Neural Models
- The paper introduces CLAP, a method using intervention-based activation patching to quantify layer contributions to a model’s factual output.
- It employs a three-stage protocol—activation caching, logit-difference metric computation, and activation patching—to yield both qualitative insights and quantitative recovery metrics.
- Results reveal distinct knowledge localization, with output layers fully restoring performance and intermediate layers showing partial, distributed effects.
Causal Layer Attribution via Activation Patching (CLAP) is a mechanistic interpretability and model-editing technique that quantifies the contribution of individual neural network layers (or edges in computational graphs) to a model’s factual output preferences. Grounded in the framework of causal interventions, CLAP assesses which layers are causally responsible for producing correct versus incorrect outputs by systematically patching activations from a reference (clean) run into a corrupted (distractor) context. This approach produces both qualitative mechanistic insight—differentiating localized and distributed knowledge—and quantitative metrics useful for targeted model editing and circuit discovery (Bahador, 3 Apr 2025, Syed et al., 2023).
1. Formal Foundations and Causal Framework
CLAP is formulated within the Structural Causal Model (SCM) paradigm, treating hidden activations at each layer as “treatable” variables amenable to surgical do-interventions. For an LLM with transformer layers $l = 1, \dots, L$, input $x$, and output vocabulary $V$, the hidden states propagate as

$$h^{(0)} = \operatorname{Embed}(x), \qquad h^{(l)} = f^{(l)}\big(h^{(l-1)}\big), \quad l = 1, \dots, L.$$

Intervening at layer $l$ by replacing $h^{(l)}$ with a cached value $\tilde{h}^{(l)}$ is equivalent to the SCM operation $\operatorname{do}\big(H^{(l)} = \tilde{h}^{(l)}\big)$, directly testing the causal effect of this layer’s activations on the logit output and, by extension, on downstream preference for correct versus incorrect answers (Bahador, 3 Apr 2025).
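The do-intervention can be illustrated on a toy layered model, a hypothetical stand-in for a transformer; the layer functions and inputs below are illustrative assumptions, not the paper's model:

```python
# Toy illustration of a do-intervention on a layered model.
# Each "layer" computes h_l = f_l(h_{l-1}); patching at layer l
# overrides its activation with a cached value before continuing.

def forward(x, layers, patch=None):
    """Run the layer stack; optionally do(H^(l) = h_patch) at one layer."""
    h = x
    for l, f in enumerate(layers):
        h = f(h)
        if patch is not None and patch[0] == l:
            h = patch[1]  # surgical do-intervention: override this layer's output
    return h

# Illustrative layer functions (assumptions for the sketch).
layers = [lambda h: 2 * h, lambda h: h + 3]

clean_h0 = forward(5, layers[:1])                      # cache clean layer-0 activation: 10
corrupted_out = forward(1, layers)                     # corrupted run: (1*2)+3 = 5
patched_out = forward(1, layers, patch=(0, clean_h0))  # do(H^(0)=10): 10+3 = 13
clean_out = forward(5, layers)                         # clean run: (5*2)+3 = 13
```

Here patching the clean layer-0 activation into the corrupted run fully restores the clean output, the idealized case of a causally sufficient layer.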
2. CLAP Experimental Protocol and Mathematical Metrics
The CLAP workflow comprises three principal stages:
- Activation Caching: Run the model on both contexts and store the layerwise activations:
  - $h^{(l)}_{\text{clean}}$ (clean prompt, correct answer)
  - $h^{(l)}_{\text{corr}}$ (same prompt, incorrect answer)
- Logit-Difference Metric Computation: The model's preference is defined as the expected difference in logits between the token set $Y^{+}$ (correct answer) and $Y^{-}$ (incorrect answer):

$$\Delta = \mathbb{E}\big[\operatorname{logit}(Y^{+}) - \operatorname{logit}(Y^{-})\big],$$

giving $\Delta_{\text{clean}}$ and $\Delta_{\text{corr}}$ for the clean and corrupted runs.
- Activation Patching (Causal Intervention): For each layer $l$, perform:
  - A forward pass on the corrupted input with $h^{(l)}$ set to $h^{(l)}_{\text{clean}}$ (i.e., a $\operatorname{do}$-intervention).
  - Evaluation of the recovered logit difference $\Delta_{\text{patched}}(l)$.
  - Computation of the fractional recovery:

$$R(l) = \frac{\Delta_{\text{patched}}(l) - \Delta_{\text{corr}}}{\Delta_{\text{clean}} - \Delta_{\text{corr}}}$$
Recovery values range from $0$ (no effect) to $1$ (full restoration of preference).
Statistical validation proceeds via paired t-tests comparing the pre- and post-patch logit differences $\Delta$, with p-values demonstrating layer-specific significance (Bahador, 3 Apr 2025).
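Once the logit differences are cached, the protocol reduces to simple bookkeeping. A minimal sketch follows; the layer names and numeric values are illustrative assumptions chosen only to mirror the shape of the reported results, not measurements:

```python
def fractional_recovery(delta_clean, delta_corr, delta_patched):
    """R(l) = (Δ_patched(l) − Δ_corr) / (Δ_clean − Δ_corr).

    0 means patching layer l had no effect on the preference;
    1 means it fully restored the clean preference.
    """
    return (delta_patched - delta_corr) / (delta_clean - delta_corr)

# Illustrative logit differences, logit(correct) − logit(incorrect).
delta_clean, delta_corr = 4.0, -1.0

# Per-layer patched logit differences (hypothetical values).
patched = {"ffn_1": 1.8, "conv1d_mid": -0.32, "out_proj": 4.0}

recovery = {name: fractional_recovery(delta_clean, delta_corr, d)
            for name, d in patched.items()}
# e.g. out_proj: (4.0 − (−1.0)) / (4.0 − (−1.0)) = 1.0, full restoration
```

A patched layer whose $\Delta_{\text{patched}}$ equals $\Delta_{\text{clean}}$ scores exactly 1, which is the signature of fully localized knowledge discussed below.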
3. Layerwise and Edgewise Attribution: Practical Variants
CLAP applies at multiple granularities:
- Layer-wise CLAP localizes contributions of transformer blocks and specific submodules (feedforward, output projection, convolutional).
- Edge Attribution Patching (EAP): An “edgewise” generalization that uses a linear Taylor expansion to scalably attribute importance to each computational edge. EAP executes only two forward passes (clean, corrupted) and one backward pass (covering all edges at once), leveraging the first-order approximation

$$\hat{\Delta}_e \approx \big(z_e^{\text{corr}} - z_e^{\text{clean}}\big)^{\top} \nabla_{z_e} \mathcal{L}\big(z^{\text{clean}}\big)$$

for each edge $e$ with activation $z_e$ and task metric $\mathcal{L}$. The absolute attribution scores $|\hat{\Delta}_e|$ support importance ranking and subgraph pruning. EAP achieves circuit discovery comparable or superior to prior methods (e.g., ACDC) at orders-of-magnitude lower compute cost, with typical AUC values of 0.95 vs. ACDC’s 0.87 on IOI and 0.89 vs. 0.83 on “Greater-Than” (Syed et al., 2023).
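The first-order approximation can be checked numerically on a toy differentiable metric. Everything below (the quadratic metric, weights, and activations) is an illustrative stand-in, not the papers' model; the point is that one gradient at the clean activations predicts every edge's effect without a patched run per edge:

```python
# EAP sketch: approximate the metric change from corrupting each edge
# with one gradient evaluation instead of |E| patched forward passes.
# Toy metric L(z) = sum(w_i * z_i^2) with a hand-coded gradient.

w = [0.5, -1.0, 2.0]           # per-edge weights (assumed)
z_clean = [1.0, 2.0, 0.5]      # cached clean activations
z_corr = [0.0, 1.5, 1.0]       # cached corrupted activations

def metric(z):
    return sum(wi * zi ** 2 for wi, zi in zip(w, z))

def grad(z):                   # dL/dz_i = 2 * w_i * z_i
    return [2 * wi * zi for wi, zi in zip(w, z)]

g = grad(z_clean)              # the single "backward pass"

# EAP attribution per edge: (z_corr − z_clean) · ∇L at the clean point.
eap_scores = [(zc - zk) * gi for zc, zk, gi in zip(z_corr, z_clean, g)]

# Exact per-edge effect: patch one corrupted value into the clean run.
def exact_effect(i):
    z = list(z_clean)
    z[i] = z_corr[i]
    return metric(z) - metric(z_clean)

exact = [exact_effect(i) for i in range(len(w))]
# Rankings by |score| tend to agree; magnitudes differ by the curvature
# term that the linear approximation drops.
```

On this toy metric the EAP scores are $(-1.0,\ 2.0,\ 1.0)$ versus exact effects $(-0.5,\ 1.75,\ 1.5)$: the top-ranked edge matches, while the magnitudes drift, foreshadowing the overestimation issue noted in the limitations below.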
4. Empirical Findings and Quantitative Results
Application of CLAP to a 12-layer GPT-2 fine-tuned on 9,958 PubMed abstracts (epilepsy, EEG, seizure) highlights a functional dissociation between localized and distributed factual representations:
| Layer / Component | Fractional Recovery (%) | Statistical Significance | Interpretation |
|---|---|---|---|
| First feedforward sublayer | ≈ 56% | p < 0.001 | Substantial but incomplete; associative knowledge |
| Final output projection | 100% | p < 0.0001 | Fully localized, definitional knowledge |
| Intermediate Conv1D | 13.6% | p = 0.008 | Minor, low-level feature mixing |
Intermediate and convolutional layers yield only partial recovery, whereas patching the final projection achieves full restoration of correct-answer preference (Bahador, 3 Apr 2025).
5. Mechanistic Interpretations and Knowledge Localization
The results differentiate factual knowledge organization in autoregressive transformers:
- Definitional/single-hop knowledge (e.g., explicit acronym expansions) is highly localized in the final output projection weights. Full (100%) recovery via patching indicates such knowledge is “packed” into a single layer.
- Associative/multi-hop reasoning (e.g., indirect relationships) is distributed, requiring integration across several intermediate layers; no single patch suffices for complete recovery (e.g., only 56% at the first feedforward).
- Low-level transformations (Conv1D, attention mixing) provide marginal contribution (13.6%), indicating limited involvement in explicit high-level retrieval.
A plausible implication is that model-editing interventions must be adaptive: direct factual updates (definitions) target output weights, while editing associative facts requires coordinated changes across a distributed pathway (Bahador, 3 Apr 2025).
6. Computational Efficiency and Practical Advantages
Relative to exhaustive layer/edge-level activation patching (as exemplified by ACDC), CLAP, especially in its EAP form, offers major efficiency gains. EAP assigns edgewise importance using only two forward passes and a single backward pass per example, as opposed to one patched forward pass per edge per example, an orders-of-magnitude reduction in computation on GPT-2-small (Syed et al., 2023). In practice, EAP outperforms or matches the ground-truth circuit recovery of previous methods while maintaining computational tractability for large models.
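The cost gap is simple pass-counting; a sketch under illustrative sizes (the edge and example counts below are assumptions, not figures from the papers):

```python
# Compare pass counts: exhaustive per-edge patching vs. EAP's fixed
# cost of two forward passes plus one backward pass per example.

n_edges = 30_000    # assumed number of computational-graph edges
n_examples = 100    # assumed number of evaluation examples

exhaustive_passes = n_edges * n_examples   # one patched forward per edge per example
eap_passes = (2 + 1) * n_examples          # 2 forward + 1 backward per example

speedup = exhaustive_passes / eap_passes   # grows linearly with the edge count
```

Because the exhaustive cost scales with the number of edges while EAP's does not, the speedup factor grows with model size, which is what makes edgewise attribution tractable for large models.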
7. Limitations and Recommendations
Identified limitations include:
- The linear approximation can fail for non-linear metric responses, especially at embedding edges or with highly non-linear metrics (e.g., KL-divergence near a zero baseline).
- Occasional systematic overestimation of effect sizes by EAP (empirically, best-fit slope ≈ 0.5), suggesting benefit from hybrid pipelines (EAP for initial pruning, followed by exact patching).
- Task-adaptive methodology is essential: the localization versus distribution of knowledge is task-specific, requiring interpreters to tailor probe and edit strategies accordingly (Syed et al., 2023, Bahador, 3 Apr 2025).
These findings reconcile previous divergent observations regarding parameter localization in neural models and establish a rigorous, efficient foundation for mechanistic interpretability and model editing grounded in causal intervention principles.