EAP-IG: Edge Attribution with Integrated Gradients
- The paper introduces EAP-IG, a gradient-based method that incorporates integrated gradients to yield more causally reliable edge attributions in transformer circuits.
- It employs an m-step interpolation of activations to overcome vanishing gradient issues, efficiently estimating the contribution of each edge.
- Empirical evaluations on tasks like Subject-Verb Agreement demonstrate that EAP-IG achieves higher normalized faithfulness and edge score correlations than prior approaches.
Edge Attribution Patching with Integrated Gradients (EAP-IG) is a gradient-based methodology for identifying and scoring edges in the computational graph of LLMs, specifically in the context of interpretability using the circuits framework. EAP-IG is designed to improve the faithfulness of extracted circuits—subgraphs explaining model behavior on specific tasks—while retaining scalability required for large models. By leveraging integrated gradients, EAP-IG addresses key limitations of previous gradient-based approximations to intervention and provides more causally reliable edge attributions, which is critical for trustworthy mechanistic interpretability in LLMs (Hanna et al., 2024).
1. Background: Circuits and Edge Attribution Patching
The circuits framework in LLM interpretability aims to identify the minimal computational subgraph (circuit) sufficient to explain model behavior for a given task. Traditionally, edges for these circuits are selected via causal interventions, patching activations along individual edges and measuring their effect on model performance. However, exhaustive intervention is computationally infeasible for large models.
Edge Attribution Patching (EAP) was introduced as a scalable, gradient-based proxy for these interventions. For a given edge $e = (u, v)$, with clean activation $z_u$ and corrupted activation $z'_u$ (from a corrupted input), EAP estimates the causal effect on the loss $L$ from patching:
$\mathrm{EAP}_e = (z_u - z'_u) \cdot \frac{\partial L}{\partial a_v}$
Here, $a_v$ is the pre-activation input to node $v$ in the clean forward pass, and the sign convention matches the EAP-IG score in Section 3. This method is computationally efficient, requiring only one backward pass on the clean input and one forward pass on the corrupted input, but it may fail to reflect the true causal influence when the local gradient is zero, even though the loss is sensitive to larger changes in $a_v$.
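The zero-gradient failure mode can be made concrete with a toy example. The sketch below is not from the paper: the loss function, its flat region, and the activation values are all assumptions chosen so that the gradient at the clean activation is exactly zero while real patching still changes the loss.

```python
# Toy illustration (assumed setup, not the paper's): a single edge u -> v
# whose downstream loss is flat around the clean activation.

def loss(a_v: float) -> float:
    """Hypothetical loss as a function of node v's input; flat for a_v >= 1."""
    return max(0.0, 1.0 - a_v) ** 2


def grad_loss(a_v: float) -> float:
    """Analytic derivative of `loss` with respect to a_v."""
    return 2.0 * min(0.0, a_v - 1.0)


z_clean, z_corrupt = 2.0, 0.0  # assumed clean / corrupted outputs of u

# EAP: activation difference times the gradient at the clean activation.
eap_score = (z_clean - z_corrupt) * grad_loss(z_clean)

# Ground truth: actually patch in the corrupted activation and re-evaluate.
true_effect = loss(z_corrupt) - loss(z_clean)

print("EAP score:", eap_score)
print("true change in loss from patching:", true_effect)
```

In this setup the linear approximation reports no effect because the clean point sits in the flat region, whereas the intervention itself moves the loss.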
2. Integrated Gradients: Motivation and Mathematical Formulation
The standard EAP formulation is similar to the "gradient × input" attribution scheme, which suffers when the gradient vanishes at the evaluation point. Integrated Gradients (IG) was developed to address this issue by integrating the gradient along a straight-line path in activation space between a baseline and the actual input, thereby accumulating nonzero gradient contributions even when endpoints have zero gradient (Sundararajan et al., 2017).
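For a differentiable scalar-valued function $F$, given input $x$ and baseline $x'$, the IG for coordinate $i$ is:
$\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F(x' + \alpha (x - x'))}{\partial x_i} \, d\alpha$
Practically, this is approximated via Riemann sums with $m$ steps:
$\mathrm{IG}_i(x) \approx (x_i - x'_i) \cdot \frac{1}{m} \sum_{k=1}^m \frac{\partial F\left(x' + \frac{k}{m}(x - x')\right)}{\partial x_i}$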
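The Riemann-sum approximation can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `integrated_gradients`, the test function `F`, and its hand-written gradient `grad_F` are all assumptions introduced here. It uses right endpoints ($\alpha_k = k/m$) and checks the completeness property, i.e. that the attributions sum approximately to $F(x) - F(x')$.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_f: Callable[[Sequence[float]], List[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    m: int = 50,
) -> List[float]:
    """Right-endpoint Riemann approximation of IG with m steps."""
    avg_grad = [0.0] * len(x)
    for k in range(1, m + 1):
        alpha = k / m
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, gi in enumerate(grad_f(point)):
            avg_grad[i] += gi / m
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Hypothetical test function F(x) = x0^2 + 3*x0*x1 and its analytic gradient.
def F(x: Sequence[float]) -> float:
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def grad_F(x: Sequence[float]) -> List[float]:
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


x, baseline = [1.0, 2.0], [0.0, 0.0]
attributions = integrated_gradients(grad_F, x, baseline, m=200)
print(attributions, sum(attributions), F(x) - F(baseline))
```

Increasing `m` tightens the approximation at the cost of more gradient evaluations, which is the same trade-off EAP-IG faces with forward and backward passes.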
3. Adapting IG to Edge Attributions in Transformers
To adapt IG for circuits in transformers, each edge $e = (u, v)$ is treated as a "feature" with clean value $z_u$ and baseline $z'_u$ (obtained from a corrupted input), and $F$ is taken as the loss $L$ computed over blended activations. The resulting EAP-IG score for edge $e$ is:
$\mathrm{EAP\mathchar`-IG}_e = (z_u - z'_u) \cdot \left(\frac{1}{m} \sum_{k=1}^m \frac{\partial L(a_v(\alpha_k))}{\partial a_v}\right)$
where $\alpha_k = k/m$ and $a_v(\alpha_k)$ is the input to node $v$ when the upstream activation is set to
$z'_u + \alpha_k (z_u - z'_u)$,
that is, a linear interpolation from the corrupted to the clean activation. This requires $m$ forward and backward passes, each of which yields gradients for every edge at once (a small $m$ typically suffices in practice).
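A minimal sketch of the score computation follows. It is a toy under stated assumptions, not the paper's implementation: node $v$ is assumed to have a single incoming edge so that its input equals $u$'s output, and the function `eap_ig_score`, the saturating loss, and the activation values are hypothetical. With `m=1` the interpolation collapses to the clean activation, so the score reduces to the single-gradient EAP estimate.

```python
from typing import Callable, List, Sequence


def eap_ig_score(
    z_clean: Sequence[float],
    z_corrupt: Sequence[float],
    grad_wrt_input: Callable[[Sequence[float]], List[float]],
    m: int = 5,
) -> float:
    """(z - z') dotted with the gradient averaged over m blended activations."""
    avg_grad = [0.0] * len(z_clean)
    for k in range(1, m + 1):
        alpha = k / m
        # Linear interpolation from the corrupted to the clean activation.
        blended = [zc + alpha * (z - zc) for z, zc in zip(z_clean, z_corrupt)]
        g = grad_wrt_input(blended)
        avg_grad = [a + gi / m for a, gi in zip(avg_grad, g)]
    return sum((z - zc) * g for z, zc, g in zip(z_clean, z_corrupt, avg_grad))


# Hypothetical loss L(a) = max(0, 1 - sum(a))^2, flat once sum(a) >= 1.
def toy_loss(a: Sequence[float]) -> float:
    return max(0.0, 1.0 - sum(a)) ** 2


def toy_grad(a: Sequence[float]) -> List[float]:
    g = 2.0 * min(0.0, sum(a) - 1.0)
    return [g for _ in a]


z_clean, z_corrupt = [1.5, 0.5], [0.0, 0.0]  # assumed activations
eap_like = eap_ig_score(z_clean, z_corrupt, toy_grad, m=1)
eap_ig = eap_ig_score(z_clean, z_corrupt, toy_grad, m=5)
exact = toy_loss(z_clean) - toy_loss(z_corrupt)
print(eap_like, eap_ig, exact)
```

In this toy the single-step score misses the effect entirely, while five steps already recover its sign and much of its magnitude; more steps bring the score closer to the exact loss difference.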
4. EAP-IG Algorithm: Circuit Discovery Workflow
The EAP-IG circuit identification procedure consists of the following steps:
- Precomputing Activations: Run the model on corrupted inputs to record $z'_u$ for every relevant edge.
- Computing EAP-IG Scores: For $k = 1, \dots, m$:
  - Set $u$'s output for each edge to $z'_u + \alpha_k (z_u - z'_u)$.
  - Run a forward pass with these blended activations to compute the loss $L$.
  - Backpropagate to obtain $\partial L / \partial a_v$.
  - After the final step, average the gradients and compute edge scores as above.
- Greedy Circuit Assembly:
  - Initialize the circuit's node set $N$ with the logits node and its edge set $E$ as empty.
  - Iteratively add the candidate edge $(u, v)$ with the largest absolute score, where the child $v$ is in $N$, adding its parent $u$ to $N$, until $|E| = n$.
  - Prune disconnected nodes.
- Faithfulness Evaluation: Evaluate the circuit by ablating all edges outside the circuit, patching their activations to $z'_u$, and compute the normalized faithfulness metric (see next section).
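The greedy assembly step can be sketched with a priority queue. This is an illustrative sketch only: the function `greedy_circuit`, the node names, and the scores are hypothetical, and ranking by absolute score is an assumption. Because every edge is added only once its child is already in the circuit, each node in this sketch reaches the logits by construction, so the pruning step is not shown.

```python
import heapq
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (parent u, child v)


def greedy_circuit(
    scores: Dict[Edge, float], n_edges: int, root: str = "logits"
) -> Tuple[Set[str], List[Edge]]:
    """Grow a circuit backwards from `root`, taking the best candidate edge each step."""
    incoming: Dict[str, List[Edge]] = {}
    for (u, v) in scores:
        incoming.setdefault(v, []).append((u, v))

    nodes: Set[str] = {root}
    circuit: List[Edge] = []
    heap: List[Tuple[float, Edge]] = []

    def push_incoming(node: str) -> None:
        # Edges into `node` become candidates once `node` is in the circuit.
        for e in incoming.get(node, []):
            heapq.heappush(heap, (-abs(scores[e]), e))

    push_incoming(root)
    while heap and len(circuit) < n_edges:
        _, (u, v) = heapq.heappop(heap)
        circuit.append((u, v))
        if u not in nodes:
            nodes.add(u)
            push_incoming(u)
    return nodes, circuit


# Hypothetical edge scores on a tiny graph.
scores = {
    ("head1", "logits"): 0.9,
    ("mlp0", "logits"): -0.2,
    ("input", "head1"): 0.5,
    ("mlp0", "head1"): -0.7,
    ("input", "mlp0"): 0.8,
}
nodes, circuit = greedy_circuit(scores, n_edges=3)
print(circuit)
```

Note that the edge from `input` to `mlp0` only becomes eligible after `mlp0` joins the circuit, which is what distinguishes this procedure from simply taking the top-$n$ edges globally.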
5. Faithfulness Metric and Its Significance
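A circuit $C$ is called faithful if, when all non-circuit edges are "corrupted" (i.e., their activations are replaced with those from corrupted inputs $x'$), the model's performance remains approximately the same as when using clean inputs $x$. For a node $v$ with incoming edges $E_v$, and edge indicator $i_e = 1$ for $e \in C$ ($0$ otherwise), the input after intervention is:
$a_v = \sum_{e = (u, v) \in E_v} \left[ i_e \, z_u + (1 - i_e) \, z'_u \right]$
Letting $M_C$ be the model output metric post-intervention, and $M_{\text{clean}}$ and $M_{\text{corr}}$ the full model's metric on clean and corrupted inputs, performance is normalized as:
$\text{normalized faithfulness} = \frac{M_C - M_{\text{corr}}}{M_{\text{clean}} - M_{\text{corr}}}$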
Values near 1 indicate maintenance of full-model behavior, values near 0 indicate collapse to the corrupted baseline. Unlike overlap-based measures (e.g., Jaccard index of node/edge sets), faithfulness directly quantifies causal sufficiency, avoiding misleading scenarios where structural similarity does not reflect mechanism preservation.
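The intervention and the normalization can be written out as two small helpers. This is a sketch under assumptions: activations are scalars for brevity, the function names are hypothetical, and the normalization is taken to be (circuit - corrupted) / (clean - corrupted), which matches the endpoints of 1 and 0 described above.

```python
from typing import Sequence


def patched_input(
    z_clean: Sequence[float],
    z_corrupt: Sequence[float],
    in_circuit: Sequence[bool],
) -> float:
    """Input to a node: clean activation for circuit edges, corrupted otherwise."""
    return sum(
        z if keep else z_c
        for z, z_c, keep in zip(z_clean, z_corrupt, in_circuit)
    )


def normalized_faithfulness(
    metric_circuit: float, metric_clean: float, metric_corrupt: float
) -> float:
    """1.0 means full-model behavior is kept; 0.0 means the corrupted baseline."""
    return (metric_circuit - metric_corrupt) / (metric_clean - metric_corrupt)


# Assumed toy values: three incoming edges, the middle one outside the circuit.
a_v = patched_input([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [True, False, True])
score = normalized_faithfulness(metric_circuit=2.5, metric_clean=3.0, metric_corrupt=1.0)
print(a_v, score)
```

The normalization is undefined when the clean and corrupted metrics coincide, so a practical implementation would need to guard against that case.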
6. Empirical Results: Task Performance and Comparative Effectiveness
EAP-IG was evaluated on six mechanistic tasks with GPT-2 small: Indirect Object Identification (IOI), Gender-Bias, Greater-Than, Country-Capital, Subject-Verb Agreement (SVA), and Hypernymy. For each, edge scores were computed using EAP, EAP-IG (with a fixed number of interpolation steps $m$), and ground-truth activation patching. Circuits of varying size $n$ were assembled greedily.
Key empirical findings:
- Across all six tasks, EAP-IG circuits exhibit higher normalized faithfulness than EAP circuits, frequently by margins of 0.1–0.2.
- In Subject-Verb Agreement, EAP yields near-zero faithfulness over a range of circuit sizes $n$, whereas EAP-IG reaches substantially higher faithfulness.
- For Greater-Than, the EAP-IG improvement is roughly 0.1; for Country-Capital, a gap persists across most circuit sizes $n$.
- Node and edge overlap with activation patching circuits is comparably high for EAP and EAP-IG.
- EAP-IG edge scores achieve higher Pearson correlation ($r$) with activation patching than EAP, and top-$k$ edge selections align more closely for small $k$.
- While activation patching (ground truth) sometimes outperforms both, its computational demands are prohibitive for large models (Hanna et al., 2024).
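The two score-level comparisons mentioned in the findings, Pearson correlation and agreement of top-ranked edges, can be sketched as follows. The helper names and all score values are hypothetical, and ranking edges by absolute score is an assumption; the numbers are illustrative and are not results from the paper.

```python
import math
from typing import Dict, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def top_k_overlap(a: Dict[str, float], b: Dict[str, float], k: int) -> float:
    """Fraction of the k highest-|score| edges shared by two scorings."""
    def top(s: Dict[str, float]) -> set:
        return set(sorted(s, key=lambda e: abs(s[e]), reverse=True)[:k])
    return len(top(a) & top(b)) / k


# Hypothetical scores for four edges under two methods.
patching = {"e1": 1.0, "e2": -0.8, "e3": 0.1, "e4": 0.05}
approx = {"e1": 0.7, "e2": -0.1, "e3": 0.3, "e4": 0.0}

edges = sorted(patching)
r = pearson([patching[e] for e in edges], [approx[e] for e in edges])
overlap = top_k_overlap(patching, approx, k=2)
print(r, overlap)
```

As the surrounding discussion notes, high agreement on these measures does not by itself guarantee a faithful circuit, so they complement rather than replace the faithfulness metric.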
7. Advantages, Limitations, and Implications for Scalable Interpretability
Advantages:
- EAP-IG enhances circuit faithfulness, better preserving causal mechanisms over vanilla EAP.
- The method remains scalable: its cost grows only with the number of interpolation steps $m$, adding a handful of forward/backward passes compared to EAP.
- Compatible with any differentiable loss function, including KL divergence.
- Recovers circuits with precision and recall comparable to manual circuits.
Limitations:
- EAP-IG remains an approximation; activation patching circuits can still outperform it.
- It requires a baseline (corrupted activations) and a hyperparameter $m$; however, a small $m$ is typically sufficient.
- Faithfulness evaluation is expensive for large models as it requires extensive intervention.
- Potential for missing negative-causal-effect components if only absolute scores are considered.
Broader Implications:
Faithfulness is advocated as the primary criterion for evaluating circuit extraction, supplanting node/edge overlap, which fails to capture causal sufficiency. EAP-IG is positioned as a scalable and more faithful refinement over EAP, and forms a practical compromise for large-scale mechanistic study of LLMs. Future research directions include integrating completeness metrics, optimizing faithfulness evaluation, and developing weighted overlap measures that better accord with edge importance (Hanna et al., 2024).