EAP-IG: Edge Attribution Patching with Integrated Gradients

Updated 13 February 2026
  • The paper introduces EAP-IG, a gradient-based method that applies integrated gradients to edge attribution patching, yielding more causally reliable edge attributions in transformer circuits.
  • It employs an $m$-step interpolation of activations to overcome zero-gradient issues, efficiently estimating the contribution of each edge.
  • Empirical evaluations on tasks like Subject-Verb Agreement demonstrate that EAP-IG achieves higher normalized faithfulness and edge score correlations than prior approaches.

Edge Attribution Patching with Integrated Gradients (EAP-IG) is a gradient-based methodology for identifying and scoring edges in the computational graph of LLMs, specifically in the context of interpretability using the circuits framework. EAP-IG is designed to improve the faithfulness of extracted circuits (subgraphs explaining model behavior on specific tasks) while retaining the scalability required for large models. By leveraging integrated gradients, EAP-IG addresses key limitations of previous gradient-based approximations to intervention and provides more causally reliable edge attributions, which is critical for trustworthy mechanistic interpretability in LLMs (Hanna et al., 2024).

1. Background: Circuits and Edge Attribution Patching

The circuits framework in LLM interpretability aims to identify the minimal computational subgraph (circuit) sufficient to explain model behavior for a given task. Traditionally, edges for these circuits are selected via causal interventions, patching activations along individual edges and measuring their effect on model performance. However, exhaustive intervention is computationally infeasible for large models.

Edge Attribution Patching (EAP) was introduced as a scalable, gradient-based proxy for these interventions. For a given edge $e = (u \to v)$, with clean activation $z = Z_u$ and corrupted activation $z' = Z_u'$ (from a corrupted input), EAP estimates the causal effect on the loss $L$ from patching:

$\Delta L_e \approx (z' - z) \cdot \frac{\partial L}{\partial a_v}$

Here, $a_v$ is the pre-activation input to node $v$ in the clean forward pass. This method is computationally efficient, requiring only two forward passes (one clean, one corrupted) and one backward pass on the clean input. However, it may fail to reflect the true causal influence when the local gradient is zero even though $L$ is sensitive to changes in $z$.
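The zero-gradient failure can be seen in a scalar toy case. The sketch below is purely illustrative: the ReLU-shaped loss, the activation values, and the helper names (`loss`, `grad_loss`, `eap_score`) are assumptions rather than anything taken from the paper, and node $v$ is assumed to have a single incoming edge so that $a_v = z$.

```python
def loss(a_v: float) -> float:
    """Hypothetical toy loss: flat for a_v <= 0, linear afterwards."""
    return max(0.0, a_v)


def grad_loss(a_v: float) -> float:
    """Analytic derivative of the toy loss (taken as 0 at the kink)."""
    return 1.0 if a_v > 0 else 0.0


def eap_score(z_clean: float, z_corrupt: float) -> float:
    """First-order EAP estimate: (z' - z) * dL/da_v at the clean point."""
    return (z_corrupt - z_clean) * grad_loss(z_clean)


z_clean, z_corrupt = -1.0, 2.0

# The clean activation sits in the flat region, so the local gradient is zero.
estimate = eap_score(z_clean, z_corrupt)

# Actually patching in the corrupted activation does change the loss.
true_effect = loss(z_corrupt) - loss(z_clean)

print(f"EAP estimate: {estimate}, true patching effect: {true_effect}")
```

In this setup the estimate is zero while the true effect of patching is not, which is the gap that integrating the gradient along a path is meant to close.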

2. Integrated Gradients: Motivation and Mathematical Formulation

The standard EAP formulation is similar to the "gradient × input" attribution scheme, which suffers when the gradient vanishes at the evaluation point. Integrated Gradients (IG) was developed to address this issue by integrating the gradient along a straight-line path in activation space between a baseline and the actual input, thereby accumulating nonzero gradient contributions even when endpoints have zero gradient (Sundararajan et al., 2017).

For a differentiable scalar-valued function $f(x)$, given input $x$ and baseline $x'$, the IG for coordinate $i$ is:

$IG_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial f(x' + \alpha(x - x'))}{\partial x_i} \, d\alpha$

Practically, this is approximated via Riemann sums with $m$ steps:

$IG_i(x) \approx (x_i - x'_i) \frac{1}{m} \sum_{k=1}^m \frac{\partial f(x' + \frac{k}{m}(x - x'))}{\partial x_i}$
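The Riemann-sum approximation can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the test function `f`, the central-difference helper `numerical_grad`, and the choice of $m = 200$ are all hypothetical, and a real implementation would use automatic differentiation instead of finite differences.

```python
from typing import Callable, List


def numerical_grad(f: Callable[[List[float]], float],
                   x: List[float], eps: float = 1e-6) -> List[float]:
    """Central-difference gradient of f at x (a stand-in for autodiff)."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


def integrated_gradients(f: Callable[[List[float]], float],
                         x: List[float], baseline: List[float],
                         m: int = 50) -> List[float]:
    """Right Riemann sum over alpha_k = k/m for k = 1..m."""
    grad_sum = [0.0] * len(x)
    for k in range(1, m + 1):
        alpha = k / m
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad_sum = [s + g for s, g in zip(grad_sum, numerical_grad(f, point))]
    return [(xi - b) * s / m for xi, b, s in zip(x, baseline, grad_sum)]


def f(x: List[float]) -> float:
    """Hypothetical smooth test function."""
    return x[0] ** 2 + 3 * x[0] * x[1]


x, baseline = [1.0, 2.0], [0.0, 0.0]
attributions = integrated_gradients(f, x, baseline, m=200)

# Completeness: attributions should sum to roughly f(x) - f(baseline).
print(sum(attributions), f(x) - f(baseline))
```

The final line checks the completeness property of IG: as $m$ grows, the attributions sum to the difference between the function value at the input and at the baseline.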

3. Adapting IG to Edge Attributions in Transformers

To adapt IG for circuits in transformers, each edge $(u \to v)$ is treated as a "feature" with clean value $z_u$ and baseline $z'_u$ (obtained from a corrupted input), and $f$ is taken as the loss $L$ computed over blended activations. The resulting EAP-IG score for edge $e$ is:

$\text{EAP-IG}_e = (z_u - z'_u) \cdot \left(\frac{1}{m} \sum_{k=1}^m \frac{\partial L(a_u(\alpha_k))}{\partial a_v}\right)$

where $\alpha_k = \frac{k}{m}$ and

$a_u(\alpha) = (1 - \alpha) z'_u + \alpha z_u$

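that is, a linear interpolation from the corrupted to the clean activation. Computing all edge scores requires $m$ forward and backward passes in total rather than per edge, because each backward pass yields the gradients for every edge at once; $m = 5$ typically suffices in practice. The difference term here is written clean minus corrupted, following the IG convention, so the score has the opposite sign to the EAP expression in Section 1; ranking edges by absolute score is unaffected.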
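A small worked example may make the averaging concrete. The sketch below assumes a hypothetical node $v$ with one incoming edge, a two-dimensional activation, and a toy loss $L(a_v) = \max(0, w \cdot a_v)$; the weights `W`, the activation values, and all function names are illustrative assumptions, and the gradient is written analytically instead of being obtained by backpropagation.

```python
from typing import List

W = [1.0, -1.0]  # hypothetical readout weights


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def loss(a_v: List[float]) -> float:
    """Toy loss: ReLU of a linear readout."""
    return max(0.0, dot(W, a_v))


def grad_loss(a_v: List[float]) -> List[float]:
    """Analytic gradient of the toy loss with respect to a_v."""
    return list(W) if dot(W, a_v) > 0 else [0.0] * len(W)


def eap_ig_score(z_clean: List[float], z_corrupt: List[float], m: int = 5) -> float:
    """(z - z') dotted with the gradient averaged over m interpolation steps."""
    grad_sum = [0.0] * len(z_clean)
    for k in range(1, m + 1):
        alpha = k / m
        # Linear interpolation from the corrupted (alpha=0) to the clean (alpha=1) activation.
        blended = [(1 - alpha) * zc + alpha * z for z, zc in zip(z_clean, z_corrupt)]
        grad_sum = [s + g for s, g in zip(grad_sum, grad_loss(blended))]
    avg_grad = [s / m for s in grad_sum]
    diff = [z - zc for z, zc in zip(z_clean, z_corrupt)]
    return dot(diff, avg_grad)


z_clean, z_corrupt = [0.0, 1.0], [2.0, 0.0]
diff = [z - zc for z, zc in zip(z_clean, z_corrupt)]

score = eap_ig_score(z_clean, z_corrupt, m=5)   # path-averaged gradient
single_point = dot(diff, grad_loss(z_clean))    # gradient at the clean point only
exact = loss(z_clean) - loss(z_corrupt)         # true change in the toy loss

print(score, single_point, exact)
```

Here three of the five interpolation points fall where the gradient is nonzero, so the averaged score comes out near the exact loss difference, whereas the single-point estimate at the clean activation is zero.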

4. EAP-IG Algorithm: Circuit Discovery Workflow

The EAP-IG circuit identification procedure consists of the following steps:

  1. Precomputing Activations: Run the model on the corrupted inputs $s'$ to record $z'_u$ for every relevant edge, and on the clean inputs $s$ to record $z_u$.
  2. Computing EAP-IG Scores: For $k = 1, \ldots, m$:
    • Set $u$'s output for each edge to $z_k(u) = z'_u + \frac{k}{m}(z_u - z'_u)$.
    • Run a forward pass with these blended activations to compute the loss $L_k$.
    • Backpropagate to obtain $g_k(u \to v) = \partial L_k / \partial a_v$.
    Then compute the edge scores $S(e)$ by averaging the $g_k$ over $k$ and multiplying by $(z_u - z'_u)$, as in Section 3.
  3. Greedy Circuit Assembly:
    • Initialize the circuit node set $C_V$ with the logits node; the edge set $C_E$ starts empty.
    • Iteratively add the candidate edge with the largest $|S(e)|$ whose child node is already in $C_V$, adding its parent to $C_V$, until $|C_E| = N$.
    • Prune disconnected nodes.
  4. Faithfulness Evaluation: Evaluate the circuit by ablating all edges outside $C_E$, patching their activations to $z'_u$, and compute the normalized faithfulness metric (see next section).
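The greedy assembly step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the node names, the score values, and the function `greedy_circuit` are hypothetical, edges are represented as `(parent, child)` string pairs, and the final pruning step is omitted.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (parent u, child v)


def greedy_circuit(scores: Dict[Edge, float], n_edges: int,
                   root: str = "logits") -> Tuple[Set[str], List[Edge]]:
    """Grow a circuit backwards from the root by largest absolute edge score."""
    nodes: Set[str] = {root}
    edges: List[Edge] = []
    remaining = dict(scores)
    while len(edges) < n_edges:
        # Only edges whose child is already in the circuit are candidates.
        candidates = [e for e in remaining if e[1] in nodes]
        if not candidates:
            break
        best = max(candidates, key=lambda e: abs(remaining[e]))
        edges.append(best)
        nodes.add(best[0])  # the parent becomes reachable
        del remaining[best]
    return nodes, edges


# Hypothetical edge scores for a tiny graph.
scores = {
    ("mlp1", "logits"): 0.9,
    ("attn1", "logits"): -0.5,
    ("attn0", "mlp1"): 0.7,
    ("input", "attn0"): 0.2,
    ("input", "attn1"): -0.6,
    ("attn0", "attn1"): 0.05,
}

circuit_nodes, circuit_edges = greedy_circuit(scores, n_edges=3)
print(circuit_edges)
```

Note that in this toy graph the edge `("input", "attn1")` has a larger absolute score than `("attn1", "logits")` but is not selected within the first three steps, because its child `attn1` only enters the circuit on the final step. This connectivity constraint is what distinguishes greedy assembly from simply taking the top-$N$ edges.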

5. Faithfulness Metric and Its Significance

A circuit $C$ is called faithful if, when all non-circuit edges are "corrupted" (i.e., their activations are replaced with those from corrupted inputs $s'$), the model's performance remains approximately the same as when using clean inputs $s$. For a node $v$ with incoming edges $E_v$, and edge indicator $i_e = 1$ for $e \in C$ ($0$ otherwise), the input after intervention is:

$a^C_v = \sum_{(u \to v) \in E_v} \left[ i_e z_u + (1 - i_e) z'_u \right]$

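Letting $M_C$ be the model output metric post-intervention, and $M_\text{clean}$ and $M_\text{corr}$ the full model's metric on clean and corrupted inputs respectively, performance is normalized as: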

$\text{Faithfulness}(C) = \frac{M_C - M_\text{corr}}{M_\text{clean} - M_\text{corr}}$

Values near 1 indicate that full-model behavior is maintained; values near 0 indicate collapse to the corrupted baseline. Unlike overlap-based measures (e.g., Jaccard index of node/edge sets), faithfulness directly quantifies causal sufficiency, avoiding misleading scenarios where structural similarity does not reflect mechanism preservation.
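Both the intervention and the normalization reduce to a few lines of arithmetic. The sketch below is illustrative only: activations are scalars for brevity, edges are `(parent, child)` pairs, and the function names, edge names, and metric values are hypothetical.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (parent u, child v)


def patched_input(v: str, z_clean: Dict[Edge, float],
                  z_corrupt: Dict[Edge, float], circuit: Set[Edge]) -> float:
    """Input to node v: clean activations on circuit edges, corrupted elsewhere."""
    total = 0.0
    for edge in z_clean:
        if edge[1] != v:
            continue
        total += z_clean[edge] if edge in circuit else z_corrupt[edge]
    return total


def normalized_faithfulness(m_circuit: float, m_clean: float, m_corrupt: float) -> float:
    """(M_C - M_corr) / (M_clean - M_corr)."""
    denom = m_clean - m_corrupt
    if denom == 0:
        raise ValueError("clean and corrupted metrics coincide; faithfulness is undefined")
    return (m_circuit - m_corrupt) / denom


# Hypothetical activations on the two edges feeding node "v".
z_clean = {("a", "v"): 1.0, ("b", "v"): 2.0}
z_corrupt = {("a", "v"): -1.0, ("b", "v"): 0.5}
a_v = patched_input("v", z_clean, z_corrupt, circuit={("a", "v")})

# Hypothetical metric values (e.g. a logit difference).
faith = normalized_faithfulness(m_circuit=2.75, m_clean=3.0, m_corrupt=0.5)
print(a_v, faith)
```

With these made-up numbers the circuit keeps the clean value on edge `a -> v` and the corrupted value on `b -> v`, and it recovers 90% of the gap between the corrupted and clean metrics.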

6. Empirical Results: Task Performance and Comparative Effectiveness

EAP-IG was evaluated on six mechanistic tasks with GPT-2 small: Indirect Object Identification (IOI), Gender-Bias, Greater-Than, Country-Capital, Subject-Verb Agreement (SVA), and Hypernymy. For each, edge scores were computed using EAP, EAP-IG (with $m = 5$), and ground-truth activation patching. Circuits of varying size ($n \in \{30, 40, \ldots, 100, 200, \ldots, 1000\}$) were assembled greedily.

Key empirical findings:

  • Across all six tasks, EAP-IG circuits exhibit higher normalized faithfulness than EAP circuits, frequently by margins of 0.1–0.2.
  • In Subject-Verb Agreement, EAP yields near-zero faithfulness for $n \leq 1000$, whereas EAP-IG reaches $\geq 0.85$ at $n \approx 100$.
  • For Greater-Than, the EAP-IG improvement is ~0.1 at $n \sim 200$; for Country-Capital, the gap is $\geq 0.2$ across most $n$.
  • Node and edge overlap with activation patching circuits is comparably high for EAP and EAP-IG.
  • EAP-IG edge scores achieve higher Pearson correlation ($r > 0.8$) with activation patching than EAP, and top-$n$ edge selections align more closely for small $n \leq 50$.
  • While activation patching (ground truth) sometimes outperforms both, its computational demands are prohibitive for large models (Hanna et al., 2024).

7. Advantages, Limitations, and Implications for Scalable Interpretability

Advantages:

  • EAP-IG enhances circuit faithfulness, better preserving causal mechanisms over vanilla EAP.
  • The method remains scalable, requiring only $m$ additional forward/backward passes compared to EAP.
  • Compatible with any differentiable loss function, including KL divergence.
  • Recovers circuits with precision and recall comparable to manual circuits.

Limitations:

  • EAP-IG remains an approximation; activation patching circuits can still outperform it.
  • It requires a baseline (corrupted activations) and a step-count hyperparameter $m$, although a small $m$ is typically sufficient.
  • Faithfulness evaluation is expensive for large models as it requires extensive intervention.
  • Potential for missing negative-causal-effect components if only absolute scores are considered.

Broader Implications:

Faithfulness is advocated as the primary criterion for evaluating circuit extraction, supplanting node/edge overlap, which fails to capture causal sufficiency. EAP-IG is positioned as a scalable and more faithful refinement over EAP, and forms a practical compromise for large-scale mechanistic study of LLMs. Future research directions include integrating completeness metrics, optimizing faithfulness evaluation, and developing weighted overlap measures that better accord with edge importance (Hanna et al., 2024).
