EAP-IG: Edge Attribution with Integrated Gradients

Updated 13 February 2026

The paper introduces EAP-IG, a gradient-based method that combines edge attribution patching with integrated gradients to yield more causally reliable edge attributions in transformer circuits.
It averages gradients over an m-step interpolation between corrupted and clean activations, which overcomes vanishing-gradient issues while still estimating the contribution of every edge efficiently.
Empirical evaluations on tasks such as Subject-Verb Agreement show that EAP-IG achieves higher normalized faithfulness, and higher edge-score correlation with activation patching, than EAP.

Edge Attribution Patching with Integrated Gradients (EAP-IG) is a gradient-based methodology for identifying and scoring edges in the computational graph of LLMs, specifically in the context of interpretability using the circuits framework. EAP-IG is designed to improve the faithfulness of extracted circuits—subgraphs explaining model behavior on specific tasks—while retaining scalability required for large models. By leveraging integrated gradients, EAP-IG addresses key limitations of previous gradient-based approximations to intervention and provides more causally reliable edge attributions, which is critical for trustworthy mechanistic interpretability in LLMs (Hanna et al., 2024).

1. Background: Circuits and Edge Attribution Patching

The circuits framework in LLM interpretability aims to identify the minimal computational subgraph (circuit) sufficient to explain model behavior for a given task. Traditionally, edges for these circuits are selected via causal interventions, patching activations along individual edges and measuring their effect on model performance. However, exhaustive intervention is computationally infeasible for large models.

Edge Attribution Patching (EAP) was introduced as a scalable, gradient-based proxy for these interventions. For a given edge $e = (u \to v)$ , with clean activation $z = Z_u$ and corrupted activation $z' = Z_u'$ (from a corrupted input), EAP estimates the causal effect on the loss $L$ from patching:

$\Delta L_e \approx (z' - z) \cdot \frac{\partial L}{\partial a_v}$

Here, $a_v$ is the pre-activation input to node $v$ in the clean forward pass. The method is computationally efficient: it needs one forward and one backward pass on the clean input plus one forward pass on the corrupted input, and these passes score every edge at once. However, because it is a first-order approximation, it can miss the true causal influence of an edge when the local gradient is zero even though $L$ is sensitive to larger changes in $z$.
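This failure mode is easy to reproduce on a toy example. The sketch below is illustrative only: it assumes a single scalar edge feeding node $v$ (so $a_v = z$) and a hypothetical ReLU loss, and the names `loss`, `loss_grad`, and `eap_score` are not taken from the paper.

```python
def loss(a):
    # Toy loss: a ReLU applied to the pre-activation input of node v.
    return max(0.0, a)


def loss_grad(a):
    # dL/da for the ReLU loss (taking the subgradient 0 at a == 0).
    return 1.0 if a > 0 else 0.0


def eap_score(z_clean, z_corr):
    # First-order EAP estimate: (z' - z) * dL/da_v, with the gradient
    # evaluated at the clean activation only.
    return (z_corr - z_clean) * loss_grad(z_clean)


z_clean, z_corr = -1.0, 2.0

# Gradient-based estimate of the change in loss from patching in z'.
estimate = eap_score(z_clean, z_corr)

# Change in loss actually obtained by patching in z'.
true_effect = loss(z_corr) - loss(z_clean)

print("EAP estimate:", estimate)
print("true effect: ", true_effect)
```

The clean activation sits in the flat region of the ReLU, so the estimate is zero even though patching changes the loss by 2.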

2. Integrated Gradients: Motivation and Mathematical Formulation

The standard EAP formulation resembles the "gradient × input" attribution scheme, which fails when the gradient vanishes at the evaluation point. Integrated Gradients (IG) addresses this by integrating the gradient along a straight-line path between a baseline and the actual input, thereby accumulating nonzero gradient contributions even when the gradient at the endpoints is zero (Sundararajan et al., 2017).

For a differentiable scalar-valued function $f(x)$ , given input $x$ and baseline $x'$ , the IG for coordinate $i$ is:

$IG_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial f(x' + \alpha(x - x'))}{\partial x_i} d\alpha$

Practically, this is approximated via Riemann sums with $m$ steps:

$IG_i(x) \approx (x_i - x'_i) \frac{1}{m} \sum_{k=1}^m \frac{\partial f(x' + \frac{k}{m}(x - x'))}{\partial x_i}$
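As a minimal sketch of this Riemann-sum approximation, the function below takes a user-supplied gradient function and averages it over the $m$ interpolation points. The toy function $f(x) = x_0^2 + 3x_1$ and the names `integrated_gradients`, `f`, and `grad_f` are assumptions chosen for illustration. By the completeness property of IG, the attributions should sum approximately to $f(x) - f(x')$.

```python
def integrated_gradients(grad_f, x, baseline, m=50):
    """Approximate IG with an m-step right Riemann sum."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(1, m + 1):
        alpha = k / m
        # Point on the straight line from the baseline x' to the input x.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(n):
            avg_grad[i] += g[i] / m
    # Scale the averaged gradient by (x_i - x'_i), coordinate by coordinate.
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]


def f(x):
    # Toy differentiable function (an assumption for this example).
    return x[0] ** 2 + 3.0 * x[1]


def grad_f(x):
    # Analytic gradient of f.
    return [2.0 * x[0], 3.0]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline, m=100)

print("attributions:", attributions)
print("sum of attributions:", sum(attributions))
print("f(x) - f(x'):", f(x) - f(baseline))
```

Increasing `m` tightens the match between the attribution sum and $f(x) - f(x')$, at the cost of one gradient evaluation per step.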

3. Adapting IG to Edge Attributions in Transformers

To adapt IG for circuits in transformers, each edge $(u \to v)$ is treated as a "feature" with clean value $z_u$ and baseline $z'_u$ (obtained from a corrupted input), and $f$ is taken as the loss $L$ computed over blended activations. The resulting EAP-IG score for edge $e$ is:

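$\mathrm{EAP\text{-}IG}_e = (z_u - z'_u) \cdot \left(\frac{1}{m} \sum_{k=1}^m \left.\frac{\partial L}{\partial a_v}\right|_{z_u(\alpha_k)}\right)$

where $\alpha_k = \frac{k}{m}$ and

$z_u(\alpha) = (1 - \alpha) z'_u + \alpha z_u$

that is, a linear interpolation from the corrupted to the clean activation, with the gradient with respect to $a_v$ evaluated on the forward pass that uses the blended activation $z_u(\alpha_k)$. This ordering follows the IG convention (input minus baseline), so the score has the opposite sign to the EAP estimate in Section 1, which measures the effect of patching in the corrupted activation. Edges are ranked by absolute score, so the sign does not affect circuit selection. Computing the scores requires $m$ forward and backward passes in total, shared across all edges rather than repeated per edge (typically $m=5$ suffices in practice).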
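To make the score concrete, here is a small sketch for a single scalar edge. It assumes a hypothetical ReLU loss applied directly to the edge activation (so $a_v$ equals the blended activation), and the function names are illustrative rather than taken from any EAP-IG implementation.

```python
def loss_grad(a):
    # dL/da_v for a toy ReLU loss L(a) = max(0, a).
    return 1.0 if a > 0 else 0.0


def eap_ig_score(z_clean, z_corr, m=5):
    """EAP-IG score for one scalar edge under the toy ReLU loss."""
    grad_sum = 0.0
    for k in range(1, m + 1):
        alpha = k / m
        # Linear interpolation from the corrupted to the clean activation.
        z_blend = (1 - alpha) * z_corr + alpha * z_clean
        grad_sum += loss_grad(z_blend)
    # (z_u - z'_u) times the gradient averaged over the m steps.
    return (z_clean - z_corr) * (grad_sum / m)


z_clean, z_corr = -1.0, 2.0

# Same prefactor, but with the gradient taken only at the clean point.
point_grad_score = (z_clean - z_corr) * loss_grad(z_clean)

ig_score = eap_ig_score(z_clean, z_corr, m=5)

# Exact loss difference L(z_u) - L(z'_u) for comparison.
exact_difference = max(0.0, z_clean) - max(0.0, z_corr)

print(point_grad_score, ig_score, exact_difference)
```

With $m=5$, three of the five interpolation points fall on the active side of the ReLU, so the averaged gradient is 0.6 and the score is about -1.8, close to the exact difference of -2. The single-point gradient gives zero.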

4. EAP-IG Algorithm: Circuit Discovery Workflow

The EAP-IG circuit identification procedure consists of the following steps:

Precomputing Activations: Run the model on the clean inputs $s$ and the corrupted inputs $s'$ to record $z_u$ and $z'_u$ for every relevant edge.
Computing EAP-IG Scores: For $k = 1, \ldots, m$:
- Set $u$'s output for each edge to $z_k(u) = z'_u + \frac{k}{m}(z_u - z'_u)$.
- Run a forward pass with these blended activations to compute the loss $L_k$.
- Backpropagate to obtain $g_k(u \to v) = \partial L_k / \partial a_v$.

Then compute each edge score $S(e)$ by averaging the $g_k$ over $k$ and taking the dot product with $(z_u - z'_u)$, as in Section 3.

Greedy Circuit Assembly:
- Initialize the circuit nodes $C_V$ with the logits node and the circuit edges $C_E$ as empty.
- Iteratively add the candidate edge with the largest $|S(e)|$ whose child is already in $C_V$, adding its parent to $C_V$, until $|C_E|=N$.
- Prune disconnected nodes.
Faithfulness Evaluation: Ablate all edges outside $C_E$ by patching their activations to $z'_u$, then compute the normalized faithfulness metric (see next section).
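The greedy assembly step can be sketched as follows. This is a simplified illustration under stated assumptions: edges are `(parent, child)` tuples with precomputed scores, the node names and score values are made up, and the final pruning step is omitted.

```python
def greedy_circuit(scores, n_edges, root="logits"):
    """Greedily grow a circuit backwards from the root node.

    scores maps (parent, child) edges to signed attribution scores.
    """
    nodes = {root}
    edges = []
    remaining = dict(scores)
    while len(edges) < n_edges:
        # Only edges whose child is already in the circuit are candidates.
        candidates = [e for e in remaining if e[1] in nodes]
        if not candidates:
            break  # no reachable edges left
        # Pick the candidate with the largest absolute score.
        best = max(candidates, key=lambda e: abs(remaining[e]))
        edges.append(best)
        nodes.add(best[0])  # the parent becomes part of the circuit
        del remaining[best]
    return nodes, edges


# Hypothetical edge scores for a tiny graph.
toy_scores = {
    ("h1", "logits"): 0.9,
    ("h2", "logits"): -0.2,
    ("input", "h1"): 0.5,
    ("input", "h2"): 0.8,
    ("h0", "h3"): 1.5,  # high score, but h3 never joins the circuit
}

circuit_nodes, circuit_edges = greedy_circuit(toy_scores, n_edges=3)
print(circuit_edges)
```

Note that `("input", "h2")` is skipped at the second step despite its high score, because `h2` is not yet in the circuit. This is the connectivity constraint that distinguishes greedy assembly from simply taking the top-$N$ edges.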

5. Faithfulness Metric and Its Significance

A circuit $C$ is called faithful if, when all non-circuit edges are "corrupted" (i.e., their activations are replaced with those from corrupted inputs $s'$ ), the model's performance remains approximately the same as when using clean inputs $s$ . For a node $v$ with incoming edges $E_v$ , and edge indicator $i_e=1$ for $e \in C$ ($0$ otherwise), the input after intervention is:

$a^C_v = \sum_{(u \to v) \in E_v} [ i_e z_u + (1 - i_e) z'_u ]$

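Letting $M_C$ be the model's task metric after this intervention, and $M_\text{clean}$ and $M_\text{corr}$ the metric of the full model on clean and corrupted inputs respectively, performance is normalized as: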

$\text{Faithfulness}(C) = \frac{M_C - M_\text{corr}}{M_\text{clean} - M_\text{corr}}$

Values near 1 indicate that the circuit maintains full-model behavior; values near 0 indicate collapse to the corrupted baseline. Unlike overlap-based measures (e.g., Jaccard index of node/edge sets), faithfulness directly quantifies causal sufficiency, avoiding misleading scenarios where structural similarity does not reflect mechanism preservation.
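Both formulas in this section translate directly into code. The sketch below uses scalar activations and made-up metric values for illustration; `patched_input` and `normalized_faithfulness` are hypothetical helper names.

```python
def patched_input(clean, corrupted, in_circuit):
    """Input to node v after intervention.

    Sums the clean activation for edges in the circuit and the
    corrupted activation for all other incoming edges.
    """
    return sum(
        z if keep else z_corr
        for z, z_corr, keep in zip(clean, corrupted, in_circuit)
    )


def normalized_faithfulness(m_circuit, m_clean, m_corr):
    """(M_C - M_corr) / (M_clean - M_corr)."""
    gap = m_clean - m_corr
    if gap == 0:
        raise ValueError("clean and corrupted metrics must differ")
    return (m_circuit - m_corr) / gap


# Three incoming edges; the second one lies outside the circuit.
a_v = patched_input(
    clean=[1.0, 2.0, 3.0],
    corrupted=[0.0, 0.5, -1.0],
    in_circuit=[True, False, True],
)

# Illustrative metric values (e.g., a logit difference).
faith = normalized_faithfulness(m_circuit=2.5, m_clean=3.0, m_corr=0.5)

print("patched input:", a_v)
print("faithfulness:", faith)
```

In this example the circuit recovers 80% of the gap between corrupted and clean performance. Values above 1 or below 0 are possible when the circuit outperforms the full model or underperforms the corrupted baseline.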

6. Empirical Results: Task Performance and Comparative Effectiveness

EAP-IG was evaluated on six mechanistic tasks with GPT-2 small: Indirect Object Identification (IOI), Gender-Bias, Greater-Than, Country-Capital, Subject-Verb Agreement (SVA), and Hypernymy. For each, edge scores were computed using EAP, EAP-IG (with $m=5$), and ground-truth activation patching. Circuits of varying size ($n \in \{30,40,\ldots,100,200,\ldots,1000\}$) were assembled greedily.

Key empirical findings:

Across all six tasks, EAP-IG circuits exhibit higher normalized faithfulness than EAP circuits, frequently by margins of 0.1–0.2.
In Subject-Verb Agreement, EAP yields near-zero faithfulness for $n \leq 1000$ , whereas EAP-IG reaches $\geq 0.85$ at $n \approx 100$ .
For Greater-Than, the EAP-IG improvement is ~0.1 at $n \sim 200$ ; for Country-Capital, the gap is $\geq 0.2$ across most $n$ .
Node and edge overlap with activation patching circuits is comparably high for EAP and EAP-IG.
EAP-IG edge scores achieve higher Pearson correlation ($r > 0.8$) with activation patching than EAP, and top-$n$ edge selections align more closely for small $n \leq 50$.
While activation patching (ground truth) sometimes outperforms both, its computational demands are prohibitive for large models (Hanna et al., 2024).

7. Advantages, Limitations, and Implications for Scalable Interpretability

Advantages:

EAP-IG enhances circuit faithfulness, better preserving causal mechanisms over vanilla EAP.
The method remains scalable: it needs $m$ forward/backward passes where EAP needs one, so its cost is roughly $m$ times that of EAP.
Compatible with any differentiable loss function, including KL divergence.
Recovers circuits with precision and recall comparable to manual circuits.

Limitations:

EAP-IG remains an approximation; activation patching circuits can still outperform it.
It requires a baseline (corrupted activations) and a step-count hyperparameter $m$, although a small $m$ is typically sufficient.
Faithfulness evaluation is expensive for large models as it requires extensive intervention.
Potential for missing negative-causal-effect components if only absolute scores are considered.

Broader Implications:

Faithfulness is advocated as the primary criterion for evaluating circuit extraction, supplanting node/edge overlap, which fails to capture causal sufficiency. EAP-IG is positioned as a scalable and more faithful refinement over EAP, and forms a practical compromise for large-scale mechanistic study of LLMs. Future research directions include integrating completeness metrics, optimizing faithfulness evaluation, and developing weighted overlap measures that better accord with edge importance (Hanna et al., 2024).


References

Hanna et al. (2024). Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms.
