Grad-ELLM: LLM Attribution Method
- Grad-ELLM is a method that provides per-token attribution for decoder-only LLMs by integrating gradient sensitivity with spatial attention.
- It leverages layer-wise linear decomposition and channel-wise gradient weighting to efficiently compute token contributions without extensive perturbation.
- The framework introduces π-Soft-NS and π-Soft-NC metrics for a rigorous evaluation of attribution faithfulness, addressing computational and methodological challenges.
Grad-ELLM (Gradient-based Explanations for Decoder-only LLMs) is a specialized input-attribution framework designed to provide faithful and efficient explanations for the outputs of decoder-only transformer-based LLMs. Developed to address the limitations of generic attribution methods on modern autoregressive architectures, Grad-ELLM combines sensitivity analysis through gradients with internal transformer attention mechanisms to yield step-wise, per-token interpretability that is both computationally tractable and empirically robust (Huang et al., 6 Jan 2026).
1. Motivation and Limitations of Model-Agnostic Attribution
The predominance of large autoregressive LLMs has prompted demand for transparent and faithful attribution methods, specifically those assignable on a per-token, per-step basis. Conventional model-agnostic approaches—such as LIME, vanilla saliency, Integrated Gradients, and DeepLIFT—suffer critical drawbacks when applied to decoder-only transformers:
- Transformer Architectural Blindness: They neglect the role of structured components like self-attention and feed-forward layers, treating the network as a monolithic function and disregarding the compositional flow of information.
- Computational Inefficiency: Many require $O(n)$ or more forward passes per input (one per perturbed token), as in perturbation-based schemes that mask or re-sample tokens individually, leading to intractability on long contexts.
- Non-natural Perturbation Regimes: Techniques that employ “hard” perturbations—e.g., token deletion or outright masking—may produce off-manifold inputs never encountered during model training, resulting in misleading faithfulness scores.
Grad-ELLM circumvents these issues by exploiting the operational structure of the transformer decoder, maintaining alignment with the model's native computation flow (Huang et al., 6 Jan 2026).
2. Attribution Mechanism in Grad-ELLM
The Grad-ELLM method synthesizes gradient-based channel importance and spatial attention scores at every generation step, leveraging internal transformer representations without necessitating architectural modification.
Layer-wise Linear Decomposition
At each generation step $t$ for output token $y_t$, the model computes a logit vector conditioned on all prior inputs and outputs. Grad-ELLM decomposes the target logit $z_t$ via a first-order Taylor expansion as:

$$z_t \;\approx\; \sum_{l=1}^{L} W\, a^{(l)},$$

where $a^{(l)}$ is the output of the $l$-th attention block and $W$ denotes the final linear projection.
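A minimal numeric sketch of this decomposition, under assumed toy shapes rather than the paper's implementation: when the target logit is a linear head applied to a residual stream that sums per-layer attention-block outputs, the logit splits exactly into per-layer contributions.

```python
import numpy as np

# Toy check of the layer-wise linear decomposition (shapes assumed):
# the target logit is the head row W applied to the summed residual stream,
# so it decomposes exactly into per-layer terms W . a^(l).
rng = np.random.default_rng(0)
L, d = 4, 8                      # number of layers, hidden size
a = rng.normal(size=(L, d))      # a[l] = output of the l-th attention block
W = rng.normal(size=d)           # final-projection row for the target token

z = W @ a.sum(axis=0)            # logit from the full residual stream
per_layer = a @ W                # per-layer contributions W . a^(l)

assert np.isclose(z, per_layer.sum())   # the logit decomposes layer-wise
```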
Channel-wise Gradient Weighting
For each layer $l$, the partial logit $z^{(l)} = W\,a^{(l)}$ is locally linearized:

$$z^{(l)} \;\approx\; \sum_{c} w^{(l)}_{c}\, a^{(l)}_{c}, \qquad w^{(l)}_{c} = \frac{\partial z_t}{\partial a^{(l)}_{c}},$$

These weights $w^{(l)}_{c}$ represent channel sensitivities, evaluating the marginal effect of each channel $c$ on the target logit.
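A finite-difference sketch of the channel weights, in a toy linear setting (setup assumed, not the paper's code): the gradient of the logit with respect to each channel of one layer's attention output matches the analytic head weights, and the weighted channels reassemble the partial logit.

```python
import numpy as np

# Toy check of channel-wise gradient weighting: for a linear target logit
# z(a) = W . a, the channel sensitivities dz/da_c equal W_c, and
# sum_c w_c * a_c recovers the partial logit for this layer.
rng = np.random.default_rng(1)
d = 8
W = rng.normal(size=d)
a_l = rng.normal(size=d)                       # one layer's attention output

def z(a):                                      # target logit as a function of a^(l)
    return W @ a

eps = 1e-6
w = np.array([(z(a_l + eps * np.eye(d)[c]) - z(a_l)) / eps for c in range(d)])

assert np.allclose(w, W, atol=1e-4)            # gradient matches analytic weights
assert np.isclose(w @ a_l, z(a_l))             # partial logit: sum_c w_c * a_c
```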
Spatial Scoring via Attention
The attention output at each layer is itself a weighted sum over all contextual tokens. For input token $i$ in layer $l$:

$$a^{(l)}_{c} \;=\; \sum_{i} \alpha^{(l)}_{i}\, v^{(l)}_{i,c},$$

where $\alpha^{(l)}_{i}$ are normalized attention weights (optionally "loosened" to $[0,1]$ to temper softmax peaking) and $v^{(l)}_{i,c}$ is the $c$-th component of the value vector for token $i$.
Aggregating over layers yields attribution heatmaps tracing input-token influence on each output at every generation step.
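The aggregation over channels, tokens, and layers can be sketched as follows, with shapes and symbol names assumed for illustration: each input token's score is its attention-weighted, channel-weighted share of the target logit, summed across layers.

```python
import numpy as np

# Sketch of per-token relevance aggregation (toy shapes):
# score_i = sum_l sum_c w_c^(l) * alpha_i^(l) * v_{i,c}^(l),
# i.e. each token's attention-weighted share of every layer's partial logit.
rng = np.random.default_rng(2)
L, n, d = 3, 5, 8                           # layers, context tokens, hidden size
w = rng.normal(size=(L, d))                 # channel weights dz/da_c^(l)
alpha = rng.random(size=(L, n))
alpha /= alpha.sum(axis=1, keepdims=True)   # normalized attention weights
v = rng.normal(size=(L, n, d))              # value vectors per layer and token

token_scores = np.einsum("lc,li,lic->i", w, alpha, v)   # heatmap over tokens

# Consistency: token scores sum to the channel-weighted block outputs.
a = np.einsum("li,lic->lc", alpha, v)       # reconstructed a^(l)_c
assert np.isclose(token_scores.sum(), np.einsum("lc,lc->", w, a))
```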
3. Faithfulness Metrics: π-Soft-NS and π-Soft-NC
Classical faithfulness evaluation in attribution employs "hard" token deletion/insertion, measuring impact on model outputs, but is vulnerable to metric inflation if retention probabilities differ across methods. Grad-ELLM generalizes "soft" perturbation metrics to ensure fair, distribution-controlled comparisons by introducing the π-Soft-NS ("sufficiency") and π-Soft-NC ("comprehensiveness") metrics.
- Soft Masking: Each input token is zeroed independently with a probability derived from its attribution score; the score is raised to an exponent chosen so that the expected proportion of retained tokens equals the target $\pi$.
- Evaluation: Faithfulness is measured by the normalized Hellinger distance between output distributions: a sweep across $\pi$ generates π-Soft curves, summarized via area-under-curve (AUC) for quantitative assessment.
This approach eliminates spurious performance variance due to differing mean retention rates, enabling rigorous, apples-to-apples method comparison (Huang et al., 6 Jan 2026).
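The π-controlled soft-masking step can be sketched as follows; the function names, the bisection calibration, and the use of keep-probabilities $s^\gamma$ are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

# Sketch of pi-controlled soft masking: attribution scores s in (0,1] become
# keep-probabilities s**gamma, with gamma chosen by bisection so the mean
# keep-probability equals the target pi. Faithfulness is then read off the
# Hellinger distance between the original and masked output distributions.
def calibrate_gamma(s, pi, lo=1e-3, hi=1e3, iters=80):
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.mean(s ** mid) > pi:   # larger exponent lowers mean retention
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def hellinger(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

s = np.array([0.9, 0.4, 0.7, 0.2, 0.6])      # per-token attribution scores
pi = 0.5
gamma = calibrate_gamma(s, pi)
keep_prob = s ** gamma                       # token i retained w.p. keep_prob[i]
assert abs(keep_prob.mean() - pi) < 1e-6     # expected retention matches pi

p = np.array([0.7, 0.2, 0.1])                # original next-token distribution
q = np.array([0.4, 0.4, 0.2])                # distribution under soft masking
assert 0.0 <= hellinger(p, q) <= 1.0         # normalized distance in [0, 1]
```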
4. Experimental Evaluation and Results
Experiments assessed Grad-ELLM on sentiment classification (IMDb, SST2), yes/no question answering (BoolQ), and open-generation tasks (TellMeWhy, WikiBio), utilizing LLaMA-7B and Mistral-7B as backbone models.
- Baselines: Raw attention, vanilla saliency, InputGrad, Integrated Gradients, DeepLIFT, Layer-GradCAM, Value Zeroing, and Random.
- Metrics: π-Soft-NS and π-Soft-NC AUCs, classical insertion/deletion, and qualitative heatmaps.
Quantitative Highlights
| Model | π-Soft-NS AUC | π-Soft-NC AUC | Best Baseline π-Soft-NS / NC |
|---|---|---|---|
| LLaMA-7B | 0.401 | 1.115 | Random (0.379/1.078); DeepLIFT/Saliency (0.339), InputGrad (0.974) |
| Mistral-7B | 0.383 | 0.491 | Random (0.548) |
- Grad-ELLM leads on π-Soft-NS and π-Soft-NC (LLaMA), with notable qualitative clarity in attribution heatmaps ("very little positive" review in IMDb, "10/10 :)" in SST2).
- On Mistral-7B, Random dominates π-Soft-NC, suggesting that grouped-query/sliding-window attention may induce uniformity in token dependencies (Huang et al., 6 Jan 2026).
Classical insertion/deletion metrics are more sensitive to the density of attribution maps, with Grad-ELLM ranking behind sparse baselines under these tests.
5. Computational Complexity and Applicability
Grad-ELLM requires one backward pass per generation step to compute the channel weights $w^{(l)}$, alongside extraction of the attention maps already produced during the forward pass. The resulting per-token cost is a constant number of model passes, significantly lower than the $O(n)$ forward passes of perturbation-based schemes. However, the method presupposes white-box access to gradients and attention maps: it is readily implemented on open-source transformer models, but not directly applicable to closed-source, black-box, or API-limited deployments where these internals are restricted.
6. Limitations and Future Directions
Grad-ELLM’s performance degrades in scenarios demanding extremely sparse attribution (e.g., top-$k$ selection), as its denser "loosened" heatmaps may underperform on traditional insertion/deletion metrics. Remediation via thresholding or top-$k$ filtering can selectively increase map sparsity. The framework exclusively addresses faithfulness in the causal sense — effect on model outputs — and does not assess human-centric plausibility or interpretability.
Future extensions proposed include adaptation to instruction-tuned and multimodal decoder-only architectures, integration with global causal-tracing techniques, and optimization of the attention-loosening transform to balance interpretive sparsity and noise robustness.
7. Impact and Broader Significance
Grad-ELLM demonstrates that incorporating both the channel-wise gradient sensitivity and spatial attention structure of decoder-only transformers enables attribution methods that are more aligned with the model's underlying computation, outperforming both naïvely attention-based and generic model-agnostic baselines in terms of faithfulness. Its methodology refines evaluation practice for interpretability research through equitable, parameter-controlled faithfulness metrics, and establishes a new basis for explanatory techniques tailored to advanced autoregressive LLMs. As the landscape of foundation models evolves, methods like Grad-ELLM will inform future directions in mechanistic interpretability and diagnostic toolkits for black-box generative models (Huang et al., 6 Jan 2026).