Model-Dependent Variations in Grad-CAM
- The paper demonstrates that Grad-CAM's localization accuracy, faithfulness, and consistency vary significantly across different model architectures and training setups.
- It quantitatively evaluates metrics such as IoU, MOD, VID, and Spearman’s ρ to gauge the reliability of explanations under various conditions.
- The study recommends algorithmic refinements and architecture-specific strategies to improve Grad-CAM interpretability, especially in high-stakes applications.
Gradient-weighted Class Activation Mapping (Grad-CAM) is widely used for post hoc interpretability in deep neural networks, producing spatial or token-level saliency maps that purport to reveal the evidence for a model’s prediction. However, Grad-CAM explanations are highly model-dependent: the localization quality, faithfulness, stability, and semantic alignment of the resulting heatmaps can vary significantly across architectures, datasets, training regimes, and tasks. Understanding these variations—and their implications for scientific, medical, or high-stakes applications—requires a rigorous and quantitative approach.
1. Grad-CAM Methodology and Formalism
Grad-CAM explanations are constructed by backpropagating class-specific gradients from the output to a chosen convolutional (or analogous) layer. For an input $x$, a target class $c$ with pre-softmax score $y^c$, and feature maps $A^k$ of the chosen layer, Grad-CAM computes:
- Channel weights: $\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}$, with $Z$ the number of spatial positions
- Localization map: $L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_k \alpha_k^c A^k\right)$, upsampled to input resolution
This formalism is adapted to various domains (e.g. 1D for text (Gorski et al., 2020), 3D for microscopy (Gopalakrishnan et al., 2024)), attention maps in transformers, or even patch-level activations. Quantitative metrics for saliency evaluation—such as overlap with ground-truth masks, faithfulness to causal perturbations, and consistency across runs—are essential for comparing explanations between models (Panboonyuen, 19 Jan 2026).
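The channel-weighting and ReLU-combination steps above can be sketched in a few lines of NumPy. This is a minimal illustration operating on precomputed activations and gradients (in practice captured via framework forward/backward hooks on the target layer); upsampling to input resolution is omitted:

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Combine feature maps A^k (K, H, W) with gradients dy^c/dA^k (K, H, W)
    into a Grad-CAM localization map of shape (H, W)."""
    # alpha_k^c: global-average-pooled gradients, one weight per channel
    alphas = gradients.mean(axis=(1, 2))                      # shape (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only
    cam = np.maximum((alphas[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalize to [0, 1] for visualization (guard against all-zero maps)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The gradient pooling implements the $\frac{1}{Z}\sum_{i,j}$ term, and the final `np.maximum` is the ReLU in the localization-map formula.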
2. Empirical Evidence of Model-Dependent Variation
Comprehensive multi-architecture experiments consistently demonstrate that Grad-CAM’s quality depends strongly on both model design and training pipeline.
Vision
In lung cancer CT analysis across five state-of-the-art models—ResNet-50, ResNet-101, DenseNet-161, EfficientNet-B0, and ViT-Base—Grad-CAM localization accuracy (IoU with tumor masks) ranged from 0.48 (ViT-Base) to 0.70 (ResNet-101), with corresponding drops in faithfulness and consistency for the weaker performers. The variability between CNN and ViT families was statistically significant (Panboonyuen, 19 Jan 2026). Certain architectures (e.g., transformers with nonlocal self-attention) yield diffuse, unstable, or anatomically irrelevant heatmaps.
In legal-text CNNs, Grad-CAM explanations shift from sharply localized (static word embeddings) to increasingly diffuse (contextualized BERT/DistilBERT encoders), reflecting representational spread (Gorski et al., 2020). Intersection-over-union between models’ heatmaps confirms substantial intra-family agreement and lower cross-family consistency.
Adversarial Perturbation
Model-dependent resilience is observed under adversarial attacks. Metrics such as Mean Observed Dissimilarity (MOD) and Variation in Dissimilarity (VID) exhibit lower values for deep sequential networks (e.g. VGG16: MOD=0.13, VID=0.0021) and higher values for “wider” or branch-parallel architectures (e.g. XceptionNet: MOD=0.28, VID=0.0134), revealing differential stability in attention drift under perturbation (Chakraborty et al., 2022).
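Assuming MOD is the mean per-sample dissimilarity between clean and adversarial heatmaps and VID its variance, the two statistics can be sketched as follows. A normalized L1 distance stands in here for the SSIM-based dissimilarity used in the cited work:

```python
import numpy as np

def heatmap_dissimilarity(h_clean: np.ndarray, h_adv: np.ndarray) -> float:
    """Per-sample dissimilarity between clean and adversarial heatmaps.
    Normalized L1 distance, standing in for an SSIM-based measure."""
    return float(np.abs(h_clean - h_adv).mean())

def mod_vid(clean_maps, adv_maps):
    """MOD = mean dissimilarity over samples; VID = its variance
    (assumed definitions, sketched for illustration)."""
    d = np.array([heatmap_dissimilarity(c, a)
                  for c, a in zip(clean_maps, adv_maps)])
    return d.mean(), d.var()
```

A model whose explanations barely move under attack yields low MOD; a large VID flags erratic, sample-dependent attention drift.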
Human Alignment
Spearman’s rank correlation (ρ) between Grad-CAM and human ClickMe maps varies widely depending on the model backbone. For instance, ConvNeXt-Base achieves a comparatively high ρ, whereas CAFormer-S18 and ConvFormer-S18 yield markedly lower values under standard Grad-CAM (Chamas et al., 2024). Such low values indicate that specific architectural choices can render explanations essentially uninformative.
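Spearman’s ρ between a saliency map and a human attention map is the Pearson correlation of the two maps’ flattened ranks. A minimal sketch (no tie handling; library routines such as `scipy.stats.spearmanr` assign average ranks to ties):

```python
import numpy as np

def spearman_rho(saliency: np.ndarray, clickme: np.ndarray) -> float:
    """Spearman rank correlation between a model saliency map and a
    human ClickMe-style attention map, both flattened to 1-D."""
    def ranks(v):
        order = np.argsort(v)
        r = np.empty_like(order, dtype=float)
        r[order] = np.arange(len(v))       # rank of each element
        return r
    a, b = ranks(saliency.ravel()), ranks(clickme.ravel())
    a -= a.mean(); b -= b.mean()           # Pearson correlation of the ranks
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))
```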
3. Architectural and Training Factors Shaping Interpretability
Explanatory variation arises from multiple model-related axes:
- Spatial inductive bias: CNNs with strong spatial priors yield coherent, focused Grad-CAM maps; transformers with self-attention produce more global, less spatially meaningful explanations (Panboonyuen, 19 Jan 2026).
- Residual and depthwise-separable structures: Residual connections in ResNets confer moderate stability to Grad-CAM focus under perturbation; deeply branched modules (Inception, Xception) are more vulnerable to erratic attention shift (Chakraborty et al., 2022).
- Activation functions and computation pipeline: The use of non-standard activations (e.g. StarReLU in MetaFormers) can disrupt the sign structure of activation maps, leading to nearly random or blank Grad-CAM maps unless the method is algorithmically revised (“Threshold-Grad-CAM”) (Chamas et al., 2024).
- Contextual encoding in NLP: Simple static embedding channels yield sharp, token-level explanations. BERT-style deep contextualization increases attention spread, decreasing heatmap interpretability at higher thresholds, although improving classification accuracy (Gorski et al., 2020).
- Domain adaptation: Poor domain adaptation (e.g., fine-tuning large models on under-sized corpora) impairs Grad-CAM localization stability and interpretability (Gorski et al., 2020).
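The channel-wise-ReLU revision mentioned above (“Threshold-Grad-CAM”) can be sketched as follows. The exact thresholding scheme is an assumption here—the map is cut off at a fraction `tau` of its maximum—but the key change from standard Grad-CAM is that ReLU is applied per weighted channel map before summation, so negatively weighted channels can no longer cancel positive evidence:

```python
import numpy as np

def threshold_grad_cam(activations: np.ndarray, gradients: np.ndarray,
                       tau: float = 0.1) -> np.ndarray:
    """Threshold-Grad-CAM sketch: channel-wise ReLU before summation,
    then a threshold on the normalized output (tau is an assumed
    fraction of the map's maximum)."""
    alphas = gradients.mean(axis=(1, 2))                       # shape (K,)
    # ReLU per weighted channel, *before* summing across channels
    per_channel = np.maximum(alphas[:, None, None] * activations, 0.0)
    cam = per_channel.sum(axis=0)
    if cam.max() > 0:
        cam = cam / cam.max()
    cam[cam < tau] = 0.0                                       # suppress weak regions
    return cam
```

On the same inputs, standard Grad-CAM (sum first, then ReLU) can return an all-zero map when negative channel contributions dominate—exactly the failure mode reported for StarReLU-based MetaFormers.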
4. Metrics for Quantifying Model-Dependent Grad-CAM Variation
Objective comparison of Grad-CAM interpretability across models requires rigorous, domain-appropriate quantification.
| Metric | Definition | Usage |
|---|---|---|
| Grad-CAMO | Focus overlap in cell images (Gopalakrishnan et al., 2024) | Single-cell morphological profiling |
| Localization Accuracy | IoU of heatmap with lesion/cell/region masks | Alignment with ground-truth annotations |
| Perturbation Faithfulness | Drop in confidence when salient regions are occluded | Attribution causal testing |
| Explanation Consistency | Mean pairwise IoU over retrains | Stability under re-initialization |
| Fraction Above Threshold | Share of tokens whose saliency exceeds a threshold | Explanation sharpness in text |
| Intersection-over-Union (IoU) | Jaccard index of binary-thresholded explanations | Inter-model explanation overlap |
| NISSIM/MOD/VID | Dissimilarity metrics for adversarial focus drift | Robustness auditing (Chakraborty et al., 2022) |
| Content-Heatmap (CH) | Normalized heatmap mass in annotated regions | Human alignment, fine-grained (Pillai et al., 2021) |
| Spearman’s ρ (ClickMe) | Rank correlation with human attention maps | Human interpretability alignment (Chamas et al., 2024) |
These metrics reveal that models can achieve high task accuracy while producing heatmaps unrelated to true evidence (spurious localization, off-target focus, background exploitation, etc.) (Gopalakrishnan et al., 2024, Panboonyuen, 19 Jan 2026).
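As an illustration of the perturbation-faithfulness entry above, a minimal occlusion test can be sketched as follows. `predict` is a hypothetical callable mapping an image to a scalar class confidence; in practice it would wrap a trained model’s softmax output for the target class:

```python
import numpy as np

def perturbation_faithfulness(predict, image: np.ndarray, cam: np.ndarray,
                              top_frac: float = 0.2,
                              baseline: float = 0.0) -> float:
    """Occlude the top-`top_frac` most salient pixels (per the Grad-CAM map)
    and return the drop in class confidence. Larger drop => the map points
    at evidence the model actually uses."""
    k = max(1, int(top_frac * cam.size))
    thresh = np.sort(cam.ravel())[-k]       # saliency of the k-th top pixel
    occluded = image.copy()
    occluded[cam >= thresh] = baseline      # mask out the salient evidence
    return predict(image) - predict(occluded)
```

A heatmap with high task-irrelevant (e.g. background) mass scores near zero here even if it looks plausible—precisely the accuracy-vs-explanation gap the table’s metrics are meant to expose.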
5. Remediation: Architectural, Algorithmic, and Training Remedies
Architectural and methodological advances can substantially improve Grad-CAM interpretability:
- Learnable receptive fields (DCLS): Replacing fixed-grid convolutions with Dilated Convolution with Learnable Spacing increases alignment of Grad-CAM with human strategies in seven of eight models tested, as measured by Spearman’s ρ, with the largest gains on ConvNeXt variants (Chamas et al., 2024).
- Algorithmic refinements: Threshold-Grad-CAM, which applies ReLU channel-wise before summing and then thresholds the output, recovers meaningful heatmaps from architectures where standard Grad-CAM fails (CAFormer-S18’s ρ rises from 0.17 to 0.56, and ConvFormer-S18’s to 0.71) (Chamas et al., 2024).
- Contrastive Grad-CAM Consistency (CGC) training: Adding a loss term to encourage invariance of explanations under spatial transforms increases overlap with human/ground-truth masks and regularizes feature learning, raising content-heatmap metrics (54.8% → 71.8% on ImageNet with ResNet-50) and insertion AUC (Pillai et al., 2021).
- Regularization via explanation-based loss: Penalizing "off-target" mass (e.g. low Grad-CAMO) during training directs attention to relevant regions (Gopalakrishnan et al., 2024).
- Model selection via explanation metrics: Quantitative comparison of candidate backbones or training regimes using Grad-CAMO, consistency, and faithfulness can guide selection towards more trustworthy feature extractors (Gopalakrishnan et al., 2024, Panboonyuen, 19 Jan 2026).
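The explanation-consistency idea behind CGC training can be illustrated with a simplified penalty. The published method uses a contrastive objective over a batch; the sketch below reduces it to an L2 disagreement between the explanation of a transformed input and the transformed explanation of the original input (`cam_fn`, `transform`, and `inverse_transform` are placeholder callables):

```python
import numpy as np

def cgc_consistency_loss(cam_fn, image: np.ndarray,
                         transform, inverse_transform) -> float:
    """Penalize disagreement between cam_fn(T(x)) and T(cam_fn(x)):
    an explainer equivariant to the spatial transform incurs zero loss."""
    cam_orig = cam_fn(image)                # explanation of the original input
    cam_aug = cam_fn(transform(image))      # explanation of the transformed input
    # Map the augmented explanation back to the original frame and compare
    return float(np.mean((inverse_transform(cam_aug) - cam_orig) ** 2))
```

Added to the task loss with a small weight, a term of this shape pushes the network toward features whose Grad-CAM maps move coherently with the image content rather than with incidental position.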
6. Practical Implications and Recommendations
Empirical findings make clear that Grad-CAM’s interpretability should be considered an architecture-dependent property, not a universal diagnostic. Practitioners deploying post hoc explanations should:
- Always audit explanations with quantitative criteria (overlap, faithfulness, consistency, human alignment) before model deployment (Gopalakrishnan et al., 2024, Panboonyuen, 19 Jan 2026, Chamas et al., 2024).
- Compare explanations across several architectures—differences are often substantial and may change local clinical or scientific conclusions (Panboonyuen, 19 Jan 2026).
- For hybrid transformers or models with unconventional nonlinearities, verify generated heatmaps and, if necessary, swap to alternative or modified Grad-CAM implementations (Threshold-Grad-CAM, attention-based visualization).
- In modality-specific contexts, prefer architectures and training regimes shown to maximize interpretive alignment (e.g., DCLS for vision, static embeddings for legal NLP when pinpoint rationales are needed).
- Treat model accuracy and explanation quality as distinct axes; high validation accuracy does not guarantee trustworthy explanations (Gopalakrishnan et al., 2024, Gorski et al., 2020).
7. Outlook: Limitations and Future Research Directions
Current evidence underscores several limitations in using Grad-CAM as a model-agnostic explainer:
- Models with non-spatial or nonlocal mechanisms (Vision Transformers, StarReLU activations) often produce failure cases for standard Grad-CAM, requiring method adaptation (Chamas et al., 2024, Panboonyuen, 19 Jan 2026).
- Saliency maps may be confounded by batch artifacts, imaging site, or background leakage, necessitating per-application calibration and filtering (Gopalakrishnan et al., 2024).
- Adversarial robustness of explanations is highly variable; per-model MOD and VID metrics should accompany any claims of interpretability (Chakraborty et al., 2022).
- Integrated Grad-CAM offers improved sensitivity in some deep CNNs, but at notable compute cost and the need for baseline selection (Sattarzadeh et al., 2021).
- Further work is required to generalize these findings to multi-modal, sequence, and complex hybrid architectures, and to standardize explanation quantification across domains.
This suggests that the utility of Grad-CAM for insight or regulatory justification must always be validated empirically within the specific architectural and training context under consideration. Where interpretability is essential, model-aware, explainability-informed design, regularization, and evaluation should become standard practice.