Saliency Metric Taxonomy Overview

Updated 4 January 2026
  • Saliency Metric Taxonomy is a structured framework that categorizes evaluation metrics by methodological principles, reliability regimes, application scenarios, and evolving desiderata.
  • It integrates perturbation-based, calibration, and human-aligned metrics to assess the faithfulness, sensitivity, and logical consistency of feature-attribution maps.
  • The framework guides the selection of complementary metrics to overcome single-method limitations and to provide robust, actionable insights for explainable AI.

Saliency metrics provide quantitative frameworks for evaluating the explanatory value of feature-attribution maps generated by deep neural networks, especially in image classification. The diversity of tasks, user intents, and model architectures has driven the emergence of a wide spectrum of metrics, each with unique operational definitions, mathematical criteria, and evaluation protocols. Recent research both consolidates foundational taxonomies and proposes new metrics to address faithfulness, robustness, calibration, human alignment, and cognitive relevance across contexts. The taxonomy presented below integrates these developments, organizing the landscape by methodological principle, reliability regime, application scenario, and evolving desiderata.

1. Foundational Metric Families and Taxonomies

Saliency metrics are broadly classified by the nature of explanation task and the evaluation signal employed. Standard taxonomies distinguish metrics as follows:

  • Location-based (fixation-query, classifier-style):
    • AUC, shuffled AUC (sAUC), FN-AUC
    • Treat the saliency map as a binary classifier of fixated vs. non-fixated locations; shuffled and false-negative-aware variants correct for center and periphery bias (Jia et al., 2020).
  • Value-based (normalized map):
    • Normalized Scanpath Saliency (NSS), Weighted NSS (WNSS), Shuffled NSS (sNSS/sWNSS)
    • Quantify mean or locally weighted saliency at fixations, correcting for dataset (center) biases (Gide et al., 2017, Kalash et al., 2018).
  • Distribution-based (map-to-map):
    • Pearson Correlation Coefficient (CC), Histogram Intersection (SIM), Kullback–Leibler Divergence (KL), Earth Mover’s Distance (EMD), Mean Absolute Error (MAE)
    • Measure pixelwise similarity, linear agreement, information loss, and spatial/work distances between predicted and ground-truth maps (Bylinskii et al., 2016, Gide et al., 2017).
  • Overlap/Segmentation metrics:
    • Intersection over Union (IoU), Pointing Game (PG)
    • Assess spatial overlap between thresholded saliency regions and ground-truth object masks or annotations (Boggust et al., 2022).
  • Ranking-based and Retrieval metrics:
    • Spearman’s ρ, Kendall’s τ, Mean Absolute Rank Error (MARE), normalized DCG (nDCG), Precision@k
    • Evaluate the ordering of detected objects by saliency, supporting multi-object and relative ranking tasks (Kalash et al., 2018).
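As a concrete illustration, the value-based and distribution-based families above can be sketched in a few lines of NumPy. This is a minimal sketch of the standard definitions (z-scored saliency at fixations for NSS; Pearson correlation of standardized maps for CC), not a benchmark-grade implementation:

```python
import numpy as np

def nss(saliency: np.ndarray, fixations: np.ndarray) -> float:
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels.
    `fixations` is a boolean mask with the same shape as `saliency`."""
    z = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    return float(z[fixations].mean())

def cc(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pearson Correlation Coefficient between two saliency maps."""
    p = (pred - pred.mean()) / (pred.std() + 1e-12)
    g = (gt - gt.mean()) / (gt.std() + 1e-12)
    return float((p * g).mean())
```

A map scored against itself yields CC = 1, and fixations concentrated on high-saliency pixels yield a positive NSS, matching the interpretations above.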

Metrics vary in their sensitivity to false positives/negatives, treatment of spatial or dataset bias, reliance on preprocessing and normalization, and interpretability in specific application domains. The consensus is that no single metric suffices; complementarity is required for faithful, robust assessment (Bylinskii et al., 2016, Boggust et al., 2022).

2. Perturbation-Based, Faithfulness, Sensitivity, and Consistency Metrics

Perturbation-based metrics interrogate the causal and functional impact of saliency assignments by modifying the input or model and quantifying resulting output changes. Key frameworks and metrics include:

  • AOPC (Area Over Perturbation Curve) MoRF/LeRF: Tracks output collapse upon sequential removal of most/least relevant pixels, with orderings encoding different notions of necessity/sufficiency (Tomsett et al., 2019). Implementation details (mean-value/random-RGB perturbations) and empirical variances are critical for reliability.
  • Deletion/Insertion AUC (DAUC/IAUC/iAUC): Measures how rapidly the model's score for a target class drops (or recovers) upon masking (or inserting) pixels in saliency rank order (Gomez et al., 2022, Li et al., 2020, Boggust et al., 2022). These metrics are limited by exclusive reliance on ranking (not raw values), and can be confounded by OOD effects of masked/blurred samples.
  • Completeness, Soundness, and Intrinsic Metrics: Completeness (can a mask certify the true label?) and soundness (is it impossible to certify a false label?) are quantified via normalized ratios of post-modification scores to originals, tested for worst-case labels (Gupta et al., 2022). These offer robust defenses against adversarial mask artifacts and support rigorous prooflike saliency evaluation.
  • Consistency and Sensitivity (COSE):
    • Consistency captures invariance/equivariance: saliency maps must be stable under data augmentations that do not alter model outputs.
    • Sensitivity reflects fidelity: saliency must change when model predictions change (due to input or model updates).
    • COSE combines both via harmonic mean, exposing architectural dependencies (transformers > ConvNets in stability), method zones (CAM-based vs. path-based vs. noise-averaged), and trade-offs needed for faithful explanations (Daroya et al., 2023).
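The deletion variant of these perturbation curves can be sketched as follows. The `model` callable (returning a scalar target-class score), the step schedule, and the use of a simple mean over the curve as a discrete AUC approximation are illustrative assumptions, not the reference implementation of any cited metric:

```python
import numpy as np

def deletion_auc(model, image, saliency, steps=20, fill=None):
    """Deletion-curve sketch: mask pixels in decreasing saliency order,
    track the model's target-class score, and return the mean of the
    curve as a discrete AUC approximation (lower = more faithful).
    `fill` is the replacement value (mean-value perturbation by default)."""
    if fill is None:
        fill = image.mean()
    order = np.argsort(saliency.ravel())[::-1]  # most relevant first
    masked = image.copy()
    scores = [model(masked)]
    chunk = max(1, order.size // steps)
    for start in range(0, order.size, chunk):
        idx = np.unravel_index(order[start:start + chunk], image.shape)
        masked[idx] = fill
        scores.append(model(masked))
    return float(np.mean(scores))
```

Swapping the sort order (LeRF) or inserting rather than deleting pixels recovers the other curve variants; as the text notes, the choice of fill value can push masked inputs out of distribution and confound the score.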

Taxonomic Table: Perturbation-Based Metric Highlights

| Metric | Principle | Measurement Axis |
|---|---|---|
| AOPC MoRF/LeRF | Faithfulness | Output collapse curve |
| DAUC/IAUC | Ranking faithfulness | Score drop/recovery |
| Completeness/Soundness (Gupta et al., 2022) | Intrinsic proof | Output bounds |
| COSE (Daroya et al., 2023) | Consistency/Sensitivity | Harmonic mean |
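The COSE combination reduces to a standard harmonic mean of the two component scores; a minimal sketch, assuming both consistency and sensitivity are normalized to [0, 1]:

```python
def cose(consistency: float, sensitivity: float) -> float:
    """Harmonic mean of consistency and sensitivity, as in the COSE
    combination described above. Returns 0 when both scores are 0."""
    if consistency + sensitivity == 0:
        return 0.0
    return 2 * consistency * sensitivity / (consistency + sensitivity)
```

The harmonic mean penalizes imbalance: a method that is perfectly stable but insensitive (or vice versa) scores near zero, which is exactly the trade-off the metric is designed to expose.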

3. Reliability, Calibration, and Composite Metrics

Reliability and calibration metrics evaluate statistical stability and the alignment of saliency values with functional impacts:

  • Faithfulness (Pearson correlation): Correlates per-pixel saliency values with empirically measured output drops after perturbation (Tomsett et al., 2019). Exhibits low inter-rater and inter-method reliability.
  • Sparsity and Sharpening: Measures (normalized top value/mean) quantify how focused a saliency map is; useful as a supplement to ranking-only faithfulness metrics (Gomez et al., 2022, Boggust et al., 2022).
  • Calibration (Deletion/Insertion Correlation): Correlates saliency magnitude with actual output influence, exposing lack of direct alignment and low calibration in current methods (Gomez et al., 2022).
  • Classification-Confusion Metrics (Precision, Recall, Specificity, FPR/FNR, Accuracy, F1): Applied to mosaicked images, these yield complete confusion matrices and facilitate psychometric reliability testing via Krippendorff’s α and Spearman’s ρ, providing inter-rater and inter-method reliability analysis (Fresz et al., 2024).
  • Channel-Pruning Saliency Taxonomy: Channel-pruning saliency metrics decompose along four axes (domain, pointwise function, reduction, and scaling), supporting systematic exploration of the design space and the construction of novel metrics (Persand et al., 2019).
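The faithfulness-correlation and sparsity measures above admit compact sketches. The single-pixel perturbation protocol and the `model` callable here are illustrative assumptions, not the exact procedures of the cited papers:

```python
import numpy as np

def faithfulness_corr(model, image, saliency, n_samples=50, fill=0.0, seed=0):
    """Pearson correlation between per-pixel saliency values and the output
    drop observed when each sampled pixel is replaced by `fill`."""
    rng = np.random.default_rng(seed)
    base = model(image)
    flat = saliency.ravel()
    picks = rng.choice(flat.size, size=min(n_samples, flat.size), replace=False)
    drops, vals = [], []
    for p in picks:
        perturbed = image.copy().ravel()
        perturbed[p] = fill
        drops.append(base - model(perturbed.reshape(image.shape)))
        vals.append(flat[p])
    return float(np.corrcoef(vals, drops)[0, 1])

def sparsity(saliency: np.ndarray) -> float:
    """Normalized top value over mean: higher = more focused map."""
    return float(saliency.max() / (saliency.mean() + 1e-12))
```

A perfectly calibrated map (saliency proportional to per-pixel output impact) yields a correlation of 1; the text above notes that current methods typically fall well short of this.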

4. Reference-Frame & Human-Aligned Taxonomies

Recent work has articulated new conceptual taxonomies to match explanation scenarios to user intent:

  • Reference-Frame × Granularity (RF×G):
    • Reference-frame: pointwise ("Why this?") vs. contrastive ("Why A not B?")
    • Granularity: class-level (fine) vs. group-level (coarse, e.g. WordNet synsets)
    • Quadrant-specific metrics (CCS, CGC, PGS, CGS) are constructed for each explanation type, using structured masking and AUC aggregation over perturbed inputs (Elisha et al., 17 Nov 2025).
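A hypothetical sketch of one such quadrant metric: a contrastive ("Why A not B?") score that masks pixels in saliency order and aggregates the A-vs-B score gap over the perturbation curve. The `model(image, c)` signature and the mean-over-curve aggregation are assumptions for illustration; the cited work defines the exact structured masking and AUC aggregation:

```python
import numpy as np

def contrastive_auc(model, image, saliency, class_a, class_b,
                    steps=10, fill=0.0):
    """Illustrative contrastive score: mask pixels in decreasing saliency
    order and average the score gap between class A and contrast class B
    over the perturbation curve. `model(image, c)` is assumed to return
    a scalar score for class c."""
    order = np.argsort(saliency.ravel())[::-1]
    masked = image.copy()
    gaps = [model(masked, class_a) - model(masked, class_b)]
    chunk = max(1, order.size // steps)
    for start in range(0, order.size, chunk):
        idx = np.unravel_index(order[start:start + chunk], image.shape)
        masked[idx] = fill
        gaps.append(model(masked, class_a) - model(masked, class_b))
    return float(np.mean(gaps))
```

A saliency map that correctly localizes class-A-vs-B evidence collapses the gap quickly under this masking, which the averaged curve rewards.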

Table: RF×G Taxonomy and Associated Metrics

| Reference Frame | Class-level | Group-level |
|---|---|---|
| Pointwise | | PGS (Pointwise Group Score) |
| Contrastive | CCS | CGS (Contrastive Group Score) |

Empirical findings indicate higher metric scores for group-level (coarse) explanations; IIA (Iterated Integrated Attributions) consistently outperforms Grad-CAM, Score-CAM, and Integrated Gradients across all RF×G metrics. Practitioners are advised to select metrics that match the granularity of the user's question and to avoid model-centric, single-axis faithfulness assessments (Elisha et al., 17 Nov 2025).

5. Specialized Metrics: Logical Consistency, Label Sensitivity, Perceptual Alignment

  • Logical Order-encoding: On controlled logic datasets, metrics such as Needed Information below Baseline (NIB), Logical Accuracy, Logical Statistical Accuracy, Full/Minimal Double Class Assignment (DCA) are precisely defined to detect failures in monotonicity, reason-preservation, and unintended encoding of class-discriminative information into mask patterns (Schwenke et al., 2024). These metrics enforce strong constraints—no relevant feature should score below baseline; masking should preserve the logical function; retrained models should not gain extra discriminatory power from saliency-induced rankings.
  • Label Sensitivity/Model Consistency: Data randomization and model contrast scorings provide tests of attribution method sensitivity to label changes and functional transformations in the network (Boggust et al., 2022, Li et al., 2020).
  • Perceptibility & Human Alignment: Localization metrics (IoU, PG, plausibility ratings), minimality/sparsity measures, and mean IoU aggregate human-subjective resemblance, noise-level, and visual focus (Gide et al., 2017, Boggust et al., 2022).
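A simplified sketch of one such constraint, the "no relevant feature below baseline" check underlying NIB. The boolean relevance mask and scalar baseline representation here are assumptions for illustration; Schwenke et al. (2024) define the metrics precisely on controlled logic datasets:

```python
import numpy as np

def features_below_baseline(saliency: np.ndarray,
                            relevant_mask: np.ndarray,
                            baseline: float) -> int:
    """Count ground-truth-relevant features whose saliency falls below the
    baseline score, i.e. violations of the 'needed information' constraint
    described above. Zero means the constraint is satisfied."""
    return int(np.count_nonzero(saliency[relevant_mask] < baseline))
```

On a controlled testbed where the relevant features are known by construction, any nonzero count flags a monotonicity failure of the attribution method.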

6. Metric Selection, Evaluation, and Future Directions

Metric selection is dictated by application requirements, the cognitive/semantic goals of the explanation, and the underlying model/data regime.

7. Summary Table: Major Metric Families and Characteristics

| Family/Metric | Measures | Key Trade-offs/Features |
|---|---|---|
| AUC, sAUC, FN-AUC | Detection/ranking; bias correction | Sensitive to center/periphery bias; selection of negatives (Jia et al., 2020) |
| NSS, WNSS, sNSS | Value at fixations, density | Center-bias correction; local object weighting (Gide et al., 2017) |
| CC, SIM, KL, EMD | Map similarity/dissimilarity | Linear/statistical; spatial; information-theoretic (Bylinskii et al., 2016) |
| Insertion/Deletion | Faithfulness (sufficiency/necessity) | Impact of top-k/least-k regions; OOD artifacts (Li et al., 2020, Gomez et al., 2022) |
| Completeness/Soundness | Logical bounds/proof-like reasoning | Intrinsic mask evaluation; prevents adversarial map artifacts (Gupta et al., 2022) |
| RF×G Metrics | User-intent-aligned faithfulness | Task-specific quadrant explanations (Elisha et al., 17 Nov 2025) |
| Classification Metrics | Reliability/confusion analysis | Full confusion matrix; psychometric reliability analysis (Fresz et al., 2024) |
| Channel-Pruning | Structural parameter saliency | Multi-axis design; direct impact on model sparsity (Persand et al., 2019) |
| Logical Consistency | Monotonicity, order-encoding | Controlled ground-truth testbeds (Schwenke et al., 2024) |

Saliency metric taxonomy thus encompasses a hierarchy of methodologies—from perturbation-driven faithfulness scores to cognitive-aligned, reference-frame-aware evaluation frameworks—each exposing distinct facets of explanation reliability, interpretability, and practical utility. The imperative in current research is to triangulate between these axes, combining statistical rigor, model-behavior alignment, and user-centric semantics for robust, actionable interpretability.
