Saliency Metric Taxonomy Overview
- Saliency Metric Taxonomy is a structured framework that categorizes evaluation metrics by methodological principles, reliability regimes, application scenarios, and evolving desiderata.
- It integrates perturbation-based, calibration, and human-aligned metrics to assess the faithfulness, sensitivity, and logical consistency of feature-attribution maps.
- The framework guides the selection of complementary metrics to overcome single-method limitations and to provide robust, actionable insights for explainable AI.
Saliency metrics provide quantitative frameworks for evaluating the explanatory value of feature-attribution maps generated by deep neural networks, especially in image classification. The diversity of tasks, user intents, and model architectures has driven the emergence of a wide spectrum of metrics, each with unique operational definitions, mathematical criteria, and evaluation protocols. Recent research both consolidates foundational taxonomies and proposes new metrics to address faithfulness, robustness, calibration, human alignment, and cognitive relevance across contexts. The taxonomy presented below integrates these developments, organizing the landscape by methodological principle, reliability regime, application scenario, and evolving desiderata.
1. Foundational Metric Families and Taxonomies
Saliency metrics are broadly classified by the nature of the explanation task and the evaluation signal employed. Standard taxonomies distinguish the following families:
- Location-based (fixation-query, classifier-style):
- AUC-Judd, AUC-Borji, sAUC
- Evaluate whether high-saliency pixels overlap human eye fixations or annotated object regions (Bylinskii et al., 2016, Gide et al., 2017).
- Value-based (normalized map):
- Normalized Scanpath Saliency (NSS), Weighted NSS (WNSS), Shuffled NSS (sNSS/sWNSS)
- Quantify mean or locally weighted saliency at fixations, correcting for dataset (center) biases (Gide et al., 2017, Kalash et al., 2018).
- Distribution-based (map-to-map):
- Pearson Correlation Coefficient (CC), Histogram Intersection (SIM), Kullback–Leibler Divergence (KL), Earth Mover’s Distance (EMD), Mean Absolute Error (MAE)
- Measure pixelwise similarity, linear agreement, information loss, and spatial/work distances between predicted and ground-truth maps (Bylinskii et al., 2016, Gide et al., 2017).
- Overlap/Segmentation metrics:
- Intersection over Union (IoU), F-measure, Pointing Game (PG), IoSR
- Assess overlap between binary predictions and annotated masks or bounding boxes (Kalash et al., 2018, Li et al., 2020).
- Ranking-based and Retrieval metrics:
- Spearman’s ρ, Kendall’s τ, Mean Absolute Rank Error (MARE), normalized DCG (nDCG), Precision@k
- Evaluate the ordering of detected objects by saliency, supporting multi-object and relative ranking tasks (Kalash et al., 2018).
Metrics vary in their sensitivity to false positives/negatives, treatment of spatial or dataset bias, reliance on preprocessing and normalization, and interpretability in specific application domains. The consensus is that no single metric suffices; complementarity is required for faithful, robust assessment (Bylinskii et al., 2016, Boggust et al., 2022).
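Two of these families can be made concrete with a short numpy sketch. The function names and the epsilon guard are ours; NSS is the mean z-scored saliency at fixated pixels, and IoU is the standard intersection-over-union of binary masks:

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    z = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return z[fixations.astype(bool)].mean()

def iou(pred_mask, gt_mask):
    """Intersection over Union between two binary masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0
```

Note the contrast the taxonomy draws: NSS reads the map's values at sparse query points, while IoU first binarizes the map and compares regions, so the two can disagree sharply on the same prediction.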
2. Perturbation-Based, Faithfulness, Sensitivity, and Consistency Metrics
Perturbation-based metrics interrogate the causal and functional impact of saliency assignments by modifying the input or model and quantifying resulting output changes. Key frameworks and metrics include:
- AOPC (Area Over the Perturbation Curve), MoRF/LeRF: Tracks output collapse upon sequential removal of the most relevant (MoRF) or least relevant (LeRF) pixels first, with the two orderings encoding different notions of necessity and sufficiency (Tomsett et al., 2019). Implementation details (mean-value vs. random-RGB perturbations) and empirical variances are critical for reliability.
- Deletion/Insertion AUC (DAUC/IAUC): Measures how rapidly the model's score for a target class drops (or recovers) as pixels are masked (or inserted) in saliency rank order (Gomez et al., 2022, Li et al., 2020, Boggust et al., 2022). These metrics rely exclusively on the ranking of saliency values (not their magnitudes), and can be confounded by the out-of-distribution (OOD) character of masked or blurred inputs.
- Completeness, Soundness, and Intrinsic Metrics: Completeness (can a mask certify the true label?) and soundness (is it impossible to certify a false label?) are quantified via normalized ratios of post-modification scores to originals, tested for worst-case labels (Gupta et al., 2022). These offer robust defenses against adversarial mask artifacts and support rigorous, proof-like saliency evaluation.
- Consistency and Sensitivity (COSE):
- Consistency captures invariance/equivariance: saliency maps must be stable under data augmentations that do not alter model outputs.
- Sensitivity reflects fidelity: saliency must change when model predictions change (due to input or model updates).
- COSE combines both via their harmonic mean, exposing architectural dependencies (transformers are more stable than ConvNets), method families (CAM-based vs. path-based vs. noise-averaged), and the trade-offs needed for faithful explanations (Daroya et al., 2023).
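The deletion-curve idea above can be sketched in a few lines. This is a minimal illustration, not any paper's reference implementation: it assumes a single-channel image, a `model` callable returning class scores, and mean-value perturbation; the step size and the use of the mean of the sampled scores as a discrete AUC approximation are our simplifying choices:

```python
import numpy as np

def deletion_auc(model, image, saliency, target, steps=20):
    """Deletion curve: mask pixels in decreasing saliency order, record the
    model's target-class score after each chunk, and return the mean score
    over the curve (a discrete AUC approximation). Lower is better: the
    score should collapse quickly when truly salient pixels are removed."""
    h, w = saliency.shape
    order = np.argsort(saliency.ravel())[::-1]  # most salient pixels first
    x = image.copy()
    baseline = image.mean()                     # mean-value perturbation
    scores = [model(x)[target]]
    chunk = max(1, (h * w) // steps)
    for i in range(0, h * w, chunk):
        rows, cols = np.unravel_index(order[i:i + chunk], (h, w))
        x[rows, cols] = baseline
        scores.append(model(x)[target])
    return float(np.mean(scores))
```

A saliency map that ranks the truly influential pixels first yields a lower deletion AUC than the same map inverted, which is exactly the ranking-only behavior (and the OOD caveat about baseline-filled inputs) discussed above.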
Taxonomic Table: Perturbation-Based Metric Highlights
| Metric | Principle | Measurement Axis |
|---|---|---|
| AOPC MoRF/LeRF | Faithfulness | Output collapse curve |
| DAUC/IAUC | Ranking faithfulness | Score drop/recovery |
| Completeness/Soundness (Gupta et al., 2022) | Intrinsic proof | Output bounds |
| COSE (Daroya et al., 2023) | Consistency/Sensitivity | Harmonic mean |
3. Reliability, Calibration, and Composite Metrics
Reliability and calibration metrics evaluate statistical stability and the alignment of saliency values with functional impacts:
- Faithfulness (Pearson correlation): Correlates per-pixel saliency values with empirically measured output drops after perturbation (Tomsett et al., 2019). Exhibits low inter-rater and inter-method reliability.
- Sparsity and Sharpening: Sparsity measures (e.g., the ratio of the top saliency value to the mean) quantify how focused a saliency map is; useful as a supplement to ranking-only faithfulness metrics (Gomez et al., 2022, Boggust et al., 2022).
- Calibration (Deletion/Insertion Correlation): Correlates saliency magnitude with actual output influence, exposing lack of direct alignment and low calibration in current methods (Gomez et al., 2022).
- Classification-Confusion Metrics (Precision, Recall, Specificity, FPR/FNR, Accuracy, F1): Applied to mosaicked images, these yield complete confusion matrices and support psychometric reliability testing, both inter-rater and inter-method, via Krippendorff’s α and Spearman’s ρ (Fresz et al., 2024).
- Channel-Pruning Saliency Taxonomy: All channel-pruning metrics decompose along four axes (domain, pointwise function, reduction, scaling), supporting systematic exploration and novel metric construction (Persand et al., 2019).
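The faithfulness-as-correlation idea in the first bullet can be sketched directly. A minimal, brute-force version, assuming a `model` callable and single-pixel baseline perturbations (both simplifying choices of ours; practical variants perturb patches or samples of pixels):

```python
import numpy as np

def faithfulness_corr(model, image, saliency, target, baseline=0.0):
    """Pearson correlation between per-pixel saliency values and the output
    drop observed when that pixel alone is replaced by a baseline value."""
    base_score = model(image)[target]
    drops = np.empty_like(saliency, dtype=float)
    for (r, c), _ in np.ndenumerate(saliency):
        x = image.copy()
        x[r, c] = baseline
        drops[r, c] = base_score - model(x)[target]
    s, d = saliency.ravel(), drops.ravel()
    s = s - s.mean()
    d = d - d.mean()
    denom = np.sqrt((s ** 2).sum() * (d ** 2).sum())
    return float((s * d).sum() / denom) if denom > 0 else 0.0
```

Unlike deletion/insertion AUC, this score uses the raw saliency magnitudes, which is why it doubles as a calibration probe: a map can rank pixels correctly yet correlate poorly with the measured drops.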
4. Reference-Frame & Human-Aligned Taxonomies
Recent work has articulated new conceptual taxonomies to match explanation scenarios to user intent:
- Reference-Frame × Granularity (RF×G):
- Reference-frame: pointwise ("Why this?") vs. contrastive ("Why A not B?")
- Granularity: class-level (fine) vs. group-level (coarse, e.g. WordNet synsets)
- Quadrant-specific metrics (CCS, CGC, PGS, CGS) are constructed for each explanation type, using structured masking and AUC aggregation over perturbed inputs (Elisha et al., 17 Nov 2025).
Table: RF×G Taxonomy and Associated Metrics
| Reference-frame | Class-level | Group-level |
|---|---|---|
| Pointwise | CGC | PGS (Pointwise Group Score) |
| Contrastive | CCS (Contrastive Class Score) | CGS (Contrastive Group Score) |
Empirical findings indicate higher metric scores for group-level (coarse) explanations; IIA (Iterated Integrated Attributions) consistently outperforms Grad-CAM, Score-CAM, and Integrated Gradients across all RF×G metrics. Practitioners are advised to select metrics that match the granularity of the user's question and to avoid model-centric, single-axis faithfulness assessments (Elisha et al., 17 Nov 2025).
5. Specialized Metrics: Logical Consistency, Label Sensitivity, Perceptual Alignment
- Logical Order-encoding: On controlled logic datasets, metrics such as Needed Information below Baseline (NIB), Logical Accuracy, Logical Statistical Accuracy, and Full/Minimal Double Class Assignment (DCA) are precisely defined to detect failures of monotonicity and reason-preservation, as well as unintended encoding of class-discriminative information into mask patterns (Schwenke et al., 2024). These metrics enforce strong constraints: no relevant feature should score below baseline, masking should preserve the logical function, and retrained models should not gain extra discriminatory power from saliency-induced rankings.
- Label Sensitivity/Model Consistency: Data randomization and model contrast scorings provide tests of attribution method sensitivity to label changes and functional transformations in the network (Boggust et al., 2022, Li et al., 2020).
- Perceptibility & Human Alignment: Localization metrics (IoU, PG, plausibility ratings), minimality/sparsity measures, and mean IoU aggregate human-subjective resemblance, noise-level, and visual focus (Gide et al., 2017, Boggust et al., 2022).
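Of the localization metrics just listed, the Pointing Game is the simplest to state: an explanation scores a hit when its most salient pixel lies inside the annotated region, and dataset-level accuracy is hits over hits plus misses. A minimal per-image sketch (the function name is ours):

```python
import numpy as np

def pointing_game_hit(saliency, gt_mask):
    """Pointing Game: hit if the single most salient pixel falls inside the
    ground-truth region; accuracy = hits / (hits + misses) over a dataset."""
    r, c = np.unravel_index(np.argmax(saliency), saliency.shape)
    return bool(gt_mask[r, c])
```

Because only the argmax is consulted, the Pointing Game is insensitive to map sparsity and noise away from the peak, which is why the taxonomy pairs it with minimality/sparsity measures.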
6. Metric Selection, Evaluation, and Future Directions
Metric selection is dictated by application requirements, the cognitive/semantic goals of the explanation, and the underlying model/data regime.
- Use composite suites (faithfulness, stability, localization, calibration, minimality, correspondence) rather than single metrics (Boggust et al., 2022, Bylinskii et al., 2016).
- Evaluate reliability via psychometric scores (Krippendorff’s α, Spearman’s ρ); report per-image distributions rather than means alone (Tomsett et al., 2019, Fresz et al., 2024).
- For datasets or tasks with ground-truth rank/order, apply ranking metrics (Spearman, Kendall, nDCG, Precision@k) (Kalash et al., 2018).
- For medical/fairness-critical domains, favor high soundness, robust calibration, and human-alignment protocols (Gupta et al., 2022, Elisha et al., 17 Nov 2025).
- Address open issues: distribution-shift in perturbations (OOD artifacts), aggregation over multi-map architectures, label sensitivity, composite metric design, and the impact of negative attributions or order-encoding (Gomez et al., 2022, Schwenke et al., 2024).
- Extend taxonomies for new model classes (transformers, ViTs) and user-driven evaluation axes (Daroya et al., 2023, Elisha et al., 17 Nov 2025).
7. Summary Table: Major Metric Families and Characteristics
| Family/Metric | Measures | Key Trade-offs/Features |
|---|---|---|
| AUC, sAUC, FN-AUC | Detection/ranking; bias correction | Sensitive to center, periphery bias; selection of negatives (Jia et al., 2020) |
| NSS, WNSS, sNSS | Value at fixations, density | Center-bias correction; local object weighting (Gide et al., 2017) |
| CC, SIM, KL, EMD | Map similarity/dissimilarity | Linear/statistical; spatial; information-theoretic (Bylinskii et al., 2016) |
| Insertion/Deletion | Faithfulness (sufficiency/necessity) | Impact of top-k/least-k regions; OOD artifacts (Li et al., 2020, Gomez et al., 2022) |
| Completeness/Soundness | Logical bounds/prooflike reasoning | Intrinsic mask evaluation; prevents adversarial map artifacts (Gupta et al., 2022) |
| RF×G Metrics | User-intent aligned faithfulness | Task-specific quadrant explanations (Elisha et al., 17 Nov 2025) |
| Classification Metrics | Reliability/confusion analysis | Full confusion matrix, psychometric reliability analysis (Fresz et al., 2024) |
| Channel-Pruning | Structural parameter saliency | Multi-axis design, direct impact on model sparsity (Persand et al., 2019) |
| Logical Consistency | Monotonicity, order-encoding | Controlled ground-truth testbeds (Schwenke et al., 2024) |
Saliency metric taxonomy thus encompasses a hierarchy of methodologies—from perturbation-driven faithfulness scores to cognitive-aligned, reference-frame-aware evaluation frameworks—each exposing distinct facets of explanation reliability, interpretability, and practical utility. The imperative in current research is to triangulate between these axes, combining statistical rigor, model-behavior alignment, and user-centric semantics for robust, actionable interpretability.