Interpretability-Driven Metrics

Updated 29 December 2025
  • Interpretability-driven metrics are quantitative measures that assess the clarity, faithfulness, and robustness of AI model explanations for improved human interpretability.
  • They are categorized into functionally-grounded, human-grounded, and theoretic measures, providing systematic methods to audit and benchmark explanation quality.
  • Key metrics such as Information Transfer Rate, Interpretive Efficiency, and Lipschitz stability guide model selection and deployment in critical domains like healthcare and NLP.

Interpretability-driven metrics are quantitative measures designed to evaluate, compare, and ultimately improve the human interpretability of machine learning models and their explanations. Rather than relying solely on task accuracy or subjective assessments, these metrics formalize aspects such as human simulatability, cognitive simplicity, fidelity to the underlying model, agreement with expert reasoning, and the stability of explanations under perturbations. They are critical for systematically auditing, benchmarking, and selecting interpretability methods across a spectrum of use-cases, including high-stakes domains such as healthcare and language understanding.

1. Categories and Formal Properties of Interpretability-Driven Metrics

Interpretability-driven metrics can be broadly categorized as functionally-grounded, human-grounded, theoretic/axiomatic, or robustness/stability measures. Common properties evaluated include faithfulness (explanation fidelity to the model), plausibility (alignment with human intuition or annotation), simplicity (cognitive tractability), broadness (coverage and generality), and robustness (stability under input perturbation).

Categories

  • Functionally-grounded: occlusion, ablation, mutual information (Turbé et al., 2022; Nguyen et al., 2020)
  • Human-grounded: simulatability, information transfer, agreement (Schmidt et al., 2019; Ren et al., 8 Dec 2025)
  • Theoretic/axiomatic: interpretive efficiency, information-geometric measures (Katende, 6 Dec 2025; Do et al., 2019)
  • Robustness/stability: Lipschitz estimates, variance-based stability (Alvarez-Melis et al., 2018; Wang et al., 2022)

These properties are not mutually exclusive but form a multidimensional landscape. For instance, "Interpretive Efficiency" is theoretically grounded with axiomatic invariances and monotonicity (Katende, 6 Dec 2025), while "Information Transfer Rate" (ITR) directly measures the rate at which humans can reproduce model predictions given explanations—thus blending functional and human grounding (Schmidt et al., 2019).

2. Representative Metrics and Their Mathematical Definitions

Information Transfer Rate (ITR) and Trust Coefficient

ITR measures the interpretability of a method by quantifying how quickly and accurately humans replicate the predictions of a model when provided with its explanation:

\mathrm{ITR} = \frac{I(\hat Y_H, \hat Y_{ML})}{t},

where I(\cdot, \cdot) is the mutual information between human and model labels and t is the average response time. The trust coefficient T compares the ITR computed against the model's predictions with the ITR computed against ground truth:

T = \frac{\mathrm{ITR}_{\hat Y_{ML}}}{\mathrm{ITR}_Y}.

T > 1 indicates over-reliance on the model (Schmidt et al., 2019).
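
These two quantities can be sketched with empirical mutual information over discrete label lists; the function names and the discrete-label assumption are illustrative, not the paper's reference implementation:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, between two label lists."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum p(x,y) * log2( p(x,y) / (p(x) p(y)) ) over observed pairs.
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def itr(human_labels, reference_labels, mean_response_time):
    """Information Transfer Rate: bits of agreement per unit response time."""
    return mutual_information(human_labels, reference_labels) / mean_response_time

def trust_coefficient(human, model, truth, mean_response_time):
    """T = ITR w.r.t. model predictions / ITR w.r.t. ground truth.
    T > 1 flags over-reliance on the model."""
    return itr(human, model, mean_response_time) / itr(human, truth, mean_response_time)
```

With balanced binary labels, perfect human-model agreement yields 1 bit of mutual information, so a human who tracks the model more closely than the ground truth produces T > 1.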

Interpretive Efficiency (IE)

IE quantifies the fraction of task-relevant information transmitted by an interpretive representation \varphi:

\mathrm{IE}(\varphi; N) = \frac{S(\varphi; N)}{S_{\rm ref}(N)},

where S(\varphi; N) is a task-specific interpretive score and S_{\rm ref}(N) is a full-information reference. IE is normalized to [0, 1] and satisfies axioms such as Blackwell monotonicity, data-processing stability, and asymptotic consistency (Katende, 6 Dec 2025).

Simplicity, Broadness, and Fidelity

Nguyen and Rodríguez-Martínez decompose explanation evaluation into metrics on both the feature extractor and the explanation itself:

  • Feature simplicity: I(X;Z); lower values indicate a more compressed, less detailed representation.
  • Fidelity: I(Z;Y); higher values indicate more retained predictive content.
  • For example-based methods: Non-representativeness (average task loss), diversity (pairwise distance), cognitive complexity (number of examples).
  • For attribution methods: monotonicity (rank-correlation between attribution and true impact), non-sensitivity (exact zeros), and effective complexity (support size for near-complete explanation) (Nguyen et al., 2020).
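
The monotonicity item above can be sketched as a Spearman rank correlation between attribution scores and measured per-feature impacts; this small self-contained version assumes no rank ties:

```python
def ranks(values):
    """0-based ranks of values (ties broken by original position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def monotonicity(attributions, true_impacts):
    """Spearman rank correlation between attribution scores and the measured
    per-feature impact on the prediction; 1.0 means perfectly monotone."""
    ra, rb = ranks(attributions), ranks(true_impacts)
    n = len(ra)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(ra, rb))
    var = sum((a - mean) ** 2 for a in ra)
    return cov / var
```

An attribution that orders features exactly as their true impacts scores 1.0; a fully reversed ordering scores -1.0.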

Coverage-Weighted IR Metrics for Rules

In rule-based systems, standard IR metrics (TF/IDF) are adapted:

  • Weighted TF: coverage-weighted count of attribute appearances.
  • IDF: logarithm of the ratio of total activated rules to the number of rules containing a given factor.
  • Relevance: aggregated, inverted, and scaled TF×IDF so that common (not rare) factors are highlighted, enabling quantification of factor importance for global and local explanations (Umbrico et al., 8 Jul 2025).
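
One plausible reading of this coverage-weighted scheme follows, with the inversion done by dividing by IDF so that common factors rank highest; the exact aggregation and scaling in the cited paper may differ:

```python
from math import log

def factor_relevance(activated_rules, coverages):
    """Coverage-weighted TF with an inverted IDF scaling, so that factors
    shared by many activated rules score highest.
    `activated_rules`: list of sets of factor names; `coverages`: per-rule
    coverage weights. A hypothetical sketch, not the published formula."""
    n = len(activated_rules)
    factors = set().union(*activated_rules)
    scores = {}
    for f in factors:
        tf = sum(c for rule, c in zip(activated_rules, coverages) if f in rule)
        df = sum(1 for rule in activated_rules if f in rule)
        idf = log(n / df)               # 0 when the factor appears in every rule
        scores[f] = tf / (1.0 + idf)    # divide by IDF so common factors rank high
    return scores
```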

3. Faithfulness, Plausibility, and Robustness: Faithful Attribution and Alignment Metrics

Metrics targeting faithfulness (fidelity to the actual model behavior) typically leverage ablation or perturbation:

  • Pixel Flipping/Deletion AUC: Measures how quickly predicted class probability drops as "important" features are masked; faithfulness requires the most impactful features to be identified first (Wang et al., 2022).
  • Sufficiency and Comprehensiveness: Assess the minimal set of features required to retain the model prediction, or how much prediction drops when top features are removed (Zhou et al., 2022).
  • Human–Machine Interpretability (HMI): Overlap of machine-saliency maps with expert-annotated relevance, accounting for mass consistency (Turbé et al., 2022).
  • Consistency: Mean average precision (MAP) between token importance rankings before and after controlled perturbations, measuring stability under modification (Wang et al., 2022).
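
The deletion-curve AUC in the first item can be sketched as follows, assuming `predict` returns the target-class probability for a flat feature vector; the masking baseline and importance ordering are left to the caller:

```python
def deletion_auc(predict, x, importance_order, mask_value=0.0):
    """Deletion-curve AUC: mask features in decreasing-importance order and
    record the target-class probability after each step. Lower AUC suggests
    the attribution found the truly impactful features first."""
    x = list(x)                      # work on a copy
    probs = [predict(x)]
    for i in importance_order:
        x[i] = mask_value
        probs.append(predict(x))
    steps = len(probs) - 1
    # Trapezoidal rule over the normalized fraction-deleted axis.
    return sum((probs[k] + probs[k + 1]) / 2 for k in range(steps)) / steps
```

For a toy linear "model" whose output is the mean of its inputs, deleting all features in any order drives the curve linearly from 1 to 0, giving an AUC of 0.5.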

Alignment or plausibility is assessed via:

  • Agreement with human or clinical rankings, as in Shapley-based contrast importance for medical imaging, using normalized Spearman Footrule or other rank correlation metrics (Ren et al., 8 Dec 2025).
  • The Pointing Game: Hit-rate of explanation maps whose peak lands inside human-annotated object regions (Wang et al., 2022).
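
The Pointing Game reduces to a hit-rate over argmax saliency locations; a minimal sketch on flattened maps and boolean annotation masks:

```python
def pointing_game_accuracy(saliency_maps, annotation_masks):
    """Fraction of cases where the saliency argmax lands inside the
    human-annotated region. Maps are flattened lists of scores; masks are
    boolean lists of the same length."""
    hits = 0
    for saliency, mask in zip(saliency_maps, annotation_masks):
        best = max(range(len(saliency)), key=lambda i: saliency[i])
        hits += bool(mask[best])
    return hits / len(saliency_maps)
```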

Robustness is formalized via local Lipschitz estimates:

\hat{L}(x_i) = \max_{x_j \in B_\epsilon(x_i)} \frac{\|f(x_i) - f(x_j)\|_2}{\|x_i - x_j\|_2},

evaluated over continuous balls or discrete sample neighborhoods, with high \hat{L} indicating that explanations are unstable under small input changes (Alvarez-Melis et al., 2018).
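
A discrete-neighborhood version of this estimate can be sketched as follows; sampling the neighborhood around each input is left to the caller:

```python
def local_lipschitz(explain, x, neighbors):
    """Empirical local Lipschitz estimate for an explanation function:
    the max ratio of explanation change to input change over sampled
    neighbors. Large values flag explanation instability."""
    def l2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    fx = explain(x)
    return max(l2(fx, explain(xj)) / l2(x, xj) for xj in neighbors)
```

For an identity explainer the estimate is exactly 1; a well-behaved attribution method should stay near a small constant as the neighborhood shrinks.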

4. Domain-Specific and Task-Specific Metrics

Interpretability-driven metrics are adapted to diverse tasks:

  • Program Synthesis: The LLM-based INTerpretability (LINT) score measures how faithfully a program can be described in language and then reconstructed by an LLM, benchmarking "readable" code via behavioral equivalence (Bashir et al., 2023).
  • Medical Segmentation: Agreement and uncertainty metrics are derived from contrast-level Shapley value rankings, mapping alignment to clinical judgement and fold variance as proxies for reliability (Ren et al., 8 Dec 2025).
  • NLP Rationales: Quality (IQS) is a convex combination of plausibility (Jaccard with human-selected features), simplicity (chunk count, aligning with cognitive limits), and reproducibility (agreement between human and model predictions) (Xie et al., 2022).
  • Disentangled Representation Learning: RMIG (Robust Mutual Information Gap) and JEMMIG (Joint Entropy minus Mutual Information Gap) quantify whether latent coordinates isolate around known factors, using information-theoretic principles (Do et al., 2019).
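
The IQS-style convex combination for NLP rationales can be sketched as follows; the equal weights and the assumption that simplicity and reproducibility arrive pre-normalized to [0, 1] are illustrative placeholders, not the published parameterization:

```python
def jaccard(a, b):
    """Jaccard overlap between a model rationale and human-selected features."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def iqs(rationale, human_rationale, simplicity, reproducibility,
        weights=(1 / 3, 1 / 3, 1 / 3)):
    """Illustrative convex combination of plausibility (Jaccard with human
    rationale), simplicity, and reproducibility, each assumed in [0, 1]."""
    w_p, w_s, w_r = weights
    return (w_p * jaccard(rationale, human_rationale)
            + w_s * simplicity + w_r * reproducibility)
```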

5. Critical Discussion: Successes, Limitations, and Best Practices

A recurrent theme is the tension between faithfulness to the model's mechanism and plausibility to human priors. Performance-based metrics (e.g., area under the deletion curve, comprehensiveness) can be confounded by distribution shift during masking or substitution, nonadditive contribution structures, and overfitting to test metrics (Wang et al., 2022, Zhou et al., 2022). Alignment metrics (such as agreement with human rationales) risk rewarding explainer "style" over substance: an explainer could, for example, reproduce human-expected object boundaries regardless of the model's decision logic.

To mitigate these issues:

  • Always separate faithfulness and plausibility in evaluation reporting.
  • Minimize out-of-distribution effects by using realistic masking baselines and multiple perturbation methods.
  • Exploit both human- and functionally-grounded experiments; neither alone suffices to fully characterize interpretability utility.
  • Benchmark robustness using Lipschitz or variance-based stability metrics before deploying explanations in critical settings (Alvarez-Melis et al., 2018).
  • Post-hoc attribution metrics should be used with care when selecting systems for clinical or regulatory deployment; functionally faithful, robust, and cognitively accessible explanations have been empirically shown to increase user satisfaction and trust (Umbrico et al., 8 Jul 2025, Ren et al., 8 Dec 2025).

There is a growing movement toward axiomatic and information-geometric foundations for interpretability metrics, exemplified by Interpretive Efficiency (Katende, 6 Dec 2025). These frameworks aspire to unify disparate evaluation approaches, supplying invariance, monotonicity, and normalization guarantees. Domain-specific adaptations (e.g., agreement/uncertainty in medical imaging, LLM-based reconstructability for code) suggest that interpretability-driven metrics must retain flexibility for end-use requirements.

Open challenges remain:

  • Designing benchmarks and metrics that effectively balance robustness, flexibility, faithfulness, and human insight.
  • Developing perturbation schemes and baseline choices that yield fair and informative comparisons.
  • Building interpretable pipelines that separate and quantify the contribution of feature extraction versus explanation, thus making the tradeoffs between simplicity, broadness, and fidelity explicit (Nguyen et al., 2020).

In summary, interpretability-driven metrics furnish the quantitative backbone for research and practice in explainable artificial intelligence, enabling principled, reproducible, and context-aware assessment of both models and explanations across modalities and domains. Continued technical refinement and empirical grounding are essential for their successful and trustworthy deployment.
