Evaluating Model Explanations without Ground Truth

Published 15 May 2025 in cs.AI and cs.LG | (2505.10399v1)

Abstract: There can be many competing and contradictory explanations for a single model prediction, making it difficult to select which one to use. Current explanation evaluation frameworks measure quality by comparing against ideal "ground-truth" explanations, or by verifying model sensitivity to important inputs. We outline the limitations of these approaches, and propose three desirable principles to ground the future development of explanation evaluation strategies for local feature importance explanations. We propose a ground-truth Agnostic eXplanation Evaluation framework (AXE) for evaluating and comparing model explanations that satisfies these principles. Unlike prior approaches, AXE does not require access to ideal ground-truth explanations for comparison, or rely on model sensitivity - providing an independent measure of explanation quality. We verify AXE by comparing with baselines, and show how it can be used to detect explanation fairwashing. Our code is available at https://github.com/KaiRawal/Evaluating-Model-Explanations-without-Ground-Truth.

Authors (4)

Summary

In high-stakes applications of artificial intelligence such as healthcare, finance, and criminal justice, reliable model explanations are critical. This paper by Rawal et al. addresses the problem of evaluating model explanations when no "ground-truth" explanation is available for comparison. Current evaluation methods either compare an explanation against an ideal ground-truth explanation or check whether the model is sensitive to the inputs the explanation marks as important. Both approaches have significant limitations, because they depend on assumptions that may not hold in practical scenarios.

Rawal et al. introduce an alternative framework, ground-truth Agnostic eXplanation Evaluation (AXE), which evaluates model explanations without requiring ground-truth explanations or relying on model sensitivity. AXE adheres to three principles the authors consider crucial for explanation evaluation: local contextualization, model relativism, and on-manifold evaluation. Together, these keep the evaluation independent of ground-truth comparisons and of off-manifold sensitivity perturbations.

The AXE framework focuses on local feature importance explanations, which are pertinent to models operating on tabular data, and is built on the premise that a reliable explanation identifies the features that are most predictive of the model's output. To test this, AXE uses k-Nearest Neighbors (k-NN) models to emulate model behavior, measuring how predictive the top-n features marked as important by the explanation are. In this way, AXE evaluates how closely a human-interpretable explanation matches model behavior on a given dataset.
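The k-NN idea described above can be sketched in a few lines of plain Python. This is a minimal illustration of one plausible reading of the description, not the authors' implementation: the function names (`top_n_features`, `knn_vote`, `axe_style_score`), the leave-one-out scheme, the Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter

def top_n_features(importances, n):
    """Indices of the n features with the largest absolute importance."""
    order = sorted(range(len(importances)),
                   key=lambda j: abs(importances[j]), reverse=True)
    return order[:n]

def knn_vote(i, X, labels, feats, k):
    """Majority label among the k nearest neighbors of X[i] (excluding X[i]),
    with distances computed only on the feature indices in `feats`."""
    query = [X[i][f] for f in feats]
    dists = sorted(
        (math.dist(query, [X[j][f] for f in feats]), j)
        for j in range(len(X)) if j != i
    )
    votes = Counter(labels[j] for _, j in dists[:k])
    return votes.most_common(1)[0][0]

def axe_style_score(X, model_preds, explanations, n=1, k=3):
    """Fraction of points whose model prediction is recovered by a k-NN
    that sees only that point's top-n explained features (assumed scoring rule)."""
    hits = sum(
        knn_vote(i, X, model_preds, top_n_features(explanations[i], n), k)
        == model_preds[i]
        for i in range(len(X))
    )
    return hits / len(X)

# Hypothetical toy data: the "model" uses only feature 0; feature 1 is unrelated.
X = [(i, (7 * i) % 20) for i in range(20)]
model_preds = [1 if x[0] >= 10 else 0 for x in X]

honest = [[1.0, 0.0]] * len(X)       # explanation points at the feature the model uses
manipulated = [[0.0, 1.0]] * len(X)  # explanation points at the unrelated feature

print("honest explanation score:     ", axe_style_score(X, model_preds, honest))
print("manipulated explanation score:", axe_style_score(X, model_preds, manipulated))
```

In this toy setting, the explanation that names the feature the model actually uses scores higher than the one that names the unrelated feature. Note that the score uses only existing data points and the model's own predictions, so no ground-truth explanation and no input perturbation are involved.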

Key findings demonstrate AXE's efficacy in detecting adversarial fairwashing, where manipulated explanations falsely certify a model as fair even though it uses discriminatory features, as well as its robustness across different datasets and explainers. AXE consistently identifies manipulated explanations, an advantage over existing sensitivity-based evaluation metrics such as PGI and PGU. Whereas such metrics may favor specific explainers whose mechanisms resemble the evaluation protocol, AXE delivers an impartial assessment grounded in the model's actual behavior.
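For contrast, a sensitivity-based metric in the style of PGI and PGU can be sketched as follows. The summary does not define these metrics, so the definitions used here are assumptions based on common usage: PGI is taken as the average change in model output when the top-n important features are perturbed with Gaussian noise, and PGU as the same quantity for the remaining features. The function name `prediction_gap`, the noise scale, and the toy model are hypothetical.

```python
import math
import random

def prediction_gap(model, x, importances, n, top=True,
                   sigma=0.5, samples=200, seed=0):
    """Mean |f(x) - f(x')| where x' adds Gaussian noise to the top-n features
    (top=True, PGI-style) or to all other features (top=False, PGU-style)."""
    rng = random.Random(seed)
    order = sorted(range(len(x)),
                   key=lambda j: abs(importances[j]), reverse=True)
    chosen = set(order[:n] if top else order[n:])
    base = model(x)
    total = 0.0
    for _ in range(samples):
        # Noise is added without checking whether x' resembles real data.
        noisy = [v + rng.gauss(0.0, sigma) if j in chosen else v
                 for j, v in enumerate(x)]
        total += abs(base - model(noisy))
    return total / samples

def toy_model(x):
    """Hypothetical model: a logistic function of feature 0 only."""
    return 1.0 / (1.0 + math.exp(-4.0 * x[0]))

x = [0.2, -1.0]
honest = [1.0, 0.0]      # marks feature 0 as important
misleading = [0.0, 1.0]  # marks feature 1 as important

print("PGI, honest:    ", prediction_gap(toy_model, x, honest, n=1, top=True))
print("PGU, honest:    ", prediction_gap(toy_model, x, honest, n=1, top=False))
print("PGI, misleading:", prediction_gap(toy_model, x, misleading, n=1, top=True))
```

Because the perturbed inputs are sampled without regard to the data distribution, a metric of this kind evaluates the model at points it may never encounter, which is the off-manifold concern that AXE's third principle is meant to avoid.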

AXE has both theoretical and practical implications for accountability and transparency in AI systems. It offers a methodological alternative to conventional sensitivity analyses, whose assumptions may obscure how a model actually behaves. AXE also provides a scalable, agnostic approach to evaluating explanations that could be adapted to diverse AI applications, serving as a foundation for further work in explainable AI.

The results suggest that explainability researchers should prioritize predictiveness and fidelity to model behavior over external assumptions. Future work might extend AXE to other model types and data modalities, including image and text, building on its principles to address explainability challenges in broader AI contexts. As AI systems continue to enter high-stakes environments, evaluation frameworks like AXE can help support trust and fairness in AI deployment.