How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations

Published 1 Mar 2025 in cs.CV | (2503.00641v1)

Abstract: Post-hoc importance attribution methods are a popular tool for "explaining" Deep Neural Networks (DNNs) and are inherently based on the assumption that the explanations can be applied independently of how the models were trained. Contrarily, in this work we bring forward empirical evidence that challenges this very notion. Surprisingly, we discover a strong dependency on and demonstrate that the training details of a pre-trained model's classification layer (less than 10 percent of model parameters) play a crucial role, much more than the pre-training scheme itself. This is of high practical relevance: (1) as techniques for pre-training models are becoming increasingly diverse, understanding the interplay between these techniques and attribution methods is critical; (2) it sheds light on an important yet overlooked assumption of post-hoc attribution methods which can drastically impact model explanations and how they are interpreted eventually. With this finding we also present simple yet effective adjustments to the classification layers, that can significantly enhance the quality of model explanations. We validate our findings across several visual pre-training frameworks (fully-supervised, self-supervised, contrastive vision-language training) and analyse how they impact explanations for a wide range of attribution methods on a diverse set of evaluation metrics.

Summary

The paper reveals that using BCE loss in probe training substantially improves the localization and stability of post-hoc attributions.
It demonstrates that more complex probe architectures, including non-linear MLP probes, improve both classification accuracy and attribution localization.
The study emphasizes revising probe training protocols to bridge the gap between model performance and reliable interpretability.

This paper investigates how training choices affect the quality of post-hoc explanations for deep neural networks (DNNs). It shows empirically that the way the classification layer is trained plays a critical yet largely unexplored role in obtaining reliable model attributions, a finding with significant implications for interpretable AI.

Introduction

Post-hoc attribution methods localize the input features that drive a DNN's decisions and so help interpret these often opaque models. Although such methods are widely used, their dependence on how a model was trained has received little attention. The authors show that the model's classification head, although it holds only a small fraction of the network's parameters, strongly affects the quality of the resulting explanations. The paper demonstrates this dependency across various attribution methods and pre-training paradigms, and proposes adjustments to the head's training that substantially improve explanation quality.

Figure 1: Impact of loss (BCE vs. CE). (a) EPG scores and (b) pixel deletion scores for B-cos and LRP attributions. BCE probes enhance localization and stability.

Methodology

The paper's methodological contribution is an empirical analysis of attribution quality across several tasks and evaluation metrics. Linear and multi-layer perceptron (MLP) probes are trained on frozen pre-trained features, and the resulting attributions are assessed with several interpretability metrics, notably localization via the grid and energy pointing games.
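To make the localization metric concrete, the following is a minimal sketch of an energy pointing game (EPG) score, assuming one common definition: the share of positive attribution that falls inside the object's bounding box. The function name `energy_pointing_game`, the half-open `(top, left, bottom, right)` box convention, and the toy attribution map are illustrative assumptions, not details taken from the paper.

```python
def energy_pointing_game(attribution, box):
    """Fraction of positive attribution energy that lies inside `box`.

    attribution: 2D list of floats (rows of an attribution map).
    box: (top, left, bottom, right), half-open, in pixel indices (assumed convention).
    """
    top, left, bottom, right = box
    inside = 0.0
    total = 0.0
    for r, row in enumerate(attribution):
        for c, value in enumerate(row):
            energy = max(value, 0.0)  # only positive evidence is counted
            total += energy
            if top <= r < bottom and left <= c < right:
                inside += energy
    # An all-zero (or all-negative) map has no energy to localize.
    return inside / total if total > 0 else 0.0


# Toy 4x4 attribution map; the object is assumed to occupy rows 1-2, columns 1-2.
attr = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.6, 0.2, 0.0],
    [0.0, 0.1, -0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
score = energy_pointing_game(attr, box=(1, 1, 3, 3))
```

Under this definition a score of 1.0 means all positive attribution lies on the object, so higher values indicate better localization.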

Setup and Evaluation

The central comparison is between probes trained with Binary Cross-Entropy (BCE) loss and probes trained with Cross-Entropy (CE) loss. The working hypothesis is that BCE alleviates the logit-shift issue that softmax introduces under CE: softmax is invariant to adding the same constant to every logit, so CE does not constrain the logits' absolute values. The evaluation uses a suite of metrics, including pixel deletion for robustness and entropy for compactness, and reports consistent improvements in explanation quality with BCE.
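The shift property behind this hypothesis can be checked numerically. The sketch below, written with only the Python standard library, adds a constant to every logit and compares how CE and BCE respond. The helper names and the toy logits are assumptions made for illustration; the BCE variant shown averages over classes with a one-hot target, which is one common convention and may differ from the paper's exact setup.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_loss(logits, target):
    """Cross-entropy for a single-label target index."""
    return -math.log(softmax(logits)[target])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(logits, target):
    """Mean binary cross-entropy against a one-hot target (one sigmoid per class)."""
    total = 0.0
    for k, z in enumerate(logits):
        y = 1.0 if k == target else 0.0
        p = sigmoid(z)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(logits)


logits = [2.0, -1.0, 0.5]
shifted = [z + 5.0 for z in logits]  # same constant added to every class

ce_gap = abs(ce_loss(logits, 0) - ce_loss(shifted, 0))     # unchanged under the shift
bce_gap = abs(bce_loss(logits, 0) - bce_loss(shifted, 0))  # clearly changes
```

Because CE cannot distinguish the two logit vectors, a CE-trained probe is free to carry an arbitrary per-input offset in its logits, whereas BCE penalizes it. Attribution methods that explain a single logit see that offset, which is a plausible reading of why the loss choice matters for explanations.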

Figure 2: Setup: Step 1. Linear or MLP probes are trained. Step 2. Explanation methods are applied.
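The two steps in this setup can be sketched end to end on toy data. In the example below, the "frozen features" are hard-coded vectors standing in for backbone outputs, the probe is a linear layer trained with BCE by plain gradient descent, and the explanation is a simple input-times-weight attribution over feature dimensions. All names, hyperparameters, and data are assumptions for illustration; the paper applies attribution methods such as B-cos and LRP through the full network down to input pixels, which this sketch does not attempt.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logits_for(x, weights, biases):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Step 1: fit a linear probe with BCE; the features themselves stay fixed."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, label in zip(features, labels):
            z = logits_for(x, weights, biases)
            for k in range(num_classes):
                target = 1.0 if k == label else 0.0
                grad = sigmoid(z[k]) - target  # d(BCE)/d(logit_k)
                for j in range(dim):
                    weights[k][j] -= lr * grad * x[j]
                biases[k] -= lr * grad
    return weights, biases


def input_times_weight(x, weights, k):
    """Step 2 (stand-in): per-dimension contribution to the logit of class k."""
    return [weights[k][j] * x[j] for j in range(len(x))]


# Toy stand-ins for frozen backbone features (assumed, not real model outputs).
features = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.8, 0.3]]
labels = [0, 0, 1, 1]

weights, biases = train_linear_probe(features, labels, num_classes=2)
predictions = []
for x in features:
    z = logits_for(x, weights, biases)
    predictions.append(z.index(max(z)))

contributions = input_times_weight(features[0], weights, predictions[0])
```

For a linear probe, the contributions plus the bias sum exactly to the explained logit, which makes it easy to see that any change in how the probe is trained changes the attribution directly.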

Results

BCE vs. CE Loss

The findings confirm that BCE-optimized probes systematically outperform their CE counterparts, yielding more localized attributions across a gamut of pre-training regimes including supervised learning, self-supervised learning (SSL), and vision-language training. This superiority manifests in higher grid and energy pointing scores, along with increased robustness to pixel perturbations.

Complexity of Probes

Moreover, non-linear MLP probes, specifically B-cos MLP probes, further improve both classification accuracy and attribution localization, which underscores the advantage of a more complex yet interpretable probe architecture.
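For readers unfamiliar with B-cos layers, the following sketch shows a single B-cos unit as it is usually described in the B-cos networks literature: a linear response with a unit-norm weight vector, scaled down by the cosine similarity between input and weight raised to the power B - 1. The function name `bcos_unit` and the default `b=2.0` are assumptions for illustration; the exact probe architecture and hyperparameters used in the paper are not given in this summary.

```python
import math


def bcos_unit(x, w, b=2.0):
    """Single B-cos unit: (w_hat . x) * |cos(x, w_hat)|**(b - 1).

    With b = 1 this reduces to a plain linear unit with a unit-norm weight.
    """
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    x_norm = math.sqrt(sum(xi * xi for xi in x))
    if w_norm == 0.0 or x_norm == 0.0:
        return 0.0  # cosine is undefined for zero vectors; return 0 by convention
    w_hat = [wi / w_norm for wi in w]
    linear = sum(wh * xi for wh, xi in zip(w_hat, x))
    cosine = linear / x_norm
    return linear * abs(cosine) ** (b - 1.0)


aligned = bcos_unit([3.0, 4.0], [3.0, 4.0])             # input parallel to the weight
oblique = bcos_unit([4.0, 3.0], [3.0, 4.0])             # same norm, less aligned
linear_only = bcos_unit([4.0, 3.0], [3.0, 4.0], b=1.0)  # no cosine down-weighting
```

The unit responds most strongly when the input aligns with its weight vector, and the response to a less aligned input of the same norm is suppressed relative to the purely linear case.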

Figure 3: BCE vs. CE. B-cos attributions show more localized effects with BCE-trained linear probes compared to CE.

Discussion

The practical implication is that the choice of training loss at the probe layer can improve downstream interpretability without altering the backbone model. These results call for reconsidering the common practice of assessing interpretability independently of training details, and suggest that tighter integration could improve both performance and explanation quality.

Conclusion

This study highlights the underappreciated yet pivotal role of probe training configurations in obtaining high-quality DNN attributions. By systematically exploring the effect of loss functions and probe complexity, it provides empirical and practical insights into optimizing AI systems for interpretability. Future work may build on these findings to design explanation-aware training protocols, further narrowing the gap between model performance and transparency.
