Gender Bias in Explainability: Investigating Performance Disparity in Post-hoc Methods
The paper "Gender Bias in Explainability: Investigating Performance Disparity in Post-hoc Methods" presents an empirical analysis of gender disparities in post-hoc explainability methods applied to pre-trained language models (PLMs). It addresses a largely unexplored aspect of fairness in AI: whether commonly used post-hoc feature attribution methods deliver explanations of equal quality across demographic groups. The study centers on evaluating explanation methods for their faithfulness, robustness, and complexity across gender subgroups.
Research Context and Objectives
Explainable AI has gained substantial interest, aiming to improve the transparency and accountability of complex models, including PLMs in NLP tasks. While explainability methods attempt to demystify how models arrive at predictions from input data, the fairness of explanation quality across demographic groups has received comparatively little attention. This paper fills that gap by systematically examining whether post-hoc explanation methods perform equitably across gender groups.
The researchers investigate the faithfulness, robustness, and complexity of post-hoc explanation methods (Gradient, Integrated Gradients (IG), SHAP, LIME, and their variants) across distinct PLMs. The evaluation relies on metrics such as comprehensiveness, sufficiency, sparsity, and sensitivity.
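The faithfulness metrics can be made concrete with a minimal sketch. The toy classifier, token lists, and function signatures below are illustrative assumptions, not the paper's implementation; the standard definitions are used: comprehensiveness measures the confidence drop when the top-attributed tokens are removed, sufficiency the drop when only those tokens are kept.

```python
from typing import Callable, List

def top_k_indices(attributions: List[float], k: int) -> set:
    """Indices of the k highest-attributed tokens."""
    order = sorted(range(len(attributions)), key=lambda i: attributions[i], reverse=True)
    return set(order[:k])

def comprehensiveness(predict: Callable[[List[str]], float],
                      tokens: List[str], attributions: List[float], k: int) -> float:
    """Confidence drop after REMOVING the k most-attributed tokens.
    Higher is better: removing truly important tokens should hurt the model."""
    top = top_k_indices(attributions, k)
    reduced = [t for i, t in enumerate(tokens) if i not in top]
    return predict(tokens) - predict(reduced)

def sufficiency(predict: Callable[[List[str]], float],
                tokens: List[str], attributions: List[float], k: int) -> float:
    """Confidence drop when KEEPING only the k most-attributed tokens.
    Lower is better: the top tokens alone should preserve the prediction."""
    top = top_k_indices(attributions, k)
    kept = [t for i, t in enumerate(tokens) if i in top]
    return predict(tokens) - predict(kept)

# Toy classifier: confidence grows with the count of "positive" cue words.
CUES = {"good", "great", "excellent"}
def toy_predict(tokens: List[str]) -> float:
    return min(1.0, 0.5 + 0.2 * sum(t in CUES for t in tokens))

tokens = ["the", "movie", "was", "great", "and", "excellent"]
attr   = [0.0,   0.1,     0.0,   0.9,     0.0,   0.8]
print(round(comprehensiveness(toy_predict, tokens, attr, 2), 2))  # 0.4
print(round(sufficiency(toy_predict, tokens, attr, 2), 2))        # 0.0
```

Here the attribution correctly singles out the cue words, so comprehensiveness is high (removing them drops confidence from 0.9 to 0.5) and sufficiency is ideal (keeping them alone preserves the prediction).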
Methodology
The authors conduct experiments on five language models (BERT, TinyBERT, GPT-2, RoBERTa-large, and FairBERTa), assessing explanation disparities across four datasets where gender plays a significant role. They evaluate five explanation methods on several metrics to quantify disparities, and employ statistical tests to determine whether discrepancies in explanation quality between the male and female subsets of the data are significant.
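The summary does not specify which statistical tests the authors use; as one plausible instance, a Welch's two-sample t-test on per-example metric scores can be computed with the standard library alone. The scores below are hypothetical, purely for illustration.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of
    freedom, comparing the means of two independent samples with possibly
    unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical per-example sufficiency scores, split by gender subgroup.
male_scores   = [0.21, 0.18, 0.25, 0.22, 0.19, 0.24]
female_scores = [0.31, 0.28, 0.35, 0.30, 0.33, 0.29]

t, df = welch_t(male_scores, female_scores)
print(f"t = {t:.2f}, df = {df:.1f}")  # a large |t| indicates a disparity
```

A large |t| relative to the t-distribution with `df` degrees of freedom would indicate that explanation quality differs significantly between the two subgroups.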
Key Findings
- Disparity Presence: In more than 54% of the cases examined, post-hoc explanation methods exhibit significant gender disparities in explanation quality, with methods such as IGxI, SHAP, and LIME showing the highest levels of unfairness.
- Impact Across Metrics: Disparities appear across all metrics. Notably, sensitivity shows the largest gaps, suggesting that perturbation-based methods are particularly susceptible to variations in gender context.
- Dataset Influence: Datasets with direct gender labels (GECO) highlighted stronger disparities compared to datasets where gender played a more implicit role (COMPAS), reflecting the dependency of bias on the nature of the task.
- Training Influence: The disparity persisted even when models were trained from scratch on gender-neutral datasets, indicating inherent biases in the explanation methods themselves rather than stemming purely from biased datasets or model training.
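The sensitivity finding above concerns how much an explanation changes under small input perturbations. The sketch below is a minimal illustration assuming a toy explainer and a single-token-swap perturbation scheme; the paper's actual perturbation procedure and explainers may differ.

```python
import random

def sensitivity(explain, tokens, n_perturb=10, seed=0):
    """Worst-case L1 change in attributions under random single-token swaps.
    A robust explainer should change little under small input perturbations."""
    rng = random.Random(seed)
    base = explain(tokens)
    neutral_vocab = ["person", "they", "individual"]  # hypothetical swap words
    worst = 0.0
    for _ in range(n_perturb):
        i = rng.randrange(len(tokens))
        perturbed = tokens[:i] + [rng.choice(neutral_vocab)] + tokens[i + 1:]
        change = sum(abs(a - b) for a, b in zip(base, explain(perturbed)))
        worst = max(worst, change)
    return worst

# Hypothetical explainer: attributes 1.0 to gendered words, 0.0 otherwise.
GENDERED = {"he", "she", "him", "her"}
def toy_explain(tokens):
    return [1.0 if t in GENDERED else 0.0 for t in tokens]

print(sensitivity(toy_explain, ["she", "is", "a", "doctor"]))
```

With this toy explainer, the attribution shifts sharply whenever a perturbation replaces the gendered token, which is the kind of instability a sensitivity metric is designed to expose.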
Implications and Recommendations
The findings underscore the need for stakeholders (developers, researchers, and policymakers) to scrutinize explanation methods not just for their ability to interpret models, but also for fairness across demographic subgroups. The paper suggests:
- Framework Adaptation: Incorporate fairness checks into explainability frameworks and regulatory AI guidelines, such as the EU AI Act, to ensure compliance and reduce liability risks in practical applications.
- Focus on Novel Solutions: Consider gender disparities explicitly when devising new explainability techniques, shifting toward fairness-centric innovation in the field.
- Future Research Directions: Explore ways to mitigate explanation disparities, for example by developing hybrid models or novel metrics that treat fairness as foundational to post-hoc explainability.
Conclusion
The study marks a significant advancement in understanding the fairness of explanation methods, revealing that current post-hoc strategies may inadvertently propagate bias across gender lines. By providing a comprehensive pipeline for evaluating explanation disparities, the paper paves the way for further exploration into more equitable AI interpretability methods. This research implies that addressing fairness in AI explanations is as critical as ensuring fairness in models themselves, promoting a holistic approach to equitable AI.