Gender Bias in Explainability: Investigating Performance Disparity in Post-hoc Methods
The paper "Gender Bias in Explainability: Investigating Performance Disparity in Post-hoc Methods" presents an empirical analysis of gender disparities in post-hoc explainability methods applied to pre-trained language models (PLMs). It addresses a largely unexplored aspect of fairness in AI: whether commonly used post-hoc feature attribution methods deliver explanations of equal quality across demographic groups. The study centers on evaluating explanation methods for their faithfulness, robustness, and complexity across gender subgroups.
Research Context and Objectives
Explainable AI has gained substantial interest, aiming to improve the transparency and accountability of complex models, including PLMs in NLP tasks. While explainability methods attempt to demystify how models arrive at predictions from input data, the fairness of explanation quality across demographic groups has received comparatively little attention. This paper fills that gap by systematically examining whether post-hoc explanation methods perform equitably across gender groups.
The researchers investigate the faithfulness, robustness, and complexity of post-hoc explanation methods (Gradient, Integrated Gradients (IG), SHAP, LIME, and their variants) across distinct PLMs. The evaluation relies on metrics such as comprehensiveness, sufficiency, sparsity, and sensitivity.
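The faithfulness metrics can be made concrete with a minimal sketch. The toy classifier, token lists, and function signatures below are illustrative assumptions, not the paper's implementation; the standard definitions are used: comprehensiveness measures the confidence drop when the top-attributed tokens are removed, sufficiency the drop when only those tokens are kept.

```python
from typing import Callable, List

def top_k_indices(attributions: List[float], k: int) -> set:
    """Indices of the k highest-attributed tokens."""
    order = sorted(range(len(attributions)), key=lambda i: attributions[i], reverse=True)
    return set(order[:k])

def comprehensiveness(predict: Callable[[List[str]], float],
                      tokens: List[str], attributions: List[float], k: int) -> float:
    """Confidence drop after REMOVING the k most-attributed tokens.
    Higher is better: removing truly important tokens should hurt the model."""
    top = top_k_indices(attributions, k)
    reduced = [t for i, t in enumerate(tokens) if i not in top]
    return predict(tokens) - predict(reduced)

def sufficiency(predict: Callable[[List[str]], float],
                tokens: List[str], attributions: List[float], k: int) -> float:
    """Confidence drop when KEEPING only the k most-attributed tokens.
    Lower is better: the top tokens alone should preserve the prediction."""
    top = top_k_indices(attributions, k)
    kept = [t for i, t in enumerate(tokens) if i in top]
    return predict(tokens) - predict(kept)

# Toy classifier: confidence grows with the count of "positive" cue words.
CUES = {"good", "great", "excellent"}
def toy_predict(tokens: List[str]) -> float:
    return min(1.0, 0.5 + 0.2 * sum(t in CUES for t in tokens))

tokens = ["the", "movie", "was", "great", "and", "excellent"]
attr   = [0.0,   0.1,     0.0,   0.9,     0.0,   0.8]
print(round(comprehensiveness(toy_predict, tokens, attr, 2), 2))  # 0.4
print(round(sufficiency(toy_predict, tokens, attr, 2), 2))        # 0.0
```

Here the attribution correctly singles out the cue words, so comprehensiveness is high (removing them drops confidence from 0.9 to 0.5) and sufficiency is ideal (keeping them alone preserves the prediction).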
Methodology
The authors conduct experiments on five language models (BERT, TinyBERT, GPT-2, RoBERTa-large, and FairBERTa), assessing explanation disparities across four datasets where gender plays a significant role. They evaluate five explanation methods on several metrics to quantify disparities, and employ statistical tests to determine whether discrepancies in explanation quality between the male and female subsets of the data are significant.
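The summary does not specify which statistical tests the authors use; as one plausible instance, a Welch's two-sample t-test on per-example metric scores can be computed with the standard library alone. The scores below are hypothetical, purely for illustration.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of
    freedom, comparing the means of two independent samples with possibly
    unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical per-example sufficiency scores, split by gender subgroup.
male_scores   = [0.21, 0.18, 0.25, 0.22, 0.19, 0.24]
female_scores = [0.31, 0.28, 0.35, 0.30, 0.33, 0.29]

t, df = welch_t(male_scores, female_scores)
print(f"t = {t:.2f}, df = {df:.1f}")  # a large |t| indicates a disparity
```

A large |t| relative to the t-distribution with `df` degrees of freedom would indicate that explanation quality differs significantly between the two subgroups.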
Key Findings
- Disparity Presence: In more than 54% of the cases examined, post-hoc explanation methods exhibit significant gender disparities in explanation quality, with methods such as IGxI, SHAP, and LIME showing the highest levels of unfairness.
- Impact Across Metrics: Disparities appear across all metrics. Notably, sensitivity shows the largest gaps, suggesting that perturbation-based methods are particularly susceptible to variations in gender context.
- Dataset Influence: Datasets with direct gender labels (GECO) highlighted stronger disparities compared to datasets where gender played a more implicit role (COMPAS), reflecting the dependency of bias on the nature of the task.
- Training Influence: The disparity persisted even when models were trained from scratch on gender-neutral datasets, indicating inherent biases in the explanation methods themselves rather than stemming purely from biased datasets or model training.
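The sensitivity finding above concerns how much an explanation changes under small input perturbations. The sketch below is a minimal illustration assuming a toy explainer and a single-token-swap perturbation scheme; the paper's actual perturbation procedure and explainers may differ.

```python
import random

def sensitivity(explain, tokens, n_perturb=10, seed=0):
    """Worst-case L1 change in attributions under random single-token swaps.
    A robust explainer should change little under small input perturbations."""
    rng = random.Random(seed)
    base = explain(tokens)
    neutral_vocab = ["person", "they", "individual"]  # hypothetical swap words
    worst = 0.0
    for _ in range(n_perturb):
        i = rng.randrange(len(tokens))
        perturbed = tokens[:i] + [rng.choice(neutral_vocab)] + tokens[i + 1:]
        change = sum(abs(a - b) for a, b in zip(base, explain(perturbed)))
        worst = max(worst, change)
    return worst

# Hypothetical explainer: attributes 1.0 to gendered words, 0.0 otherwise.
GENDERED = {"he", "she", "him", "her"}
def toy_explain(tokens):
    return [1.0 if t in GENDERED else 0.0 for t in tokens]

print(sensitivity(toy_explain, ["she", "is", "a", "doctor"]))
```

With this toy explainer, the attribution shifts sharply whenever a perturbation replaces the gendered token, which is the kind of instability a sensitivity metric is designed to expose.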
Implications and Recommendations
The findings underscore the need for stakeholders (developers, researchers, and policymakers) to scrutinize explanation methods not just for their ability to interpret models, but also for fairness across demographic subgroups. The paper suggests:
- Framework Adaptation: Incorporate fairness checks into explainability frameworks and regulatory AI guidelines, such as the EU AI Act, to ensure compliance and reduce liability risks in practical applications.
- Focus on Novel Solutions: Consider gender disparities explicitly when devising new explainability techniques, shifting toward fairness-centric innovation in the field.
- Future Research Directions: Explore ways to mitigate explanation disparities, for example by developing hybrid models or novel metrics that treat fairness as foundational to post-hoc explainability.
Conclusion
The study marks a significant advancement in understanding the fairness of explanation methods, revealing that current post-hoc strategies may inadvertently propagate bias across gender lines. By providing a comprehensive pipeline for evaluating explanation disparities, the paper paves the way for further exploration into more equitable AI interpretability methods. This research implies that addressing fairness in AI explanations is as critical as ensuring fairness in models themselves, promoting a holistic approach to equitable AI.