- The paper identifies that fact verification models can exploit dataset biases, with claim-only models reaching 61.7% accuracy without evidence.
- It introduces a Symmetric Test Set by generating balanced claim-evidence pairs to rigorously evaluate and challenge biased model behavior.
- A novel regularization method is proposed to attenuate the impact of 'give-away' phrases, thereby enhancing model generalization on unbiased data.
Evaluation and Mitigation of Bias in Fact Verification Models
The paper "Towards Debiasing Fact Verification Models" addresses a critical evaluation limitation in existing fact verification datasets, most notably the Fact Extraction and Verification (FEVER) dataset. The authors contend that FEVER exhibits significant biases that models can exploit, allowing them to perform well on the verification task without truly engaging with the evidence. The paper analyzes this phenomenon, presents an approach for creating an unbiased evaluation set, and introduces a method for mitigating these biases during training.
The core issue is that datasets like FEVER lead models to rely on dataset-specific idiosyncrasies rather than the robust reasoning that human fact-checking requires. Specifically, the paper demonstrates that a claim-only model, given no evidence at all, achieves 61.7% accuracy on the FEVER task, far above the 33.3% expected from the balanced three-way label distribution. This gain is largely attributable to 'give-away' phrases in the claims that correlate strongly with specific labels (e.g., with 'Refutes').
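One way to surface such give-away phrases is to score each claim bigram by how strongly it correlates with a label, for instance via local mutual information (LMI). The sketch below is a minimal, self-contained illustration on toy data, not the paper's exact procedure; the function name and the toy claims are hypothetical.

```python
import math
from collections import Counter

def giveaway_scores(claims, labels):
    """Rank claim bigrams by local mutual information (LMI) with labels.

    LMI(w, l) = p(w, l) * log(p(l | w) / p(l)); high values flag
    'give-away' phrases that co-occur disproportionately with one label.
    """
    pair_counts = Counter()   # (bigram, label) occurrences
    ngram_counts = Counter()  # bigram occurrences
    label_counts = Counter()  # label occurrences (per bigram token)
    total = 0
    for claim, label in zip(claims, labels):
        tokens = claim.lower().split()
        for bigram in zip(tokens, tokens[1:]):
            pair_counts[(bigram, label)] += 1
            ngram_counts[bigram] += 1
            label_counts[label] += 1
            total += 1
    scores = {}
    for (bigram, label), c in pair_counts.items():
        p_joint = c / total
        p_label_given_w = c / ngram_counts[bigram]
        p_label = label_counts[label] / total
        scores[(bigram, label)] = p_joint * math.log(p_label_given_w / p_label)
    return scores

# Toy illustration: the negation bigram 'did not' skews toward REFUTES.
claims = ["he did not win", "she did not act",
          "he won the award", "she wrote the book"]
labels = ["REFUTES", "REFUTES", "SUPPORTS", "SUPPORTS"]
scores = giveaway_scores(claims, labels)
top = max(scores, key=scores.get)
print(top)  # → (('did', 'not'), 'REFUTES')
```

On real data one would rank all (bigram, label) pairs and inspect the head of the list; the repeated bigram dominates here because it always co-occurs with the same label.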
To rigorously evaluate this tendency, the authors propose a Symmetric Test Set that prevents reliance on idiosyncratic cues. For each original claim-evidence pair, a new claim and new evidence are written so that the cross-combinations flip the label: every claim appears with both supporting and refuting evidence, yielding four symmetric pairs per original example. Evidence-aware classifiers suffer a marked performance drop on these symmetric examples, revealing that much of their apparent accuracy rests on claim-side biases rather than contextual understanding.
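The four-pair construction can be sketched as a small helper. This is an illustrative reconstruction of the combination logic described above, with hypothetical function name and example sentences; the paper's actual pairs were written by annotators.

```python
def symmetric_pairs(orig_claim, orig_evidence, new_claim, new_evidence,
                    orig_label, flipped_label):
    """Build the four symmetric claim-evidence combinations.

    Pairing a claim with the evidence written for the *other* claim
    flips the label, so each claim appears once with each label and a
    claim-only cue no longer predicts the answer.
    """
    return [
        (orig_claim, orig_evidence, orig_label),
        (orig_claim, new_evidence, flipped_label),
        (new_claim, orig_evidence, flipped_label),
        (new_claim, new_evidence, orig_label),
    ]

# Hypothetical example of the four combinations:
pairs = symmetric_pairs(
    "The film was directed by X.",         # original claim
    "X directed the film.",                # original evidence
    "The film was not directed by X.",     # synthetic flipped claim
    "The film was directed by Y, not X.",  # synthetic flipped evidence
    "SUPPORTS", "REFUTES",
)
```

Because each claim now occurs with both labels, a model that keys on phrases like "was not" can do no better than chance on this set.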
A significant technical contribution is a regularization method that reduces model sensitivity to biased phrases during training. The method re-weights training examples so that phrases highly indicative of one label are balanced against counterexamples, attenuating the correlation between those phrases and particular class labels. Experiments with ESIM- and BERT-based models suggest that this strategy improves generalization to the unbiased symmetric examples, at the cost of a slight performance drop on the biased training and development sets.
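The re-weighting idea can be sketched as follows: examples whose claims contain flagged give-away bigrams receive a reduced weight in a weighted loss. This is a simplified stand-in, assuming a fixed down-weighting constant, whereas the paper optimizes the example weights directly; all names here are hypothetical.

```python
import math

def debias_weights(claims, labels, biased_ngrams, down=0.2):
    """Assign smaller loss weights to examples whose claims contain a
    give-away bigram for their gold label.

    `biased_ngrams` is a set of (bigram, label) pairs flagged as biased
    (e.g., by an LMI ranking); `down` is an assumed constant.
    """
    weights = []
    for claim, label in zip(claims, labels):
        tokens = claim.lower().split()
        bigrams = set(zip(tokens, tokens[1:]))
        biased = any((bg, label) in biased_ngrams for bg in bigrams)
        weights.append(down if biased else 1.0)
    return weights

def weighted_nll(prob_of_gold, weights):
    """Weighted negative log-likelihood: each example's loss term is
    scaled by its debiasing weight, then normalized by the weight sum."""
    total_w = sum(weights)
    return sum(w * -math.log(p)
               for p, w in zip(prob_of_gold, weights)) / total_w

# Biased example contributes less to the loss than the clean one.
w = debias_weights(
    ["he did not win", "he won the award"],
    ["REFUTES", "SUPPORTS"],
    {(("did", "not"), "REFUTES")},
)
loss = weighted_nll([0.7, 0.6], w)
```

Down-weighting rather than discarding biased examples keeps the full training signal available while flattening the phrase-label correlation the model could otherwise exploit.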
The paper thus both exposes critical vulnerabilities in a widely used dataset and articulates a two-pronged response: a more rigorous evaluation framework through the Symmetric Test Set, and improved model robustness through targeted re-weighting. The implications extend to other natural language processing tasks with similar dataset biases, suggesting pathways toward more reliable evaluation and stronger interpretive capabilities in neural models. Practically, these insights point future work toward more careful dataset design and training methodologies that emphasize evidence-based reasoning.
Future work may extend the symmetric test set approach to larger datasets and varied domains, and assess the robustness and transferability of the proposed regularization method under different fact-checking requirements. This research contributes meaningfully to the discourse on the integrity of AI systems tasked with language understanding and fact verification, advancing toward models that not only score well numerically but align more closely with human reasoning.