Hypothesis-only Biases in Large Language Model-Elicited Natural Language Inference

Published 11 Oct 2024 in cs.CL | (2410.08996v1)

Abstract: We test whether replacing crowdsource workers with LLMs to write Natural Language Inference (NLI) hypotheses similarly results in annotation artifacts. We recreate a portion of the Stanford NLI corpus using GPT-4, Llama-2 and Mistral 7b, and train hypothesis-only classifiers to determine whether LLM-elicited hypotheses contain annotation artifacts. On our LLM-elicited NLI datasets, BERT-based hypothesis-only classifiers achieve between 86-96% accuracy, indicating these datasets contain hypothesis-only artifacts. We also find frequent "give-aways" in LLM-generated hypotheses, e.g. the phrase "swimming in a pool" appears in more than 10,000 contradictions generated by GPT-4. Our analysis provides empirical evidence that well-attested biases in NLI can persist in LLM-generated data.

Summary

  • The paper demonstrates that hypothesis-only classifiers can achieve 86-96% accuracy on LLM-generated NLI datasets, revealing significant annotation artifacts.
  • It replicates SNLI methodologies using GPT-4, Llama-2, and Mistral 7b to compare the bias patterns between LLM- and human-generated hypotheses.
  • Findings indicate that LLMs inherit bias patterns similar to human data, emphasizing the need for enhanced dataset curation and bias mitigation strategies.

Analysis of Hypothesis-only Biases in LLM-Generated NLI Data

The paper "Hypothesis-only Biases in LLM-Elicited Natural Language Inference" examines whether the annotation artifacts well attested in human-written Natural Language Inference (NLI) datasets also appear in datasets generated by LLMs. The researchers, Grace Proebsting and Adam Poliak, investigate the presence of such artifacts and assess their implications for hypothesis-only classification models.

Experimentation and Methodology

The study recreates a portion of the Stanford NLI (SNLI) corpus using three LLMs: GPT-4, Llama-2, and Mistral 7b. By prompting the models with the same instructions originally given to human crowdworkers, the researchers generated hypotheses for given premises, allowing a controlled comparison between human- and LLM-generated data.
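The elicitation step can be sketched as follows. This is a hypothetical illustration: the prompt below only paraphrases the spirit of the SNLI annotation instructions (write one definitely-true, one possibly-true, and one definitely-false sentence per premise), not the exact wording the authors reused, and `build_prompt` is an invented helper.

```python
# Hypothetical sketch of how NLI hypotheses might be elicited from an LLM.
# The instruction text paraphrases the SNLI annotation task; the paper's
# exact prompt wording and model-calling code are not reproduced here.
def build_prompt(premise: str) -> str:
    return (
        "Given the caption below, write three sentences:\n"
        "1. A sentence that is definitely true given the caption.\n"
        "2. A sentence that might be true given the caption.\n"
        "3. A sentence that is definitely false given the caption.\n"
        f"Caption: {premise}"
    )

# The resulting prompt would be sent to GPT-4, Llama-2, or Mistral 7b,
# and the three returned sentences labeled entailment/neutral/contradiction.
print(build_prompt("Two dogs run across a field."))
```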

Once the datasets were created, hypothesis-only classifiers based on Naive Bayes and BERT were trained to predict the NLI label from the hypothesis alone, without access to the premise. The BERT-based classifiers achieved 86% to 96% accuracy on the LLM-generated datasets, indicating a substantial presence of annotation artifacts that could bias downstream results.
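The hypothesis-only setup can be sketched with an off-the-shelf Naive Bayes classifier. This is a minimal illustration, not the paper's training code: the tiny dataset below is invented, whereas the paper trains on full LLM-recreated SNLI splits.

```python
# Sketch of a hypothesis-only NLI classifier: the premise is never seen.
# Toy data invented for illustration; labels follow the SNLI scheme.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

hypotheses = [
    "A man is outside.",                       # entailment-style paraphrase
    "The man might be on vacation.",           # neutral-style speculation
    "The man is swimming in a pool.",          # contradiction-style giveaway
    "A woman is outdoors.",
    "The woman could be waiting for a friend.",
    "The woman is sleeping indoors.",
]
labels = ["entailment", "neutral", "contradiction"] * 2

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(hypotheses, labels)

# Above-chance accuracy from the hypothesis alone signals artifacts:
# the label leaks through the wording, not the premise-hypothesis relation.
print(clf.predict(["He is swimming in a pool."]))  # -> ['contradiction']
```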

Key Findings

  1. Presence of Annotation Artifacts: The study found that LLM-generated NLI datasets do contain annotation artifacts similar to those found in human-generated datasets. The high accuracy of hypothesis-only classifiers substantiates this claim.
  2. Common Give-Away Words: Certain phrases appear disproportionately under specific labels, giving them strong indicative power. For example, "swimming in a pool" appeared in over 10,000 contradictions generated by GPT-4.
  3. Model Bias Similarity: Notably, hypothesis-only models trained on SNLI performed better on the GPT-4 dataset than on SNLI itself, suggesting that the two datasets share comparable biases. The LLMs also tended to exhibit similar bias patterns to one another.
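The give-away analysis above amounts to asking how strongly an n-gram predicts a label. A minimal sketch using the simple statistic p(label | n-gram), a stand-in for the smoothed association measures used in prior artifact analyses; the data below is invented and far smaller than the paper's corpora.

```python
# Sketch of give-away detection: how strongly does an n-gram predict a label?
# A score of 1.0 means the phrase fully gives the label away.
from collections import Counter

examples = [  # invented stand-in for an LLM-elicited NLI dataset
    ("The dog is swimming in a pool.", "contradiction"),
    ("A cat is swimming in a pool.", "contradiction"),
    ("An animal is outside.", "entailment"),
    ("The dog might be a stray.", "neutral"),
]

def ngrams(text, n=3):
    toks = text.lower().strip(".").split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

joint = Counter()   # (ngram, label) document counts
total = Counter()   # ngram document counts over all labels
for hyp, label in examples:
    for g in set(ngrams(hyp)):
        joint[(g, label)] += 1
        total[g] += 1

def giveaway_score(ngram, label):
    """Estimate p(label | ngram) from document counts."""
    return joint[(ngram, label)] / total[ngram] if total[ngram] else 0.0

print(giveaway_score("swimming in a", "contradiction"))  # -> 1.0
```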

Implications and Future Directions

The implications of this research are multifaceted. Practically, it indicates a need for comprehensive quality control and dataset filtering when utilizing LLMs for generating NLP datasets. Theoretically, the findings suggest that LLMs, while efficient, may inherit systematic biases from their training processes, which can degrade the quality and reliability of NLP applications.

Looking forward, research could explore methods to mitigate these biases, such as improved prompt engineering, more diverse generation data, or post-generation filtering techniques. Understanding the root causes of these artifacts, and developing models less susceptible to them, are further promising directions for research.

Conclusion

The paper provides a meticulous examination of the biases inherent in LLM-generated NLI datasets, offering insights critical both for the use of LLMs in NLP data creation and for the broader understanding of bias propagation in AI systems. The results underscore the need for ongoing vigilance and innovation in dataset curation and model training methodologies.
