
Human Bias in the Face of AI: The Role of Human Judgement in AI Generated Text Evaluation

Published 29 Sep 2024 in cs.CL, cs.AI, and cs.HC | (2410.03723v1)

Abstract: As AI advances in text generation, human trust in AI generated content remains constrained by biases that go beyond concerns of accuracy. This study explores how bias shapes the perception of AI versus human generated content. Through three experiments involving text rephrasing, news article summarization, and persuasive writing, we investigated how human raters respond to labeled and unlabeled content. While the raters could not differentiate the two types of texts in the blind test, they overwhelmingly favored content labeled as "Human Generated" over content labeled "AI Generated," by a preference score of over 30%. We observed the same pattern even when the labels were deliberately swapped. This human bias against AI has broader societal and cognitive implications, as it undervalues AI performance. This study highlights the limitations of human judgment in interacting with AI and offers a foundation for improving human-AI collaboration, especially in creative fields.

Summary

  • The paper demonstrates that human evaluators preferred text labeled as 'Human Generated' by a margin of over 30%, even though they could not distinguish AI from human text in blind tests.
  • It employs blind experiments in rephrasing, summarization, and persuasive writing using Amazon Mechanical Turk to compare AI and human text.
  • The findings highlight implications for human-AI collaboration, urging the development of transparent and explainable AI systems to mitigate biases.

Human Bias in the Evaluation of AI-Generated Text

The paper "Human Bias in the Face of AI: The Role of Human Judgement in AI Generated Text Evaluation" by Zhu et al. investigates the human biases that influence the perception and evaluation of AI-generated text. This research is of particular significance as LLMs increasingly permeate everyday tasks involving text generation, yet are met with skepticism from users who perceive AI-generated content differently than human-produced text.

Key Findings

Through three distinct experimental scenarios—text rephrasing, summarization, and persuasive writing—the paper reveals that human evaluators under blind conditions could not consistently distinguish between human and AI-generated text. Despite this difficulty, evaluators displayed a notable preference for text labeled as "Human Generated" by a margin exceeding 30%, even when the labels were incorrect. Such biases suggest a perception issue where AI-generated content is undervalued simply due to its origin, highlighting substantial implications for human-AI collaboration.

Methodology

The study employed a series of experiments, using Amazon Mechanical Turk to gather human assessments across the three scenarios. Text was collected from AI models (ChatGPT-4, Claude 2, Llama 3.1) and compared against human-generated text of similar length. Preference scores from raters were used to measure how bias influenced evaluations under three labeling conditions: correctly labeled, label-swapped, and unlabeled (blind).
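The core measurement in this design is a preference score over pairwise choices. The sketch below illustrates how such a score could be computed; the data format and function name are assumptions for illustration, not the authors' actual analysis pipeline.

```python
# Minimal sketch of a labeled-preference analysis. Each entry in `choices`
# is the label the rater preferred in one pairwise trial (hypothetical format).

def preference_score(choices, label):
    """Return the percentage of pairwise choices favoring `label`."""
    favored = sum(1 for c in choices if c == label)
    return 100.0 * favored / len(choices)

# Toy data: a 65/35 split, which corresponds to the >30-point margin
# reported in the paper (65% - 35% = 30 points).
choices = ["Human Generated"] * 65 + ["AI Generated"] * 35

human_pref = preference_score(choices, "Human Generated")
ai_pref = preference_score(choices, "AI Generated")
margin = human_pref - ai_pref

print(f"Human-labeled text preferred by {margin:.0f} points")
```

Comparing this margin across the correctly labeled, label-swapped, and blind conditions is what isolates label-driven bias from genuine quality differences: if the margin persists when labels are swapped, it cannot reflect the text itself.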

Implications

The findings point towards cognitive and societal biases that undervalue AI outputs despite their comparable quality to human-produced content. These biases pose challenges not only in the broader acceptance of AI technologies but also in AI system training processes like Reinforcement Learning from Human Feedback (RLHF). Understanding these biases can lead to more effective strategies for deploying AI technologies in human-centric scenarios.

Practical Applications and Future Directions

To promote better human-AI collaboration, the study advocates for transparency and clarity in AI operations, perhaps through developing explainable AI systems. Furthermore, positioning AI as an assistive, rather than competitive, technology may help mitigate biases. Future research may explore how these biases manifest in other creative domains or tackle their psychological roots, potentially a reluctance to concede the human domain of creativity to AI.

Limitations

Potential limitations include the demographic restrictions associated with using MTurk and the scenarios specific to writing tasks. Expanding future studies to encompass diverse demographic samples and additional creative contexts could yield a more comprehensive understanding.

In conclusion, this study exposes a notable discrepancy in human evaluations of AI-generated text, rooted in bias rather than content quality. This insight into human judgment enriches the discourse on how AI technology is integrated and accepted across various domains of human activity.
