- The paper introduces a model-based metric that leverages a large-scale Wikipedia and Wikidata dataset to assess generated text's factual accuracy.
- It employs Transformer-based methods, including relation classification and an end-to-end extraction model, to obtain structured fact tuples.
- The evaluation shows the end-to-end model correlates better with human judgment than traditional metrics like ROUGE in summarization tasks.
Assessing The Factual Accuracy of Generated Text: A Technical Overview
The paper "Assessing The Factual Accuracy of Generated Text" introduces an approach to evaluating the factual accuracy of generated text that goes beyond surface-overlap metrics such as ROUGE and BLEU. The authors propose a model-based metric built on a large-scale dataset derived from Wikipedia and Wikidata, which supports both relation classification and end-to-end fact extraction. The work centers on evaluating factual correctness in text summarization, where the accuracy of the information presented is paramount.
Methodology
The paper outlines several methods for extracting factual data from text:
- Relation Classifier: This method first applies named entity recognition (NER) and then classifies the relation between each pair of identified entities. It employs a Transformer-based architecture, using attention over the input text to predict the relation holding between the entity pair.
- End-to-End Extraction Model: To circumvent potential errors accumulating across multiple stages of the fact extraction pipeline, the authors propose an end-to-end model using a sequence-to-sequence Transformer architecture. This model extracts structured fact tuples directly from input text without intermediary stages like entity recognition, thereby generating more consistent and reliable outputs.
- Binary Relation Classifier: A binary classification approach determines whether two entities are related, focusing on the presence or absence of a relation rather than the specific type, providing flexibility in auditing text factuality.
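All three extractors produce the same structured output: (subject, relation, object) fact tuples. For a sequence-to-sequence model like the end-to-end extractor, those tuples must be linearized into a flat target string and parsed back out of the decoded text. The sketch below illustrates the general idea; the delimiter tokens `<S>` and `|` are illustrative assumptions, not the paper's actual serialization scheme.

```python
# Minimal sketch: linearizing (subject, relation, object) fact tuples
# for a sequence-to-sequence extractor, and parsing them back.
# Delimiter tokens are illustrative assumptions, not the paper's scheme.
from typing import List, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)

SEP = " <S> "   # hypothetical separator between tuples
FLD = " | "     # hypothetical separator between tuple fields

def serialize(facts: List[Fact]) -> str:
    """Flatten fact tuples into a single target string for seq2seq training."""
    return SEP.join(FLD.join(f) for f in facts)

def deserialize(text: str) -> List[Fact]:
    """Parse a decoded string back into fact tuples, skipping
    malformed fragments rather than failing on them."""
    facts = []
    for chunk in text.split(SEP.strip()):
        parts = [p.strip() for p in chunk.split(FLD.strip())]
        if len(parts) == 3 and all(parts):
            facts.append((parts[0], parts[1], parts[2]))
    return facts

facts = [("Ada Lovelace", "field of work", "mathematics"),
         ("Ada Lovelace", "father", "Lord Byron")]
assert deserialize(serialize(facts)) == facts
```

A round-trip like this also makes the model's output robust to partial decoding failures: a malformed tuple is dropped rather than corrupting the whole extraction.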
The dataset supporting these models is constructed using distant supervision employing Wikidata as the underlying knowledge base. The dataset is larger and more diverse than previous datasets, encompassing multiple domains and relation types.
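The distant-supervision construction can be sketched as follows: a sentence is labeled with a knowledge-base relation whenever both entities of a known triple appear in it. The toy triples and string-matching rule below are a simplified illustration of this general heuristic, not the paper's exact pipeline, which relies on Wikidata at scale and proper entity linking.

```python
# Minimal distant-supervision sketch: label sentences with KB relations
# when both entities of a known triple co-occur in the sentence.
# The triples here are toy examples; a real pipeline would query
# Wikidata and use entity linking instead of substring matching.
from typing import List, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)

KB: List[Fact] = [
    ("Marie Curie", "award received", "Nobel Prize in Physics"),
    ("Marie Curie", "country of citizenship", "Poland"),
]

def distant_labels(sentence: str, kb: List[Fact]) -> List[Fact]:
    """Return every KB triple whose subject and object both appear
    in the sentence -- a noisy but highly scalable labeling rule."""
    return [(s, r, o) for (s, r, o) in kb
            if s in sentence and o in sentence]

sent = "Marie Curie was born in Poland and later moved to Paris."
print(distant_labels(sent, KB))
# → [('Marie Curie', 'country of citizenship', 'Poland')]
```

The noise inherent in this rule (co-occurrence does not guarantee the relation is expressed) is exactly why the resulting labels are called "distant" supervision.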
Evaluation
The proposed model-based metrics were tested against baselines including ROUGE and an OpenIE-based extraction metric. In human evaluation experiments, where annotators rated the factual accuracy of text summaries, the end-to-end model's correlation with human judgment proved superior to that of both ROUGE and the other model-based metrics.
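The comparison underlying such a metric can be sketched as tuple-level precision: extract fact tuples from the source (or reference) text and from the generated summary, then score the fraction of the summary's facts that are supported by the source's. The function below is a simplified stand-in for the paper's metric, with hand-written tuples in place of model extractions.

```python
# Sketch of a tuple-level factual accuracy score: the fraction of facts
# asserted by a summary that are supported by facts from the source.
# Hand-written tuples stand in for model-extracted ones here.
from typing import Set, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)

def factual_accuracy(summary_facts: Set[Fact],
                     source_facts: Set[Fact]) -> float:
    """Tuple-level precision of the summary's facts against the source."""
    if not summary_facts:
        # Assumption: a summary asserting no facts is vacuously accurate.
        return 1.0
    return len(summary_facts & source_facts) / len(summary_facts)

source = {("Berlin", "capital of", "Germany"),
          ("Berlin", "population", "3.7 million")}
faithful = {("Berlin", "capital of", "Germany")}
hallucinated = {("Berlin", "capital of", "France")}

assert factual_accuracy(faithful, source) == 1.0
assert factual_accuracy(hallucinated, source) == 0.0
```

Exact tuple matching is the simplest choice; a production metric would need to tolerate paraphrased entity mentions and relation aliases.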
Results from experiments revealed that:
- The End-to-End model outperformed traditional classifiers in terms of precision while providing structured outputs suitable for logical reasoning.
- The factual accuracy metric derived from the End-to-End model exhibited higher correlation with human judgments than ROUGE scores, especially in domains, such as articles about actors, where categorical relations are common.
Implications
From a practical standpoint, the ability to automatically and accurately assess the factual content of generated text has significant implications for AI-driven content generation systems, especially those involved in summarization. Developers and researchers can now employ a factual accuracy metric that is more closely aligned with human judgment.
Theoretically, this work shapes future studies by emphasizing the importance of factual correctness in natural language generation and paving the way for robust end-to-end relationship extraction models in various domains, potentially extending to dialogue systems and real-time information synthesis.
Future Directions
The authors recognize limitations pertaining to the dataset's dependency on Wikipedia's writing style and Wikidata's completeness, suggesting avenues for enriching the labeling scheme and incorporating diverse text sources. Further development could aim at extending the model's applicability across languages and domains, and enhancing the knowledge base's breadth.
Ultimately, advancing the methodologies for fact extraction and evaluation in NLP systems is crucial for generating text that users can trust. This paper sets the groundwork for such methodologies, offering tools and data that are open for expansion and improvement in subsequent research efforts.