MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

Published 16 Apr 2024 in cs.CL and cs.AI | (2404.10774v2)

Abstract: Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of fact-checking are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to a model to check a single response. In this work, we show how to build small fact-checking models that have GPT-4-level performance but for 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify datasets from recent work on fact-checking and grounding LLM generations into a new benchmark, LLM-AggreFact. Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.

Summary

The paper presents MiniCheck, an efficient fact-checking framework for LLM-generated content that rivals GPT-4’s accuracy at 400 times lower cost.
It employs a synthetic dataset enriched with factual errors and fine-tunes the Flan-T5 architecture to reliably verify grounding in text.
The study introduces LLM-AggreFact, a comprehensive benchmark on which MiniCheck’s balanced accuracy is evaluated across diverse domains.

Efficient and Effective Fact-Checking for Grounding LLM Generations

Introduction

LLMs generate fluent, contextually relevant text for tasks such as document summarization and dialogue generation. However, they often produce content that sounds plausible but is not supported by the available evidence, a phenomenon known as "hallucination." Detecting such unsupported content in a scalable, cost-effective way remains an open problem in NLP.

This work substantially reduces the computational and financial cost of LLM-based fact-checking without sacrificing accuracy. The authors use GPT-4 to construct a synthetic dataset of realistic yet challenging factual errors and train a much smaller model on it. The resulting system, MiniCheck, matches GPT-4's accuracy at 400 times lower cost.

Fact-Checking Model Integration

MiniCheck addresses the main limitation of prior LLM-based fact-checking: the cost of calling a large model many times to verify a single response. It is trained on synthetic data deliberately constructed to contain a range of factual errors. This data teaches the model to check each fact in a claim and to recognize when a claim synthesizes information from several sentences of the grounding document.

MiniCheck-FT5, the best-performing variant, is built on the Flan-T5 architecture and fine-tuned on the synthetic dataset together with standard entailment data. This combination lets the model handle the characteristics of LLM-generated text while retaining the general entailment-detection ability that fact-checking requires.
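As an illustration of how a small document-grounded checker might be applied to a full response, the sketch below splits the response into sentences, splits the grounding document into chunks, scores every (chunk, sentence) pair, and keeps the best chunk score per sentence. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the chunk size, the 0.5 threshold, and especially `toy_overlap_scorer` (a word-overlap stand-in for a trained model such as MiniCheck-FT5) are all hypothetical.

```python
import re
from typing import Callable, List, Tuple

# (document_chunk, claim) -> score in [0, 1]; a trained checker would go here.
Scorer = Callable[[str, str], float]


def split_sentences(text: str) -> List[str]:
    """Naive splitter on ., ! and ?; a real system would use a proper tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_words(document: str, max_words: int) -> List[str]:
    """Split the document into consecutive chunks of at most max_words words."""
    words = document.split()
    chunks = [" ".join(words[i:i + max_words])
              for i in range(0, len(words), max_words)]
    return chunks or [""]


def toy_overlap_scorer(chunk: str, claim: str) -> float:
    """HYPOTHETICAL placeholder: fraction of claim tokens that occur in the chunk."""
    def tokens(s: str) -> set:
        return set(re.findall(r"[a-z0-9]+", s.lower()))

    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(chunk)) / len(claim_tokens)


def check_response(document: str, response: str,
                   scorer: Scorer = toy_overlap_scorer,
                   max_words: int = 200,
                   threshold: float = 0.5) -> List[Tuple[str, float, bool]]:
    """Return (sentence, best score over chunks, supported?) for each sentence."""
    chunks = chunk_words(document, max_words)
    results = []
    for sentence in split_sentences(response):
        best = max(scorer(chunk, sentence) for chunk in chunks)
        results.append((sentence, best, best >= threshold))
    return results


document = ("The bridge opened in 1932. "
            "It spans the harbour and carries rail traffic.")
response = "The bridge opened in 1932. It was designed by a famous sculptor."

for sentence, score, supported in check_response(document, response):
    print(f"{supported!s:5} {score:.2f} {sentence}")
```

With the toy scorer, the first sentence is marked supported and the second is not. Swapping in a real model only requires passing a different `scorer` callable; the sentence and chunk loop stays the same, which is where the cost advantage of a small model shows up, since it is called once per (chunk, sentence) pair.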

LLM-AggreFact: A New Factual Evaluation Benchmark

To evaluate fact-checking models, including MiniCheck, the study introduces LLM-AggreFact, a benchmark that unifies datasets from recent work on fact-checking and grounding LLM generations. It covers domains ranging from healthcare to news and includes both closed-book and grounded generation settings, making it a demanding test for fact-checking systems.

On LLM-AggreFact, measured by balanced accuracy, MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches accuracy comparable to GPT-4, while being significantly faster and cheaper to run.
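Balanced accuracy is the mean of the recall on each class, here "supported" (1) and "unsupported" (0). It matters when the two classes are unevenly represented. The short sketch below computes it by hand; the labels are made-up illustrative values, not results from the paper.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average of recall on class 1 (supported) and class 0 (unsupported)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical data: 8 supported claims and 2 unsupported ones.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
always_supported = [1] * len(y_true)

# Plain accuracy would be 0.8 here, but balanced accuracy is 0.5.
print(balanced_accuracy(y_true, always_supported))  # 0.5
```

A checker that labels everything as supported gets no credit for the unsupported class, so its balanced accuracy is no better than chance even though its plain accuracy looks respectable.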

Implications and Future Directions

The findings presented carry both practical and theoretical implications for the development and deployment of LLMs. Practically, MiniCheck offers a viable solution for integrating robust fact-checking mechanisms into LLM applications without incurring prohibitive costs. Theoretically, the use of synthetic data for training fact-checkers opens new avenues for model training, particularly in scenarios where error types are complex and diverse.

As LLMs continue to evolve, efficient and effective fact-checking is likely to become more important. Future research may explore extending the MiniCheck approach to multilingual settings, addressing the challenge of multi-document reasoning for comprehensive fact-checking, and further optimizing the trade-off between model size, accuracy, and operational costs.

Conclusion

Through synthetic data generation and comprehensive benchmarking, this work advances fact-checking for LLM-generated content. MiniCheck demonstrates that accurate fact-checking does not require high computational cost, offering a practical option for researchers and practitioners who want to improve the reliability of LLM outputs across applications.