Groundedness in Retrieval-augmented Long-form Generation: An Empirical Study

Published 10 Apr 2024 in cs.CL and cs.LG | (2404.07060v1)

Abstract: We present an empirical study of groundedness in long-form question answering (LFQA) by retrieval-augmented LLMs. In particular, we evaluate whether every generated sentence is grounded in the retrieved documents or the model's pre-training data. Across 3 datasets and 4 model families, our findings reveal that a significant fraction of generated sentences are consistently ungrounded, even when those sentences contain correct ground-truth answers. Additionally, we examine the impacts of factors such as model size, decoding strategy, and instruction tuning on groundedness. Our results show that while larger models tend to ground their outputs more effectively, a significant portion of correct answers remains compromised by hallucinations. This study provides novel insights into the groundedness challenges in LFQA and underscores the necessity for more robust mechanisms in LLMs to mitigate the generation of ungrounded content.

Summary

The paper provides an empirical analysis showing that even in the largest models studied, such as Falcon 180B, up to 25% of seemingly correct outputs are derived from hallucinated content.
The study assesses decoding strategies and instruction tuning, finding that beam search yields more grounded responses than greedy decoding or nucleus sampling, and that instruction tuning also improves groundedness.
The research highlights the need for enhanced retrieval-augmented frameworks and fine-tuning techniques to mitigate hallucinations in long-form QA applications.

This paper methodically examines groundedness in retrieval-augmented LLMs on long-form question answering (LFQA). It asks whether LLMs faithfully ground their generated responses in the provided documents, or whether they hallucinate even when their answers match the ground truth.

The study evaluates the grounding of individual sentences in model outputs across 3 datasets and 4 model families, distinguishing grounding in the retrieved documents from grounding in the pre-training data. Notably, even when a generated sentence contains a correct ground-truth answer, a considerable share of such sentences are supported by neither the retrieved documents nor the pre-training data. This raises questions about the internal mechanisms that enable or inhibit grounding in such models.
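As a rough illustration of the two sentence-level quantities discussed above, the sketch below computes the share of all sentences that are ungrounded and the share of answer-bearing sentences that are ungrounded. The `SentenceJudgment` record, the `groundedness_rates` function, and the example labels are hypothetical and made up for illustration; the paper's actual annotation pipeline and metric definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class SentenceJudgment:
    """One generated sentence with two (hypothetical) annotations."""
    text: str
    grounded: bool         # supported by a retrieved or pre-training document
    contains_answer: bool  # mentions a ground-truth answer string


def groundedness_rates(judgments):
    """Return (ungrounded share of all sentences,
    ungrounded share of sentences that contain a correct answer)."""
    if not judgments:
        raise ValueError("need at least one sentence")
    overall = sum(not j.grounded for j in judgments) / len(judgments)
    with_answer = [j for j in judgments if j.contains_answer]
    if with_answer:
        among_correct = sum(not j.grounded for j in with_answer) / len(with_answer)
    else:
        among_correct = None  # undefined when no sentence contains an answer
    return overall, among_correct


# Made-up labels for four placeholder sentences.
judgments = [
    SentenceJudgment("sentence 1", grounded=True, contains_answer=True),
    SentenceJudgment("sentence 2", grounded=False, contains_answer=True),
    SentenceJudgment("sentence 3", grounded=True, contains_answer=False),
    SentenceJudgment("sentence 4", grounded=False, contains_answer=False),
]
overall, among_correct = groundedness_rates(judgments)
print(overall, among_correct)  # 0.5 0.5
```

Reporting the second rate separately matters because a sentence can state the right answer and still lack support in any document.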

A key finding is that larger models generally produce more grounded responses. Model size alone, however, does not eliminate ungrounded statements. The issue persists even in the largest models explored, such as Falcon 180B, where up to 25% of seemingly correct outputs are derived from hallucinated content. Practitioners aiming to improve LLM reliability therefore need strategies beyond increasing model size.

The study also assesses how decoding strategy and instruction tuning affect groundedness. Beam search decoding, unlike greedy decoding or nucleus sampling, consistently yielded outputs that are more anchored in the provided context, which suggests that it may favor content aligned with the source materials. Instruction tuning also appears to help, improving groundedness considerably.
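For readers less familiar with the three decoding strategies compared here, the following is a minimal sketch of how they pick tokens. It assumes a hypothetical toy next-token table (`TOY_LM`) rather than a real LLM, and all function names are illustrative. It shows only the mechanics of each strategy; it does not reproduce or demonstrate the paper's groundedness results.

```python
import math
import random

# Hypothetical toy "language model": next-token probabilities given the last token.
TOY_LM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"bird": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
}


def greedy_decode(lm, start="<s>", end="</s>", max_len=10):
    """Always take the single most probable next token."""
    tokens, prob = [start], 1.0
    while tokens[-1] != end and len(tokens) < max_len:
        dist = lm[tokens[-1]]
        tok = max(dist, key=dist.get)
        prob *= dist[tok]
        tokens.append(tok)
    return tokens, prob


def beam_decode(lm, width=2, start="<s>", end="</s>", max_len=10):
    """Keep the `width` most probable partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens[-1] == end:
                candidates.append((tokens, logp))  # finished beams carry over
                continue
            for tok, p in lm[tokens[-1]].items():
                candidates.append((tokens + [tok], logp + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:width]
        if all(tokens[-1] == end for tokens, _ in beams):
            break
    tokens, logp = beams[0]
    return tokens, math.exp(logp)


def nucleus_filter(dist, top_p):
    """Keep the smallest set of top tokens with total mass >= top_p, renormalized."""
    kept, mass = {}, 0.0
    for tok, p in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept.items()}


def nucleus_decode(lm, top_p=0.9, seed=0, start="<s>", end="</s>", max_len=10):
    """Sample each next token from the nucleus-filtered distribution."""
    rng = random.Random(seed)
    tokens = [start]
    while tokens[-1] != end and len(tokens) < max_len:
        dist = nucleus_filter(lm[tokens[-1]], top_p)
        tokens.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return tokens


greedy_tokens, greedy_prob = greedy_decode(TOY_LM)
beam_tokens, beam_prob = beam_decode(TOY_LM, width=2)
print(greedy_tokens, round(greedy_prob, 2))  # ['<s>', 'the', 'cat', '</s>'] 0.33
print(beam_tokens, round(beam_prob, 2))      # ['<s>', 'a', 'bird', '</s>'] 0.36
print(nucleus_decode(TOY_LM, top_p=0.9))     # varies with the seed
```

In this toy table, greedy decoding commits to the locally best first token and ends with a lower-probability sequence than beam search, while nucleus sampling can return different sequences from run to run. Whether these differences translate into more grounded output is an empirical question that the paper answers with real models.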

Methodologically, the authors adopt a mixed retrieval strategy that combines retrieval from external documents with a post-generation search across the pre-training corpus. An inference-based grounding model then checks whether each model output is supported by either the retrieved documents or documents from the pre-training corpus, which ties the analysis of groundedness to the model's pre-training data.
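The two-source check described above might be sketched as follows. This is an assumption-laden illustration, not the authors' implementation: `overlap_support` is a crude word-overlap placeholder standing in for an inference-based grounding model, `search_corpus` is a hypothetical callable for the post-generation search, and the threshold, the sentence splitter, and the order of the two checks are all arbitrary choices made for the example.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_support(premise, hypothesis):
    """Placeholder scorer: fraction of hypothesis words that occur in the premise.
    A real setup would call an entailment-style grounding model here instead."""
    words = re.findall(r"\w+", hypothesis.lower())
    premise_words = set(re.findall(r"\w+", premise.lower()))
    if not words:
        return 0.0
    return sum(w in premise_words for w in words) / len(words)


def label_sentences(answer, retrieved_docs, search_corpus,
                    support=overlap_support, threshold=0.8):
    """Label each sentence as 'retrieved', 'pretraining', or 'ungrounded'."""
    labels = []
    for sentence in split_sentences(answer):
        # First check the documents that were given to the model.
        if any(support(doc, sentence) >= threshold for doc in retrieved_docs):
            labels.append((sentence, "retrieved"))
            continue
        # Otherwise run a post-generation search over the pre-training corpus.
        hits = search_corpus(sentence)
        if any(support(doc, sentence) >= threshold for doc in hits):
            labels.append((sentence, "pretraining"))
        else:
            labels.append((sentence, "ungrounded"))
    return labels


# Made-up toy data; the "corpus search" simply returns every corpus document.
retrieved = ["the red box holds three keys"]
corpus = ["the blue box is empty"]
answer = "The red box holds three keys. The blue box is empty. The green box is locked."
labels = label_sentences(answer, retrieved, search_corpus=lambda query: corpus)
for sentence, label in labels:
    print(label, "|", sentence)
# retrieved | The red box holds three keys.
# pretraining | The blue box is empty.
# ungrounded | The green box is locked.
```

Passing the scorer and the search function as parameters keeps the labeling loop independent of any particular grounding model or search index.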

On broader implications, the paper emphasizes the need for more robust mechanisms to mitigate hallucination in LLMs. More sophisticated retrieval-augmented frameworks or fine-tuning strategies that improve sentence-level alignment with factual data could significantly benefit applications that demand high veracity, such as academic research synthesis and automated Q&A systems.

Looking forward, the research opens avenues for exploring specialized decoding strategies or post-processing corrections that check grounding and so counteract the limitations of existing LLMs. Practitioners may benefit from exploring adaptive models that verify grounding dynamically during generation rather than post hoc.

In conclusion, this study offers empirical evidence on the challenges of generating grounded content with retrieval-augmented LLMs. It underscores the need for continued research and development in this area so that powerful LLMs can be deployed dependably in real-world applications.