Know When To Stop: A Study of Semantic Drift in Text Generation

Published 8 Apr 2024 in cs.CL (arXiv:2404.05411v1)

Abstract: In this work, we explicitly show that modern LLMs tend to generate correct facts first, then "drift away" and generate incorrect facts later: this was occasionally observed but never properly measured. We develop a semantic drift score that measures the degree of separation between correct and incorrect facts in generated texts and confirm our hypothesis when generating Wikipedia-style biographies. This correct-then-incorrect generation pattern suggests that factual accuracy can be improved by knowing when to stop generation. Therefore, we explore the trade-off between information quantity and factual accuracy for several early stopping methods and manage to improve factuality by a large margin. We further show that reranking with semantic similarity can further improve these results, both compared to the baseline and when combined with early stopping. Finally, we try calling an external API to bring the model back to the right generation path, but do not get positive results. Overall, our methods generalize and can be applied to any long-form text generation to produce more reliable information, by balancing trade-offs between factual accuracy, information quantity and computational cost.

Summary

  • The paper introduces a novel Semantic Drift Score to quantitatively measure the loss of factual accuracy in text generated by language models.
  • It details mitigation strategies, including early stopping and resampling-then-reranking, to balance factual integrity with content volume.
  • Empirical tests on LLaMa2 variants reveal a recurring pattern of accuracy followed by inaccuracy, underscoring the need for intrinsic model improvements.

Semantic Drift in Text Generation: Measurement, Analysis, and Mitigation

Introduction: Defining Semantic Drift

Semantic drift in text generated by LLMs describes the divergence of the output from the intended subject matter, degrading relevance, coherence, or truthfulness. Although this phenomenon had been observed before, it had not been rigorously quantified prior to this study. Our research introduces a novel metric, the Semantic Drift (SD) score, to measure this drift, focusing in particular on the transition from correct to incorrect fact generation. Our findings indicate that, when generating Wikipedia-style biographies, several LLaMa2 variants exhibit significant semantic drift: they begin by generating correct facts and progressively deviate into inaccuracies.

Quantifying Semantic Drift

The essence of our approach is the Semantic Drift Score, designed to quantify how factual accuracy deteriorates over the course of a generated text. Using this metric, we observe a pronounced drift pattern across the tested LLMs, providing empirical grounds for mitigation strategies aimed at improving factual accuracy. Our experiments on mitigating semantic drift range from simple early stopping to more complex arrangements, such as resample-then-rerank pipelines and an ultimately unsuccessful attempt to use external API calls to rectify the drift.
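The paper's exact formulation is not reproduced here, but the intuition behind a drift score of this kind can be sketched as follows: given the generated facts in order, each labeled correct or incorrect, find the split point that best separates a correct prefix from an incorrect suffix. The function below is a minimal illustrative sketch under that assumption, not the authors' implementation.

```python
def semantic_drift_score(labels):
    """labels: list of booleans in generation order, True = fact is correct.

    Returns a score in [0, 1]. High values mean correct facts cluster early
    and incorrect facts cluster late, i.e. a sharp drift point exists.
    """
    n = len(labels)
    best = 0.0
    for split in range(1, n):
        head, tail = labels[:split], labels[split:]
        # Fraction of correct facts before the split point...
        correct_early = sum(head) / len(head)
        # ...and fraction of incorrect facts after it.
        incorrect_late = sum(1 for x in tail if not x) / len(tail)
        best = max(best, (correct_early + incorrect_late) / 2)
    return best
```

A perfectly separated sequence (all correct facts, then all incorrect ones) scores 1.0, while interleaved correct and incorrect facts score lower, indicating no clean stopping point.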

Implications of Semantic Drift

Our analysis extends to the implications of semantic drift for text generation quality, providing insights into how it manifests across various LLMs. Although model scaling improves overall factual accuracy, the persistence of semantic drift across scales points to a foundational challenge in the generative process of these models. The recurrent correct-then-incorrect pattern in generated facts marks a critical area for improving LM capabilities, and our methods offer practical, if foundational, strategies for mitigating it.

Mitigating Semantic Drift

Our exploration of mitigation strategies for semantic drift follows two paths: early stopping and resample-then-rerank. Early stopping, guided by the model's prediction confidence, shows promise in reducing inaccuracies, albeit at a cost in content volume. The resample-then-rerank strategy, which scores candidate sentences by semantic similarity, offers a way to maintain content volume while improving factual accuracy. By contrast, calling external APIs to reorient the model toward accurate generation paths proved minimally effective, pointing future work toward intrinsic model adjustments and predictive stopping measures.
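As a rough illustration of these two strategies (not the paper's implementation), the sketch below pairs a confidence-based stopping rule with a similarity-based reranker. Here `embed` is a toy bag-of-words stand-in for a real sentence encoder, and the 0.6 threshold is an arbitrary placeholder, both assumptions for the sake of the example.

```python
import math
from collections import Counter


def early_stop(sentence_confidences, threshold=0.6):
    """Return the index at which to stop generation: the first sentence
    whose model confidence (e.g. mean token probability) drops below
    the threshold. Everything from that index onward is discarded."""
    for i, conf in enumerate(sentence_confidences):
        if conf < threshold:
            return i
    return len(sentence_confidences)


def embed(text):
    # Toy bag-of-words "embedding"; a real pipeline would use a
    # sentence encoder such as a SentenceTransformer model.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rerank(context, candidates):
    """Resample-then-rerank step: among sampled candidate sentences,
    keep the one most semantically similar to the preceding context."""
    ctx = embed(context)
    return max(candidates, key=lambda c: cosine(ctx, embed(c)))
```

The trade-off described above is visible directly in `early_stop`: a higher threshold truncates earlier, keeping fewer but more reliable sentences, while a lower threshold preserves more content at the risk of drift.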

Future Directions

The study sets a precedent for quantitatively assessing semantic drift in text generation, offering methodologies that balance computational efficiency with factual accuracy. Looking ahead, further research into early stopping signals and refinement of reranking methodologies holds potential to advance the reliability of text generation models. Moreover, extending the analysis across diverse text genres and model architectures could unearth broader insights into the inherent mechanisms of semantic drift, catalyzing the development of more robust and accurate generative LLMs.

Concluding Remarks

In summary, this study not only quantitatively establishes the phenomenon of semantic drift in LLM-generated text but also introduces effective strategies for mitigating its impact. While the challenge of semantic drift remains substantial, our research provides a coherent framework and actionable insights to navigate this complexity, marking a significant step forward in the quest for reliable and accurate text generation.
