- The paper challenges the assumption that perplexity accurately reflects LLMs' long-text understanding by demonstrating its focus on local information.
- Experiments with models like YARN-7B-128K and LongLoRA-7B-100K reveal a disconnect between low perplexity scores and performance on downstream tasks.
- Results suggest the need for novel, multidimensional evaluation metrics that capture both local and global contextual comprehension in large language models.
Understanding the Limitations of Perplexity in Long-Text Understanding
Introduction
The paper "Can Perplexity Reflect LLM's Ability in Long Text Understanding?" addresses the prevalent assumption in AI research that perplexity (PPL) is a reliable metric for evaluating large language models' (LLMs') ability to handle long texts. The authors challenge this assumption by demonstrating that PPL primarily measures a model's capacity to capture local information rather than its understanding of long-range dependencies in text. This research underscores the need for more sophisticated metrics for assessing LLMs' comprehension of long texts.
Evaluation of Perplexity as a Metric
Perplexity is widely used to evaluate LLMs by measuring how well a model predicts a sample of text; a lower PPL indicates higher predictive accuracy. However, the paper argues that PPL is inadequate for evaluating long-text understanding because it fails to measure a model's ability to grasp long-range dependencies. Experiments were conducted with three long-context-window LLM variants on benchmarks covering question answering and summarization. The results showed no meaningful correlation between PPL scores and performance on these semantic tasks.
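To make the metric concrete, here is a minimal sketch of how perplexity is computed from per-token log-probabilities: it is the exponential of the average negative log-likelihood, so a model that assigns higher probability to each observed token scores a lower PPL. The probabilities below are illustrative values, not output from any model discussed in the paper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that is more confident about each token gets a lower PPL.
confident = [math.log(0.5)] * 8   # assigns p = 0.5 to every token
uncertain = [math.log(0.1)] * 8   # assigns p = 0.1 to every token
print(perplexity(confident))  # ≈ 2.0
print(perplexity(uncertain))  # ≈ 10.0
```

Note that nothing in this computation distinguishes where the predictive signal came from: a token predicted well from its immediate neighbours contributes exactly as much as one predicted from distant context, which is the crux of the paper's critique.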
Experimental Analysis
The study employed three model variants, YARN-7B-128K, Yi-6B-200K, and LongLoRA-7B-100K, each with a context window of 100K tokens or more. They were evaluated on both language modeling tasks and downstream tasks using datasets such as QMSum and NarrativeQA. For instance, YARN exhibited the lowest PPL yet did not perform best on the downstream tasks, where LongLoRA outperformed the other models. This inconsistency illustrates that a model's PPL does not necessarily track its long-text understanding abilities.
The investigation then tested the hypothesis that PPL predominantly reflects a model's handling of local information rather than its understanding of entire passages. An experiment with LLaMA2-7B, whose context is limited to 4,096 tokens, showed that it achieved PPL comparable to models with much longer context windows. These findings suggest that PPL indicates local coherence more than comprehension of extended context.
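The intuition behind this comparison can be sketched as follows. If each token's probability depends mainly on nearby tokens, then truncating the context window barely changes the measured PPL. The `toy_log_prob` function below is a made-up scorer that conditions only on the immediately preceding token, standing in for a real model; it is not the paper's setup, just an extreme illustration of why a 4K-context model can match long-context PPL.

```python
import math

def windowed_perplexity(tokens, log_prob, window):
    """PPL over a sequence where each token is scored from at most
    `window` preceding tokens. `log_prob(token, context)` is a
    hypothetical stand-in for a real model's scoring function."""
    nll = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - window):i]
        nll -= log_prob(tokens[i], context)
    return math.exp(nll / (len(tokens) - 1))

# Toy "model" that conditions only on the immediately preceding token.
def toy_log_prob(token, context):
    return math.log(0.5) if token == context[-1] else math.log(0.25)

seq = [1, 1, 2, 2, 1, 1, 2, 2]
short_ctx = windowed_perplexity(seq, toy_log_prob, window=4)
long_ctx = windowed_perplexity(seq, toy_log_prob, window=100_000)
# Identical PPL at any window size >= 1, because this toy scorer
# never uses anything beyond local context.
```

Here `short_ctx == long_ctx` exactly: extra context is invisible to a purely local predictor, mirroring the paper's finding that a 4,096-token LLaMA2-7B matches the PPL of 100K+-context models.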
Implications and Future Prospects
The findings highlight significant implications for AI research, particularly in designing evaluation metrics for LLMs. The inadequacy of PPL as a sole metric for evaluating long-text understanding requires the development of new metrics that account for both local and global contextual comprehension. This paper invites researchers to explore multidimensional evaluation approaches that better capture the nuanced capabilities of LLMs in processing extended text inputs.
Conclusion
In conclusion, the research presented in this paper challenges the conventional reliance on PPL as a comprehensive measure of an LLM's long-text processing capabilities. While effective for assessing local language modeling, PPL falls short as an indicator of comprehension in long-text contexts. This calls for the AI community to innovate and adopt a more diverse set of metrics to ensure robust evaluation of LLMs' understanding abilities, fostering advancements in language processing technologies.