
Lost in the Middle: How Language Models Use Long Contexts

Published 6 Jul 2023 in cs.CL (arXiv:2307.03172v3)

Abstract: While recent LLMs have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of LLMs on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing the position of relevant information, indicating that current LLMs do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models. Our analysis provides a better understanding of how LLMs use their input context and provides new evaluation protocols for future long-context LLMs.

Citations (1,001)

Summary

  • The paper reveals a U-shaped performance curve, showing that information in the middle of long texts is underutilized compared to the boundaries.
  • The paper shows that encoder-decoder models remain robust to the position of relevant information at sequence lengths seen during training, but lose this advantage on longer contexts.
  • The paper demonstrates that query-aware contextualization can boost performance in retrieval tasks while instruction fine-tuning offers limited mitigation.

Overview

Large language models (LLMs) have become a cornerstone of many applications within Artificial Intelligence. With the advent of models that can parse and generate natural language, the scope of applications has expanded tremendously. However, one critical aspect remains under-explored: how these models make use of long input contexts, given their ability to process thousands of tokens at once. The study by Liu et al. sheds light on this question, providing insights that could shape future development in the field.

Understanding Model Performance Across Contexts

The study analyzes the performance of several state-of-the-art LLMs on two tasks that require identifying relevant information within the input: multi-document question answering and key-value retrieval. The central finding is as striking as it is concerning: performance drops significantly when the relevant information sits in the middle of the input context. The effect holds across models, including those explicitly designed to handle long contexts.
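The key-value retrieval task is a purely synthetic probe: the model sees a JSON object of random string pairs and must return the value for one designated key, with the queried pair placed at a controlled position. The sketch below illustrates how such a probe can be constructed; the exact prompt wording and the `build_kv_prompt` helper are illustrative, not the paper's verbatim setup.

```python
import json
import uuid

def build_kv_prompt(num_pairs: int, gold_position: int) -> tuple[str, str, str]:
    """Build a synthetic key-value retrieval prompt.

    A JSON object of random UUID pairs is shown to the model, which must
    return the value for one designated key. `gold_position` controls where
    the queried pair sits inside the serialized object.
    """
    pairs = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(num_pairs)]
    gold_key, gold_value = pairs[gold_position]
    data = json.dumps(dict(pairs), indent=1)  # dicts preserve insertion order
    prompt = (
        "Extract the value corresponding to the specified key "
        "from the JSON object below.\n\n"
        f"{data}\n\n"
        f'Key: "{gold_key}"\nCorresponding value:'
    )
    return prompt, gold_key, gold_value
```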

The analysis reveals a distinctive U-shaped curve in model performance: models do best when the relevant information appears at the beginning or end of the input context. This points to a primacy and recency bias in these models, exposing a significant gap in their ability to use information uniformly throughout the input.
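One way to reproduce this curve is to sweep the position of the document containing the answer among fixed distractors and record accuracy at each position. A minimal sketch follows, assuming a `generate(prompt)` callable for whichever model is being probed; the prompt template and the substring-containment check are simplifying assumptions.

```python
def accuracy_by_position(question, gold_doc, distractors, generate, answer):
    """Sweep the gold document's position and record answer accuracy.

    `generate` is any callable mapping a prompt string to model output;
    `answer` is the reference answer, scored with a simple containment check.
    """
    results = {}
    for pos in range(len(distractors) + 1):
        docs = distractors[:pos] + [gold_doc] + distractors[pos:]
        context = "\n\n".join(
            f"Document [{i + 1}]: {d}" for i, d in enumerate(docs)
        )
        prompt = (
            "Write a concise answer using only the provided documents.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:"
        )
        results[pos] = answer.lower() in generate(prompt).lower()
    return results  # accuracy per gold-document position
```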

Delving Deeper Into Model Capabilities

Further investigation into factors such as model architecture (encoder-decoder vs. decoder-only), query-aware contextualization, and instruction fine-tuning yields more nuanced insights. Encoder-decoder models, for instance, are relatively robust to the position of relevant information, but only at sequence lengths encountered during training; on longer sequences this robustness disappears and the U-shaped performance curve re-emerges.

Query-aware contextualization shows promise, particularly on key-value retrieval, indicating that how information is presented to the model (for example, placing the query both before and after the context) can substantially improve performance. Instruction fine-tuning, by contrast, does little to mitigate the observed biases, suggesting that the root causes are more deeply ingrained in the models' architecture or training methodology.
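Concretely, query-aware contextualization repeats the query on both sides of the long context, so that decoder-only models, which can only attend to earlier tokens, already have the query in scope while encoding the documents. A minimal sketch; the template is an assumption, not the paper's exact wording:

```python
def query_aware_prompt(query: str, context: str) -> str:
    """Place the query both before and after the long context.

    Stating the query up front lets a decoder-only model condition on it
    while processing the context; repeating it at the end preserves the
    conventional question-last prompt ending.
    """
    return f"{query}\n\n{context}\n\n{query}"
```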

Practical Implications and Future Directions

The empirical findings carry significant implications for deploying LMs in real-world settings. In open-domain question answering, for instance, the study makes a striking observation: the accuracy of LLM-based readers saturates well before the recall of the retriever does. Feeding the reader more retrieved documents therefore yields diminishing returns, challenging the assumption that more context invariably means better performance.
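In practice this can be checked by growing the number of retrieved documents k and comparing reader accuracy against retriever recall@k: the two curves diverge once the reader saturates. A minimal sketch under the assumption of hypothetical `retrieve`, `read`, and `is_correct` helpers and a simple question record with `.text` and `.answer` fields:

```python
def saturation_curve(questions, retrieve, read, is_correct):
    """Compare retriever recall@k with reader accuracy as k grows.

    `retrieve(q, k)` returns k ranked documents, `read(q, docs)` returns
    a model answer, and `is_correct` scores an answer against the gold one.
    Reader accuracy typically flattens long before recall@k does.
    """
    curve = []
    for k in (1, 5, 10, 20, 30, 50):
        recall = acc = 0.0
        for q in questions:
            docs = retrieve(q.text, k)
            # Recall proxy: does any retrieved document contain the answer?
            recall += any(q.answer.lower() in d.lower() for d in docs)
            acc += is_correct(read(q.text, docs), q.answer)
        n = len(questions)
        curve.append((k, recall / n, acc / n))
    return curve  # list of (k, recall@k, reader accuracy)
```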

Concluding Thoughts

The study by Liu et al. provides critical insight into how LLMs use long contexts, documenting substantial positional biases and inefficiencies. These findings underscore the limitations of current models in processing information uniformly across lengthy inputs, and they chart a path for future research aimed at addressing these challenges. Going forward, understanding and improving how LMs leverage their input context will be central to unlocking their full potential across a wide range of applications.
