- The paper introduces ALR, a framework that retrieves topically relevant facts to enhance reasoning over long document contexts.
- It demonstrates significant performance gains on HotpotQA and SQuAD, achieving EM improvements of at least 8.4 and 7.9 respectively.
- The method effectively reduces hallucinations in long-context QA by segregating the retrieval and reasoning processes.
Retrieve-Then-Reason Framework for Long-Context Question Answering: An Expert Overview
The paper introduces ALR, a retrieve-then-reason methodology designed to improve LLM performance on long-context question-answering tasks. As LLMs have evolved to handle increasingly large context windows, their ability to reason effectively over those contexts has not kept pace. The paper identifies and addresses the performance degradation observed when LLMs face long contexts: the models are overwhelmed by excessive, widely distributed information, leading to flawed reasoning often marked by hallucinated "facts."
Key Contributions
- Identification of Degradation in Long-Context Reasoning: The authors conduct a preliminary study showing that as context length grows, LLMs' reasoning accuracy degrades faster than their retrieval accuracy. This exposes a critical bottleneck for practical applications that require reasoning over long contexts.
- Introduction of ALR: The paper introduces the ALR framework, which effectively bridges the gap by first retrieving topically relevant information and then reasoning over this curated subset. This two-stage approach aligns LLMs with distinct retrieval and reasoning objectives, notably enhancing their ability to filter pertinent details and avoid hallucinations.
- Performance Evaluation: Extensive experiments on modified versions of the HotpotQA and SQuAD datasets demonstrate ALR's efficacy. It surpasses existing methodologies with EM gains of at least 8.4 on HotpotQA and 7.9 on SQuAD, and outperforms competitive baselines by margins of at least 23.4 EM and 12.7 EM, respectively.
Methodology
The ALR approach is grounded in a RAG-inspired formulation in which retrieval is not an isolated task but a complementary step integrated with the reasoning objective. The method proceeds in two stages:
- Retrieval Phase: Explicitly extracts relevant facts from long contexts.
- Reasoning Phase: Processes these facts to generate well-supported answers, thereby mitigating issues related to excessive information and hallucination.
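The two-stage structure above can be sketched as a pair of separate model calls with distinct objectives. This is a minimal illustrative sketch, not the paper's implementation: `call_llm` is a placeholder for any text-completion interface, and the prompts and function names are assumptions.

```python
# Minimal retrieve-then-reason sketch. `call_llm` stands in for any LLM
# interface; prompts and names are illustrative, not taken from the paper.

def call_llm(prompt: str) -> str:
    """Stub model call; replace with a real LLM client."""
    raise NotImplementedError

def retrieve_facts(context: str, question: str, llm=call_llm) -> list[str]:
    """Phase 1: extract only the facts relevant to the question."""
    prompt = (
        "Extract the facts from the context that are relevant to the question.\n"
        f"Context:\n{context}\n\nQuestion: {question}\n"
        "Return one fact per line."
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def reason_over_facts(facts: list[str], question: str, llm=call_llm) -> str:
    """Phase 2: answer from the curated facts, not the full context."""
    prompt = (
        "Answer the question using only the facts below.\n"
        "Facts:\n" + "\n".join(f"- {f}" for f in facts)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt).strip()

def answer(context: str, question: str, llm=call_llm) -> str:
    """Retrieve-then-reason: two calls with separate objectives, so the
    reasoning step never sees the raw long context."""
    facts = retrieve_facts(context, question, llm)
    return reason_over_facts(facts, question, llm)
```

The key design point is that the reasoning prompt receives only the curated fact list, which is how this family of methods limits the model's exposure to distracting context.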
Experimental Insights
Experiments demonstrate robust performance across context lengths ranging from 4K to 128K tokens, with consistent results supporting ALR's utility in long-context scenarios. Improved retrieval accuracy and reduced hallucination rates were the key metrics differentiating ALR from alternative approaches such as direct-answering and command-based methods.
Implications and Future Directions
The ALR framework represents an important step toward improving LLMs' long-context capabilities. Its implications are multifaceted:
- Practical Applications: ALR strengthens document analysis, multi-hop reasoning, and agents that must maintain long-lived context, enabling broader and more reliable use of LLMs in real-world tasks.
- Theoretical Developments: By separating and aligning the retrieval and reasoning objectives, the approach sheds light on how modular task decomposition improves LLM behavior, inviting further exploration in that direction.
Looking forward, refining retrieval granularity and extending ALR to tasks such as summarization could further improve LLM proficiency. Addressing these limitations and developing task-specific retrieval strategies will be pivotal to advancing LLMs' ability to handle complex, context-rich queries.