ReadAgent: Efficient Long-Context QA
- ReadAgent is a prompting-based agent that decomposes long documents into manageable semantic units and compresses them into gist memories to overcome context window limitations.
- It employs parallel and sequential retrieval modes to selectively re-read and refresh relevant episodes, balancing accuracy with token efficiency.
- Empirical evaluations on benchmarks like QuALITY and NarrativeQA demonstrate its superior performance and effective compression compared to traditional RAG methods.
ReadAgent is a prompting-based agent architecture designed to solve long-context question answering (QA) and reading comprehension tasks using LLMs with limited context windows. Developed as a response to the limitations of both standard LLMs and traditional retrieval-augmented generation (RAG) pipelines, ReadAgent decomposes long documents into manageable semantic units, maintains compressed gist memories, and interactively re-reads selected passages for detail retrieval. By tightly orchestrating memory formation and selective raw-text retrieval, it extends practical context length by an order of magnitude, while minimizing reliance on external retrievers or vector databases. ReadAgent’s approach, inspired by human episodic memory and selective re-reading strategies, has been extensively benchmarked—both as a competitive baseline and a point of comparison for newer agentic or graph-based reading systems (Lee et al., 2024, Laitenberger et al., 4 Jun 2025, Li et al., 2024).
1. System Architecture and Workflow
ReadAgent comprises four main phases—episode pagination, memory gisting, interactive look-up, and answer generation. Pagination is handled via LLM prompt engineering: the model reads the source document in sliding windows and is prompted to propose “natural” break points, such as scene or section boundaries. Each resulting segment, or “page,” forms the basis for a memory episode.
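The pagination loop can be sketched as follows. Here `choose_break` is a hypothetical stand-in for the LLM call that picks a natural boundary inside each window; the fallback stub simply splits windows in half, and all names are illustrative rather than from the paper:

```python
# Sketch of ReadAgent-style episode pagination over a sliding window.
# `choose_break` stands in for an LLM prompt that returns the index of a
# natural break point within the current window of paragraphs.

def paginate(paragraphs, window=6, choose_break=None):
    """Split `paragraphs` into pages at (LLM-)chosen break points."""
    if choose_break is None:
        # Fallback stub: break each window in the middle.
        choose_break = lambda window_paras: len(window_paras) // 2
    pages, start = [], 0
    while start < len(paragraphs):
        end = min(start + window, len(paragraphs))
        if end == len(paragraphs):
            pages.append(paragraphs[start:end])  # last window: take the rest
            break
        # Ask the "LLM" where the most natural boundary falls in this window;
        # force at least one paragraph of progress per page.
        cut = start + max(1, choose_break(paragraphs[start:end]))
        pages.append(paragraphs[start:cut])
        start = cut
    return pages

paras = [f"para-{i}" for i in range(10)]
pages = paginate(paras, window=4)
```

Every paragraph lands in exactly one page, so the concatenation of all pages reconstructs the document in order.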
For each episode p_i, the LLM generates a shortened abstractive summary termed a “gist” (g_i); the ordered gists together form a compressed memory buffer (the gist memory). This enables fitting multi-thousand-token documents into the LM’s context window.
At inference (query) time, the model receives:
- The query q
- The ordered gist memory (g_1, …, g_n)
It is instructed, via prompting, to select 1–6 episodes to “look up” (i.e., substitute the short gist with the full original text for those episodes) based on perceived relevance to q. The final LM input is a hybrid prompt: the gist sequence with each selected g_i replaced by its full page p_i, concatenated in the original document order.
ReadAgent supports two operational modes:
- Parallel retrieval (ReadAgent-P): the LM selects all look-up pages in one pass.
- Sequential retrieval (ReadAgent-S): the LM repeatedly selects and refreshes one new page at a time until signaling “STOP,” potentially increasing recall for complex queries (Lee et al., 2024).
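The two modes can be sketched as below, assuming a hypothetical `select` function that stands in for the LLM's page-selection prompt (it sees the query, the gists, and the pages already expanded, and returns page ids, or `None` where the model would answer “STOP”). The stubs are purely illustrative:

```python
# Sketch of ReadAgent's two look-up modes.

def lookup_parallel(query, gists, select, max_pages=6):
    # ReadAgent-P: one pass; the LLM names all pages to expand at once.
    return select(query, gists, already=[])[:max_pages]

def lookup_sequential(query, gists, select, max_pages=6):
    # ReadAgent-S: one new page per step until the model signals STOP,
    # so later picks can condition on pages already expanded.
    chosen = []
    for _ in range(max_pages):
        pick = select(query, gists, already=chosen)
        if pick is None:          # the model answered "STOP"
            break
        chosen.append(pick)
    return chosen

# Deterministic stubs for illustration:
gists = ["gist-0", "gist-1", "gist-2", "gist-3"]
par = lookup_parallel("q", gists, lambda q, g, already: [0, 2])
schedule = iter([2, 0, None])
seq = lookup_sequential("q", gists, lambda q, g, already: next(schedule))
```

The sequential loop trades extra model calls for the chance to revise the look-up set mid-stream, matching the paper's observation that ReadAgent-S can raise recall on harder queries.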
2. Memory Formation and Compression
Episode formation is mediated by the LLM itself, which is prompted at each sliding window to pick breakpoints at meaningful narrative or topical boundaries. Empirical ablations confirm that LLM-chosen segmentation yields superior coherence and slightly higher accuracy versus uniform-length chopping (86.63% vs. 85.71% on QuALITY) (Lee et al., 2024).
Each episode is then summarized via a concise LLM prompt (“Shorten the following passage…”), with the result prepended by a page label. This process achieves substantial compression rates, where CR = (1 − compressed tokens / original tokens) × 100% measures the fraction of tokens removed:
On QuALITY, gist-only memory yields a CR of 84.24%, enabling storage of approximately three times more document text within the model’s effective context window. The hybrid approach (gist plus selective page substitutions) typically achieves CRs in the 60–70% range.
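Assuming CR measures the fraction of tokens removed, the arithmetic is a one-liner; the token counts below are illustrative, chosen to reproduce the 84.24% figure:

```python
def compression_rate(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage of tokens removed: CR = (1 - compressed/original) * 100."""
    return (1.0 - compressed_tokens / original_tokens) * 100.0

# e.g. a 10,000-token document gisted down to 1,576 tokens:
cr = compression_rate(10_000, 1_576)
```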
Episodes may be merged further in high-compression domains using an additional LLM merge prompt (Lee et al., 2024).
3. Interactive Retrieval and Answer Construction
Unlike classical retrieval-augmented methods, ReadAgent does not use independent vector retrieval, similarity scoring, or embedding models. Instead, given a query, the LLM receives the full gist sequence and is prompted to choose the minimal set of pages necessary to “refresh” memory for the task.
For each retrieval step, the model is instructed not to select more pages than necessary, serving as a built-in form of budget control. The hybrid prompt—gist memory with raw full text swapped in for the selected episodes—is presented to the LLM, which is then tasked with answer generation.
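Hybrid-prompt assembly can be sketched as follows, assuming parallel lists of pages and gists plus a set of selected page ids; the tag format and names are illustrative, not the paper's exact prompt:

```python
# Sketch of hybrid-prompt construction: selected pages appear as full text,
# all others as gists, preserving original document order.

def hybrid_prompt(query, pages, gists, selected):
    parts = []
    for i, (page, gist) in enumerate(zip(pages, gists)):
        body = page if i in selected else gist
        parts.append(f"<Page {i}>\n{body}")
    return "\n".join(parts) + f"\n\nQuestion: {query}"

prompt = hybrid_prompt("Who?", ["full-0", "full-1", "full-2"],
                       ["gist-0", "gist-1", "gist-2"], {1})
```

Because expansion is a positional substitution rather than a re-ranking, narrative order is preserved for free, which the paper credits for coherent downstream answer generation.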
Comparative baselines include:
- BM25 over term vectors (no neural encoding).
- Neural embedding (e.g., Gemini API, 1024-D vectors) with dot-product scoring.
These baselines retrieve the top-k relevant pages by external scoring, whereas ReadAgent’s retrieval is end-to-end LLM-mediated (Lee et al., 2024).
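For illustration, a toy top-k retriever in the spirit of these baselines; the term-overlap score here is a deliberate simplification (not real BM25, and the neural-embedding path is omitted):

```python
# Toy external-scoring retriever: rank pages by query-term overlap and
# return the ids of the top-k pages, as the BM25/embedding baselines do.
from collections import Counter

def score_overlap(query: str, page: str) -> int:
    q, p = Counter(query.lower().split()), Counter(page.lower().split())
    return sum(min(q[t], p[t]) for t in q)   # multiset term intersection

def top_k(query, pages, k=2, score=score_overlap):
    ranked = sorted(range(len(pages)),
                    key=lambda i: score(query, pages[i]), reverse=True)
    return ranked[:k]
```

The key contrast with ReadAgent is that the scoring function here never sees the query and pages jointly through the LLM; relevance is decided by an external, fixed metric.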
4. Token Efficiency and Scalability
ReadAgent dynamically adapts its effective context length through aggressive gist compression and minimal-necessary lookups. There is no hard retrieval budget; in practice, token usage per query varies:
- ∞-Bench (En.MC, GPT-4o-mini): ~86K tokens input (pages + gists)
- NarrativeQA: ~34K tokens per example
- QuALITY: ~4.8K tokens per example (Laitenberger et al., 4 Jun 2025)
This dynamic scaling stands in contrast to strict token budget enforcement in most RAG baselines, which cap retrieved text at fixed budgets (e.g., 1.5K, 5K, 40K tokens).
Compression applies both at memory formation and episodic replacement: typical queries require 1–3 lookups, with up to 6 for harder or less structured transcripts (e.g., QMSum).
5. Empirical Evaluation
ReadAgent has been evaluated on multiple long-document reading comprehension benchmarks:
- QuALITY: Multiple-choice, 2K+ MC questions, passages of 2–8K tokens. On dev, ReadAgent-P (look up 1–5 pages) achieves 86.63% accuracy, surpassing both BM25 (84.28%) and full-text (85.83%) approaches (Lee et al., 2024).
- NarrativeQA: Free-form QA over books and movie scripts; documents of up to 343K words. ReadAgent outperforms neural and BM25 retrieval methods by +12.97% absolute in strict LLM ratings and +31.98% ROUGE-L on full-length books.
- QMSum: Meeting transcription summarization. ReadAgent-S (sequential look-up) outperforms all baselines in LLM rating-1 by +6.5%.
A summary of key results:
| Task | Method | Compression Rate | Metric | Score |
|---|---|---|---|---|
| QuALITY (MC) | ReadAgent-P | 66.26% | Accuracy | 86.63% |
| NarrativeQA (Gutenberg) | ReadAgent-P (1 pg) | 94.84% | LLM Rating-1 | 59.98% |
| QMSum | ReadAgent-S | 70.34% | LLM Rating-1 | 46.57% |
ReadAgent successfully extends effective LLM context length by 3–20x on reading comprehension tasks, supporting document inputs far beyond the nominal context window of the base LLM (Lee et al., 2024, Laitenberger et al., 4 Jun 2025).
6. Comparative Analysis and Limitations
Recent controlled benchmarks indicate that as LLM context size increases (e.g., GPT-4o, tens of thousands of tokens), simpler baselines such as Document’s Original Structure RAG (DOS RAG) match or outperform multi-stage agentic methods, including ReadAgent, under equivalent or smaller token budgets (Laitenberger et al., 4 Jun 2025). Key reasons include:
- Direct retrieval from original text yields higher answer accuracy than retrieval from summaries.
- Higher recall (larger raw context) is more valuable than reduced “precision” from aggressive filtering.
- Preserving original passage order improves narrative continuity.
- The extra model calls and token overhead of multi-stage pipelines are not offset by accuracy gains, given that modern LLMs can process large, ordered contexts directly.
In ablation studies, DOS RAG consistently surpasses ReadAgent by 2–3% accuracy on En.MC and NarrativeQA, while using simpler mechanisms and lower average context sizes.
7. Relationship to Other Agent-Based Long-Context Systems
Variants and generalizations of the ReadAgent paradigm have emerged. Notably, GraphReader structures documents as graphs of “atomic facts” and “key elements,” employing an agent to explore these compact representations in a coarse-to-fine manner (Li et al., 2024). GraphReader achieves linear scaling in inference complexity and, using a 4K token window, outperforms GPT-4-128k on LV-Eval benchmarks up to 256K tokens, highlighting the ongoing relevance of agentic, selective reading for scenarios where extreme context length or nonlinear narrative structure presents challenges.
These approaches share ReadAgent’s fundamental principles—episodic division, compressed gist memories, selective re-reading—but differ in the use of explicit document structure (e.g., graphs), modular function calls for exploration, and algorithmic planning by the agent.
ReadAgent exemplifies a class of prompting-based, agentic LLM methods engineered for efficient long-context QA and reading comprehension in settings where both context window constraints and LLM attention bottlenecks present obstacles. Its effectiveness, scalability, and interaction with retriever-based and fully-attentive methods establish it as both a reference point and a practical tool in the evolving landscape of long-context language modeling and retrieval-augmented reasoning (Lee et al., 2024, Laitenberger et al., 4 Jun 2025).