
NarrativeQA Reading Comprehension Challenge

Updated 18 January 2026
  • NarrativeQA is a benchmark that assesses narrative reading comprehension by requiring models to synthesize information across extensive texts like full books and movie scripts.
  • It features both summary-based and full-text QA settings, employing abstractive and extractive methods evaluated with metrics such as BLEU, ROUGE, and LLM-as-a-Judge.
  • The challenge drives research in retrieval, event-centric reasoning, and long-context integration, pushing the boundaries of current NLP models.

The NarrativeQA Reading Comprehension Challenge is a benchmark designed to assess the ability of models to perform narrative-level reasoning by answering questions based on the content of entire books or movie scripts, pushing the boundaries of long-context machine reading comprehension. In contrast to standard question answering tasks—which typically operate on brief passages or encyclopedic articles—NarrativeQA requires models to integrate information, reason about characters and events, and generate or select abstractive answers that synthesize material scattered over hundreds of pages or scenes (Kočiský et al., 2017, Sang et al., 2022).

1. Dataset Design and Motivations

NarrativeQA comprises 1,572 narrative texts: approximately half full-length books from Project Gutenberg, and half movie scripts from sources such as IMSDb and DailyScript. Each narrative is paired with a human-written summary of roughly 650 tokens (drawn from Wikipedia plot summaries) and, crucially, a collection of roughly 30 question–answer (QA) pairs per story, totaling 46,765 QA pairs. Questions were authored against the summaries, not the full texts, ensuring that answers typically require non-local, high-level narrative understanding. Answers are concise (4.7 tokens on average, seldom exceeding 40 tokens) and often highly abstractive; only about 30% are exact substrings of the source summary or text (Kočiský et al., 2017, Sang et al., 2022, Bonomo et al., 15 Oct 2025).

The dataset enforces a split by book/script, not merely by individual examples, to test generalization. Narratives span diverse genres (mystery, romance, science fiction, etc.), with a wide range of question types probing character relationships (~30%), event timelines (~15%), causal reasoning (~9%), and methods or processes (~8%). Approximately 75% of questions on the full-story setting involve events—either event components (~34%) or relations between events (~41%)—which starkly differentiates NarrativeQA from entity- or fact-centric open-domain QA (Mou et al., 2021, Bonomo et al., 15 Oct 2025).
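The story-level split can be sketched as grouping QA pairs by narrative before partitioning, so no book or script contributes examples to more than one split. This is a minimal illustration, assuming QA records carry a `story_id` field (a hypothetical layout, not the official release format):

```python
import random

def split_by_story(qa_pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Partition QA pairs so every story lands in exactly one split.

    Splitting by story, not by example, prevents a model from training
    on other questions about a narrative it will be tested on.
    """
    story_ids = sorted({qa["story_id"] for qa in qa_pairs})
    random.Random(seed).shuffle(story_ids)
    n = len(story_ids)
    n_train = int(ratios[0] * n)
    n_valid = int(ratios[1] * n)
    train_ids = set(story_ids[:n_train])
    valid_ids = set(story_ids[n_train:n_train + n_valid])
    splits = {"train": [], "valid": [], "test": []}
    for qa in qa_pairs:
        if qa["story_id"] in train_ids:
            splits["train"].append(qa)
        elif qa["story_id"] in valid_ids:
            splits["valid"].append(qa)
        else:
            splits["test"].append(qa)
    return splits
```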

2. Task Formulations and Evaluation Protocols

NarrativeQA presents two main QA settings:

  • Summary-based QA: Each question is answered using only the human-written summary, enabling direct training and evaluation over ~650-token contexts.
  • Full-text (book/script) QA: Each question is answered with access to the entire narrative text (up to 430K tokens), but not the summary. This “full-story” setting is uniquely demanding due to the extreme length and diffuse evidence.

Each setting supports both answer generation (free-form abstractive) and answer selection (ranking a set of candidate answers). Additionally, an extractive span-prediction formulation is used in some work, where the model outputs start and end positions in the retrieved context window (Kočiský et al., 2017, Chaudhary et al., 2018, Nishida et al., 2019).
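At decode time, the extractive span-prediction formulation reduces to picking the highest-scoring valid (start, end) pair from the model's boundary scores. A minimal sketch, with plain score lists standing in for model logits:

```python
def best_span(start_scores, end_scores, max_len=40):
    """Pick the (start, end) token span maximizing
    start_scores[i] + end_scores[j], subject to i <= j < i + max_len,
    as in extractive readers such as BiDAF. Real systems would apply
    this over a retrieved context window's logits."""
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best
```

The `max_len` cap mirrors the observation that gold answers rarely exceed 40 tokens.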

Evaluation Metrics include BLEU-1/4, METEOR, ROUGE-L for generative tasks, and Mean Reciprocal Rank (MRR) for ranking tasks. These metrics are computed with lowercased text and after punctuation normalization (Kočiský et al., 2017, Nishida et al., 2019, Chaudhary et al., 2018). However, n-gram overlap metrics have limited correlation with human judgments in this abstractive, paraphrastic regime; recent work instead advocates for LLM-as-a-Judge metrics that directly score candidate answers using reference answers and summaries (Bonomo et al., 15 Oct 2025).
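The normalization and metric computations can be sketched in a few lines. This is a simplified illustration of ROUGE-L F1 and MRR, not the official evaluation script:

```python
import re

def normalize(text):
    """Lowercase and strip punctuation, mirroring the preprocessing
    described for NarrativeQA's n-gram metrics."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 via longest common subsequence over normalized tokens."""
    c, r = normalize(candidate), normalize(reference)
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def mean_reciprocal_rank(first_correct_ranks):
    """MRR for answer selection: one 1-based rank of the first correct
    candidate per question (None if no candidate is correct)."""
    total = sum(1.0 / rank for rank in first_correct_ranks if rank)
    return total / len(first_correct_ranks)
```

Abstractive, paraphrased answers score poorly under such lexical metrics even when semantically correct, which is the motivation for LLM-based grading.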

3. Modeling Approaches for NarrativeQA

The challenge presented by NarrativeQA—extreme context length, complex event-centric and relational questions, non-extractive answers—has driven development in several directions:

  • Retrieval + Reader Pipelines: Early models (e.g., IR-overlap baselines, ASReader, BiDAF) use ad hoc retrieval of the most relevant passages via n-gram or embedding-based similarity, followed by span extraction or pointer-generator decoding over the retrieved context (Kočiský et al., 2017, Tay et al., 2019).
  • Generative Pointer-Generator Models with Curriculum Learning: IAL-CPG (Tay et al., 2019) combines an Introspective Alignment Layer (IAL) employing block-based local self-attention with a dual-axis curriculum regime (answerability and understandability), achieving significant relative gains (+51% BLEU-4, +17% ROUGE-L). The pointer-generator decoder softens the copy mechanism, enabling generative answers even when the gold answer does not appear verbatim in the context, and curriculum alternation increases robustness to retrieval variation.
  • Transformer/Memory-Augmented Methods: Masque (Nishida et al., 2019) leverages a shared Transformer encoder and style-conditioned, multi-source pointer-generator decoder to transfer multi-style NLG capabilities into the concise, pronoun- and paraphrase-rich NarrativeQA setting—yielding state-of-the-art abstractive summary performance.
  • Memory-Augmented and Hierarchical Models: ReadTwice (Zemlyanskiy et al., 2021) employs a two-pass scheme where the full book is split into overlapping segments, each encoded in parallel. Entity-linked memories are extracted and used to augment a second pass through the text; this enables better encoding of long-range dependencies, with substantial improvements in ROUGE-L and BLEU-1 scores.
  • Advanced Open-Domain QA Transfers: Recent studies adapt ODQA techniques: dense/sparse retrieval using BM25 or BERT rankers, distant supervision, ICT pretraining, and fusion-in-decoder generative readers. Enhanced systems such as BART+FiD, with “book prereading” adaptation, achieve SOTA in full-book settings (e.g., BART+FiD+Book Prereading+DS+ICT: 29.21% ROUGE-L) (Mou et al., 2021).
  • Long-context LLMs: The advent of models with 1M+ token context windows (e.g., Gemini 1.5 Pro) enables ingesting whole books directly. Empirical studies show that such models, when answering with the full book as context, outperform retrieval-augmented and no-context baselines by significant margins using LLM-based evaluators (Bohnet et al., 2024).
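The retrieval step shared by the pipeline approaches above can be sketched as overlap-based chunk ranking. This is a toy illustration of the IR-overlap idea, not a reconstruction of any specific published system:

```python
from collections import Counter

def chunk(tokens, size=200, stride=100):
    """Split a long narrative into overlapping token windows."""
    return [tokens[i:i + size]
            for i in range(0, max(1, len(tokens) - size + 1), stride)]

def retrieve(question_tokens, chunks, k=2):
    """Rank chunks by unigram overlap with the question and return the
    top-k as the reader's context. Embedding-based retrievers replace
    this score with vector similarity."""
    q = Counter(question_tokens)
    def score(c):
        bag = Counter(c)
        return sum(min(q[w], bag[w]) for w in q)
    return sorted(chunks, key=score, reverse=True)[:k]
```

Because most gold answers are abstractive, such lexical scoring often misses the truly relevant passages, which is the retrieval bottleneck discussed in Section 6.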

4. Evaluation Metrics and Meta-Evaluation

NarrativeQA originally used n-gram–based metrics (BLEU, ROUGE-L, METEOR, F1, EM); however, these metrics show poor system-level correlation with human judgments (Kendall's τ ≈ 0.03 for ROUGE-L/EM/F1). Only METEOR achieves moderate correlation (τ ≈ 0.44 on the LiteraryQA subset) (Bonomo et al., 15 Oct 2025). LLM-as-a-Judge approaches—where LLM judges (e.g., Prometheus 2 7B, Claude 3.7 Sonnet, GPT-4.1) score candidate answers using both references and summaries—consistently yield much higher correlation (τ ≈ 0.69 when provided with summary context).
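System-level correlations like these compare, across a set of systems, each metric's scores against human judgments using Kendall's rank correlation. A minimal tie-free sketch of the statistic:

```python
def kendall_tau(x, y):
    """Kendall rank correlation between two score lists (e.g., a metric's
    per-system scores vs. human per-system ratings). No tie correction,
    for brevity."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```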

Summary-based LLM judgments surpass reference-only settings, particularly in handling paraphrase and minor variation. Automated relative ranking via side-by-side pairwise evaluation and Bradley-Terry modeling further differentiates system performance and aligns closely with human preferences (Bohnet et al., 2024).
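The Bradley-Terry step can be sketched with the standard minorization-maximization updates over a matrix of pairwise judge preferences (the win counts below are invented for illustration):

```python
def bradley_terry(wins, n_iters=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix via MM
    updates. wins[i][j] is how many side-by-side comparisons system i
    won against system j; higher fitted strength means a system is
    preferred more often."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x * n / total for x in new_p]  # normalize for identifiability
    return p
```

Under this model, system i is preferred to system j with probability p[i] / (p[i] + p[j]), which yields a full ranking from pairwise judgments alone.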

The LiteraryQA subset, curated and cleaned for narrative purity and QA quality, establishes best practices for NarrativeQA evaluation, combining LLM-validated item selection with robust automatic grading.

5. System Performance and Major Findings

The following table summarizes key results for prominent systems on NarrativeQA full-book or summary settings (as reported):

| Model | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | MRR | Source |
|---|---|---|---|---|---|---|
| BiDAF (Summary) | 33.7 | 15.5 | 15.4 | 36.3 | – | (Nishida et al., 2019) |
| Masque (Summary) | 54.1 | 30.4 | 26.1 | 59.9 | – | (Nishida et al., 2019) |
| IAL-CPG (Full Story, 4K ctx) | – | 2.47 | – | 17.67 | – | (Tay et al., 2019) |
| ReadTwice (Full Book) | 21.1 | 4.0 | 7.0 | 23.3 | – | (Zemlyanskiy et al., 2021) |
| BART+FiD+Preadapt (Full Book) | – | – | – | 29.21 | – | (Mou et al., 2021) |
| WGN–MLP (Selection, Summary) | – | – | – | – | 0.621 | (Chaudhary et al., 2018) |
| BookQA, KV-MemNet+BERT | – | – | – | – | 0.376 | (Angelidis et al., 2019) |

Absolute performance on full-book QA, even with the best models, remains far below human upper bounds (e.g., BART+FiD+Book Prereading at 29.21% ROUGE-L versus oracle IR at 39.32% and human performance at 57–59.9% ROUGE-L) (Mou et al., 2021, Nishida et al., 2019). LLMs with access to the full book context (Gemini 1.5 Pro, 1M tokens) further raise factual accuracy to approximately 80% (AutoAIS metric), compared to ~60–68% for no-context or RAG-4k retrieval regimes (Bohnet et al., 2024).

Recent meta-evaluations confirm that relying solely on overlap metrics yields misleading rankings for this task. Instead, LLM-based grading and summary-enhanced references are essential for robust evaluation (Bonomo et al., 15 Oct 2025).

6. Key Challenges and Open Research Problems

Several persistent challenges are highlighted throughout the literature:

  • Retrieval Difficulty: Locating the handful of truly pertinent passages within a 100k+ token narrative remains a major bottleneck; superficial lexical matching is typically insufficient as most answers are highly abstractive and not local lexical matches (Mou et al., 2021, Angelidis et al., 2019).
  • Event-centric and Multi-hop Reasoning: The high proportion of event and relation questions requires synthesizing events across narrative spans, resolving coreference, and interpreting elaborate causal or temporal chains. Existing models lack robust event argument identification and long-range relational semantics (Kočiský et al., 2017, Mou et al., 2021).
  • Commonsense and Pragmatic Reasoning: Many questions necessitate implicit inference—e.g., understanding character motivations, tracking narrative arcs, inferring unstated causal connections—which current architectures and pretraining regimes do not fully address (Angelidis et al., 2019, Sang et al., 2022).
  • Evaluation Fragility: Gold-standard answers exhibit paraphrastic diversity; as such, n-gram metrics penalize legitimate variants. LLM-as-a-Judge mitigates this but incurs resource costs (Bonomo et al., 15 Oct 2025).
  • Data Quality and Benchmark Limitations: Noisy documents (boilerplate, mismatched text), erroneous or ill-posed QA pairs, and lack of fine-grained genre or structural annotation led to the proposal and construction of LiteraryQA, a high-quality filtered subset (Bonomo et al., 15 Oct 2025).

7. Prospects and Recommendations

Current analyses suggest that progress on NarrativeQA necessitates:

  • Better retrieval objectives, possibly combining dense and sparse signal, tuned specifically for cross-event coherence and summary generation (Mou et al., 2021).
  • Architectures that directly incorporate event structure (event schema induction, graph-based discourse models) and long-range memory, enabling cross-passage synthesis in very large contexts (Zemlyanskiy et al., 2021, Bohnet et al., 2024).
  • Advanced evaluation: LLM-as-a-judge models with access to both references and narrative summaries, summary-based human-in-the-loop assessment, and granular diagnostics for event/character/setting reasoning (Bonomo et al., 15 Oct 2025).
  • Rigorous data curation, with pipelines for deduplication, QA correction, and boilerplate removal, as implemented in LiteraryQA (Bonomo et al., 15 Oct 2025).
  • Expanded and diversified benchmarks targeting functional structure, setting, pragmatic inference, and challenging genre phenomena (e.g., flashback, unreliable narration) (Sang et al., 2022).

NarrativeQA—together with its LiteraryQA subset and recent LLM-centric evaluation methodologies—thus defines the frontier task for narrative reading comprehension at scale, revealing both the limits of current systems and the requirements for models capable of deep, integrative narrative understanding.
