
Narrative Continuity Test (NCT)

Updated 26 January 2026
  • The Narrative Continuity Test (NCT) is a formal framework for defining and measuring narrative coherence, entity tracking, and diachronic continuity in multimodal AI outputs.
  • It employs detailed metrics such as precision, recall, and composite QA accuracy to assess performance and diagnose issues like hallucination and omission.
  • NCT guides the design of AI systems by highlighting the need for robust narrative persistence and consistent agent identity over time.

The Narrative Continuity Test (NCT) is a suite of formalized evaluation frameworks for measuring how well AI systems—especially multimodal LLMs (MLLMs) and generative video/text models—maintain coherent, grounded, and diachronic continuity in generated outputs or interactions. Across its diverse instantiations, NCT benchmarks narrative persistence in terms of perceptual entity tracking in videos, narrative expression in long-form video generation, resistance to prior-induced hallucination or omission, and, at the conceptual level, the diachronic identity of AI agents themselves.

1. Formal Definitions and Core Concepts

NCT formalism is rooted in the principle that robust AI systems must consistently represent, track, and reason about narrative structures—whether as temporal sequences of video events, evolving entity-centric storylines, or the persistent identity of dialog agents.

  • Entity-Centric Video NCT (Ha et al., 3 Jan 2026): Given a video $V = \{I(t_0), I(t_1), \dots, I(t_N)\}$, main entities $\{e_1, \ldots, e_K\}$ are extracted, each with a temporally indexed trajectory $\tau_{e_i} = \{(t_{ij}, b_{ij}, a_{ij}, s_{ij}, o_{ij})\}$ (timestamp, bounding box, action, scene, outfit).
  • Long-Video Generation NCT (Feng et al., 15 Jul 2025): Narrative is decomposed into Temporal Narrative Atoms (TNAs): the smallest unit of continuous visual narrative, segmented at scene, object, or action changes. A prompt $p$ explicitly encodes $n$ TNAs, controlling narrative complexity.
  • Agent Identity NCT (Natangelo, 28 Oct 2025): At each time $t$, an agent’s state $S(t)$ is a 5-tuple $(M, G, E, V, R)$ (Situated Memory; Goal Set; Error Register; Voice Profile; Persona/Role), and continuity is measured along the corresponding axes.
  • Narrative Prior NCT (Lee et al., 9 Nov 2025): By algorithmically inserting extraneous events into composite videos, the protocol quantifies model tendencies to hallucinate or omit content for narrative fluency.
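The entity-centric representation above can be sketched as a small data structure; the field names and the `state_at` lookup are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the entity-centric NCT representation:
# each main entity e_i carries a temporally indexed trajectory of
# (timestamp, bounding box, action, scene, outfit) records.
# Field names are assumptions, not the benchmark's actual schema.

@dataclass
class TrajectoryStep:
    t: float        # timestamp t_ij
    bbox: tuple     # bounding box b_ij as (x1, y1, x2, y2)
    action: str     # action label a_ij
    scene: str      # scene context s_ij
    outfit: str     # outfit descriptor o_ij

@dataclass
class Entity:
    name: str
    trajectory: list = field(default_factory=list)

    def state_at(self, t: float):
        """Return the most recent trajectory step at or before time t."""
        steps = [s for s in self.trajectory if s.t <= t]
        return max(steps, key=lambda s: s.t) if steps else None

e1 = Entity("e1")
e1.trajectory.append(TrajectoryStep(0.0, (10, 10, 50, 90), "walking", "street", "red coat"))
e1.trajectory.append(TrajectoryStep(12.5, (30, 12, 70, 95), "sitting", "cafe", "red coat"))
print(e1.state_at(15.0).action)  # -> sitting
```

Queries such as "what action does $e_i$ perform at $t_j$?" then reduce to lookups against these trajectories.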

Overall, the NCT is mathematically instantiated as a set of model queries $Q_m$ grounded in explicit, temporally anchored entity/state representations, scored by the match between predicted answers $\hat{A}_m$ and references $A_m$, i.e.,

$$\mathrm{Accuracy}_{\mathrm{NCT}} = \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\left[\hat{A}_m = A_m\right].$$
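The score is a plain exact-match average over the $M$ queries. A minimal sketch, assuming lowercase/whitespace normalization of answers (real protocols may instead use MLLM-judged equivalence):

```python
# Minimal sketch of the NCT accuracy score: an exact-match indicator
# over M model queries, averaged. The string normalization here is an
# assumption; real protocols may use MLLM-judged answer equivalence.

def nct_accuracy(predicted, reference):
    assert len(predicted) == len(reference)
    matches = sum(1 for a_hat, a in zip(predicted, reference)
                  if a_hat.strip().lower() == a.strip().lower())
    return matches / len(reference)

preds = ["yes", "Cafe", "e2"]
refs  = ["yes", "cafe", "e3"]
print(nct_accuracy(preds, refs))  # 2 of 3 match -> 0.666...
```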

2. Methodological Frameworks and Metric Design

2.1 Entity-Centric Narrative Reasoning

The NarrativeTrack benchmark (Ha et al., 3 Jan 2026) applies a fully automated pipeline:

  • Entity Detection: The final box set $\mathcal{B}_{\mathrm{final}}$ is produced by spatially merging Detectron2 and OWLv2 bounding boxes.
  • Tracking: Re-ID embedding clustering identifies the top $k$ main entities; face recognition and MLLM voting suppress drift.
  • Contextual Recognition: For each entity and timepoint, Gemini-2.5-Pro assigns action, outfit, and scene context.
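The spatial-merging step in the detection stage can be sketched as an IoU-based fusion of the two detectors' outputs; the 0.5 threshold and the union-of-extents merge rule are assumptions for illustration, not the benchmark's published procedure.

```python
# Sketch of spatial box merging: boxes from two detectors are fused
# when their IoU exceeds a threshold. The threshold (0.5) and the
# union-of-extents merge rule are illustrative assumptions.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_boxes(boxes_a, boxes_b, thr=0.5):
    merged = list(boxes_a)
    for b in boxes_b:
        match = next((i for i, a in enumerate(merged) if iou(a, b) > thr), None)
        if match is None:
            merged.append(b)          # novel detection: keep it
        else:                         # overlapping detection: fuse extents
            a = merged[match]
            merged[match] = (min(a[0], b[0]), min(a[1], b[1]),
                             max(a[2], b[2]), max(a[3], b[3]))
    return merged

print(merge_boxes([(0, 0, 10, 10)], [(1, 1, 10, 10), (50, 50, 60, 60)]))
```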

NCT is operationalized via the Compositional Reasoning Progression (CRP):

  • Level 1: Entity existence over time (e.g., binary/MC presence queries).
  • Level 2: Attribute/scene/action shifts (e.g., “What action does $e_i$ perform at $t_j$?”).
  • Level 3: Ambiguity/disambiguation (selecting target among similar entities).

Metrics include precision, recall on detection/tracking, and fine-grained accuracy per CRP dimension and query type.

2.2 Narrative Expression in Video Generation

NarrLV (Feng et al., 15 Jul 2025) structures evaluation around controlled variation in TNAs.

  • Prompt Pipeline: Using large-scale LLM-parsed scene–object pairs (${\sim}16{,}000$), prompts are generated specifying $n$ TNAs by factor (scene, object attribute, object action).
  • MLLM QA Scoring: For each (prompt, generated video) pair, LLMs generate targeted questions for three dimensions:
    • Element Fidelity ($R_{\mathrm{fid}}$): Are required elements present?
    • Unit Coverage ($R_{\mathrm{cov}}$): Are all $n$ TNAs expressed?
    • Unit Coherence ($R_{\mathrm{coh}}$): Are transitions between TNAs logical?

Scoring is operationalized via repeated MLLM answer aggregation, with $N_{\mathrm{exp}} = R_{\mathrm{cov}} \times n$ denoting effective TNA expression. Human agreement rates (up to $0.81$ for $R_{\mathrm{fid}}$) substantiate metric alignment.
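The aggregation step can be sketched as follows; majority voting over repeated yes/no answers is an assumed aggregation rule, used here only to make the $N_{\mathrm{exp}} = R_{\mathrm{cov}} \times n$ relation concrete.

```python
from collections import Counter

# Sketch of NarrLV-style coverage scoring: repeated MLLM yes/no answers
# per TNA are aggregated (majority vote here -- an assumption), R_cov is
# the fraction of TNAs judged expressed, and N_exp = R_cov * n is the
# effective number of expressed TNAs.

def aggregate(answers):
    """Majority vote over repeated MLLM answers for one TNA."""
    return Counter(answers).most_common(1)[0][0]

def coverage_score(per_tna_answers):
    n = len(per_tna_answers)
    covered = sum(1 for ans in per_tna_answers if aggregate(ans) == "yes")
    r_cov = covered / n
    return r_cov, r_cov * n  # (R_cov, N_exp)

# Prompt with n = 3 TNAs, three MLLM passes per TNA:
answers = [["yes", "yes", "no"], ["no", "no", "yes"], ["yes", "yes", "yes"]]
r_cov, n_exp = coverage_score(answers)
print(r_cov, n_exp)  # -> 0.666... 2.0
```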

2.3 Narrative Prior and Error Diagnosis

In NOAH (Lee et al., 9 Nov 2025), NCT isolates model reliance on narrative priors:

  • Composite Video Construction: Target videos from ActivityNet-Captions are spliced with semantically similar/dissimilar clips at controlled temporal positions; 9,000 distinct composites are generated for systematic ablation.
  • QA Tasks: Existence, Temporal, and Narrative tasks probe event grounding and susceptibility to plausible fabrication.
  • Metrics: Caption Hallucination Rate (CHR), Caption Omission Rate (COR), Event-level Hallucination/Omission (EHR/EOR/IEOR), with rigorous pairwise QA accuracy.
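A toy sketch of the caption-level rates, under assumed definitions (the exact NOAH formulas may differ): CHR as the fraction of captions mentioning any inserted, non-target event, and COR as the fraction omitting at least one target event.

```python
# Illustrative sketch of caption hallucination/omission rates.
# Assumed definitions, not NOAH's exact formulas: CHR = fraction of
# captions mentioning any inserted (extraneous) event; COR = fraction
# of captions omitting at least one target event. Substring matching
# stands in for real event grounding.

def chr_cor(captions, target_events, inserted_events):
    halluc = sum(1 for c in captions
                 if any(e in c for e in inserted_events))
    omit = sum(1 for c in captions
               if any(e not in c for e in target_events))
    m = len(captions)
    return halluc / m, omit / m

captions = ["a man cooks then eats dinner",
            "a man cooks dinner and walks a dog"]
target = ["cooks", "eats"]          # events in the source video
inserted = ["walks a dog"]          # spliced-in extraneous event
print(chr_cor(captions, target, inserted))  # -> (0.5, 0.5)
```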

2.4 Agent Identity and Diachronic Continuity

The conceptual NCT proposed in (Natangelo, 28 Oct 2025) measures agent persistence along five axes, each with explicit formal metrics—for example, recall rate for Situated Memory, cosine style similarity for Stylistic & Semantic Stability, longitudinal survival of goals and error corrections, and strict role alignment.
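The stylistic-stability axis reduces to a cosine similarity between style representations of an agent's outputs at two times. A minimal sketch, where the toy bag-of-words embedding is an assumption (the framework only requires some style embedding):

```python
import math

# Sketch of the stylistic-stability metric: cosine similarity between
# style embeddings of an agent's outputs at two times. The bag-of-words
# embedding over marker words is a toy assumption; any style embedding
# would slot into the same cosine comparison.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def bow(text, vocab):
    """Toy style embedding: counts of stylistic marker words."""
    words = [w.strip(".,") for w in text.lower().split()]
    return [words.count(w) for w in vocab]

vocab = ["indeed", "perhaps", "lol", "formally"]
t0 = bow("indeed, formally speaking, perhaps", vocab)
t1 = bow("perhaps we should, indeed, proceed formally", vocab)
print(round(cosine(t0, t1), 3))  # identical marker profile -> 1.0
```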

3. Empirical Findings and Benchmark Results

Experimental outcomes across benchmarks offer convergent evidence:

  • NarrLV: As $n$ (TNA count) increased, $R_{\mathrm{cov}}$ and $R_{\mathrm{coh}}$ declined sharply; $R_{\mathrm{fid}}$ was relatively stable ($0.65$–$0.85$). Effective expressible complexity, $N_{\mathrm{exp}}$, plateaued near 2 TNAs even as $n$ approached 6, indicating sub-linear expressivity scaling. The best open-source long-video models exceed foundation models on $R_{\mathrm{cov}}$ and $R_{\mathrm{coh}}$, but are ultimately bounded by foundational capacity (Feng et al., 15 Jul 2025).

| Model | $R_{\mathrm{fid}}$ | $R_{\mathrm{cov}}$ | $R_{\mathrm{coh}}$ |
|------------|------|------|------|
| Wan | 0.82 | 0.71 | 0.52 |
| CogVideoX | 0.74 | 0.61 | 0.42 |
| RIFLEx | 0.69 | 0.56 | 0.40 |
| FreeLong | 0.79 | 0.58 | 0.40 |
| FreeNoise | 0.78 | 0.57 | 0.39 |
| FIFO-Diff. | 0.76 | 0.58 | 0.38 |
| TALC | 0.48 | 0.32 | 0.22 |

  • NOAH: Open-source Video LLMs produce extreme omission (COR $\approx 0.98$–$1.00$) and hallucination (CHR $\ge 0.7$) rates even under minimal stimulus. Closed models modestly improve (CHR $0.44$–$0.49$, COR $0.57$–$0.86$) but continue to omit or fabricate nearly half of events. Error patterns are strongly modulated by semantic similarity and the temporal insertion point—end insertions preferentially induce global narrative rewriting (Lee et al., 9 Nov 2025).
  • NarrativeTrack: Entity-centric NCT reveals a consistent trade-off: general-purpose MLLMs offer strong perceptual grounding but poor temporal narrative continuity; video-specialized MLLMs show improved temporal reasoning yet hallucinate entity context under ambiguous dynamics. Directional bias experiments confirm that forward and backward narrative queries are not equally robust, exposing systemic temporal asymmetry (Ha et al., 3 Jan 2026).
  • Agent NCT: Empirical case analyses (Character.AI, Grok, Replit, Air Canada) consistently expose failure to persist goals, memory, corrections, style, and persona across significant temporal gaps—a direct outcome of stateless inference (Natangelo, 28 Oct 2025).

4. Recommendations and Limitations

Robust NCT Regimes:

  • Combine open-ended (captioning) and targeted (QA) probes for comprehensive failure detection.
  • Apply controlled, parametric manipulations (e.g., TNA complexity, entity ambiguity, composite insertion) to attribute model lapses to architectural, data, or inductive biases.
  • Employ denser temporal sampling and multimodal evidence grounding to mitigate narrative prior errors.
  • For agent systems, enforce persistent internal state with explicit state update architectures for memory, goals, error correction, style, and persona.
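The last recommendation—explicit state updates over the agent's five axes—can be sketched as a persistent state object updated per interaction; the update API is an illustration, not the paper's proposed architecture.

```python
from dataclasses import dataclass, field

# Hedged sketch of an explicit persistent-state architecture for the
# agent NCT's five axes (Memory, Goals, Error register, Voice, Role).
# The update API and turn schema are illustrative assumptions.

@dataclass
class AgentState:
    memory: dict = field(default_factory=dict)   # M: situated memory
    goals: set = field(default_factory=set)      # G: goal set
    errors: list = field(default_factory=list)   # E: error register
    voice: dict = field(default_factory=dict)    # V: voice profile
    role: str = "assistant"                      # R: persona/role

    def update(self, turn):
        """Apply one interaction's deltas instead of discarding state."""
        self.memory.update(turn.get("facts", {}))
        self.goals |= set(turn.get("new_goals", []))
        self.goals -= set(turn.get("done_goals", []))
        self.errors.extend(turn.get("corrections", []))

state = AgentState(role="travel planner")
state.update({"facts": {"user_city": "Oslo"},
              "new_goals": ["book flight"],
              "corrections": ["date was 12 May, not 21 May"]})
print(state.memory["user_city"], state.goals)
```

Stateless inference discards this object between sessions; the NCT axes measure exactly what is lost when it is not carried forward.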

Known Limitations:

  • Most current NCT instantiations depend on automated entity/context annotation pipelines, which, while scalable, may miss subtle modeling errors or emergent behaviors.
  • Human agreement substantiates metric validity, but direct correlations, e.g., between $R_{\mathrm{fid}}$ and subjective naturalness, are not exhaustively characterized.
  • Agent identity NCT is conceptual/propositional; no large-scale, automated, cross-session evaluation corpus is yet available (Natangelo, 28 Oct 2025).

5. Significance and Research Implications

NCT advances narrative evaluation beyond fluency or local accuracy, introducing rigorous, fine-grained measurements of continuity, coherence, and grounding. In video, this facilitates diagnosis of entity tracking, narrative drift, and the trade-off between perceptual detail and temporal coherence (Ha et al., 3 Jan 2026, Lee et al., 9 Nov 2025, Feng et al., 15 Jul 2025). In dialog or agent scenarios, it highlights the necessity for architectures supporting diachronic identity persistence—a prerequisite for trustworthy, reliable long-term AI operation (Natangelo, 28 Oct 2025).

This suggests a paradigm shift: Natural language or video generation models should not only optimize for static task success but for persistent, temporally structured narrative reasoning and identity—the backbone of robust, persistent, agentic behavior.

6. Relationship to Adjacent Frameworks

NCT substantially generalizes earlier video QA and evaluation protocols:

  • Compared to VBench, Video-MME, LVBench: NCT introduces explicit entity-centric persistence, temporal ambiguity, and complex compositional queries (Ha et al., 3 Jan 2026).
  • Versus StoryEval, VBench 2.0: The human alignment rates of NCT QA (0.80–0.81) significantly exceed those of prior benchmarks (Feng et al., 15 Jul 2025).
  • In the context of prior “hallucination/omission” work: NOAH’s NCT defines and quantifies these error classes in terms of narrative priors, not merely as factual errors in isolation (Lee et al., 9 Nov 2025).

The conceptual NCT for agent identity (Natangelo, 28 Oct 2025) stands apart as the first formal 5-axis decomposition of persistence—including situated memory, goal survival, continuous self-correction, stylistic and semantic consistency, and role coherence—demonstrating general applicability to any stateless generative model.

7. Future Directions

Advancing NCT-based evaluation prompts architectural research on compositional memory, hierarchical goal modeling, persistent error tracking, and hybrid visual–language temporal integration. Recommendations include richer evaluation corpora—both synthetic (e.g., controlled composites, ambiguous entities) and real (TV episodes, multi-role dialog)—and development of architectures with explicit persistent state updates to provide identity continuity beyond in-prompt context (Natangelo, 28 Oct 2025, Ha et al., 3 Jan 2026). The expansion and further automation of NCT protocols are likely to become central in benchmarking next-generation AI narrative abilities across modality, timescale, and application domain.
