Agent-as-Judge for Factual Summarization of Long Narratives

Published 17 Jan 2025 in cs.CL | (2501.09993v1)

Abstract: LLMs have demonstrated near-human performance in summarization tasks based on traditional metrics such as ROUGE and BERTScore. However, these metrics do not adequately capture critical aspects of summarization quality, such as factual accuracy, particularly for long narratives (>100K tokens). Recent advances, such as LLM-as-a-Judge, address the limitations of metrics based on lexical similarity but still exhibit factual inconsistencies, especially in understanding character relationships and states. In this work, we introduce NarrativeFactScore, a novel "Agent-as-a-Judge" framework for evaluating and refining summaries. By leveraging a Character Knowledge Graph (CKG) extracted from input and generated summaries, NarrativeFactScore assesses the factual consistency and provides actionable guidance for refinement, such as identifying missing or erroneous facts. We demonstrate the effectiveness of NarrativeFactScore through a detailed workflow illustration and extensive validation on widely adopted benchmarks, achieving superior performance compared to competitive methods. Our results highlight the potential of agent-driven evaluation systems to improve the factual reliability of LLM-generated summaries.

Summary

  • The paper introduces NarrativeFactScore, an Agent-as-Judge framework that uses a Character Knowledge Graph to assess factual accuracy in long narratives.
  • It employs iterative graph-based refinement to improve summary consistency, achieving statistically significant alignment with human factual assessments.
  • This approach advances summarization by reducing errors in character dynamics and plot details, paving the way for broader AI narrative applications.

The paper "Agent-as-Judge for Factual Summarization of Long Narratives" introduces an approach to evaluating and refining summaries of extensive textual narratives. The authors identify a significant gap in current summarization evaluation metrics, which often fail to account for the factual accuracy of summaries, particularly in long narratives exceeding 100K tokens. This work presents NarrativeFactScore, a novel "Agent-as-a-Judge" framework designed to enhance factual consistency in LLM-generated summaries, and grounds the agent's judgments in a Character Knowledge Graph (CKG).

Introduction and Motivation

The rise of LLMs has substantially advanced summarization, with strong scores on lexical and semantic similarity metrics such as ROUGE and BERTScore. These metrics, however, measure surface overlap rather than factual accuracy, leaving summaries prone to errors, especially in tracking character relationships and how they develop. Prior advances such as LLM-as-a-Judge attempt to fill this gap but still show limitations in consistent factual reasoning.

Proposed Method

The authors propose NarrativeFactScore, utilizing an "Agent-as-a-Judge" framework which leverages a CKG for evaluating factual consistency in story summaries. The CKG is developed by extracting character relationships and states from both source texts and generated summaries. This graph-based approach allows NarrativeFactScore to more accurately assess summaries by incorporating complex character dynamics and making the evaluation process interpretable and actionable.
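As a concrete illustration of what such a graph might look like (the class, method names, and name-unification scheme below are our own assumptions, not the authors' implementation), a CKG can be sketched as a set of (character, relation, character) triples with alias resolution so that different surface forms of a name map to one node:

```python
# Hypothetical sketch of a Character Knowledge Graph (CKG):
# nodes are canonical character names, edges are labeled relations.
class CharacterKG:
    def __init__(self):
        self.aliases = {}    # surface name -> canonical name
        self.triples = set() # (character_a, relation, character_b)

    def unify(self, surface, canonical):
        """Map a surface form (e.g. 'Mr. Darcy') to one canonical node."""
        self.aliases[surface] = canonical

    def add_relation(self, a, relation, b):
        a = self.aliases.get(a, a)
        b = self.aliases.get(b, b)
        self.triples.add((a, relation, b))

    def relations_of(self, character):
        c = self.aliases.get(character, character)
        return {t for t in self.triples if c in (t[0], t[2])}

kg = CharacterKG()
kg.unify("Mr. Darcy", "Darcy")
kg.add_relation("Elizabeth", "marries", "Mr. Darcy")
print(kg.relations_of("Darcy"))  # {('Elizabeth', 'marries', 'Darcy')}
```

Unifying names before adding edges keeps relations extracted from different scenes consistent, which is the role the paper's self-consistency-inspired extraction step plays.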

The core processes involve:

  1. CKG Extraction: Character names and relationships are extracted scene by scene and unified across the narrative to maintain consistency, inspired by self-consistency reasoning strategies.
  2. Factuality Scoring: Each summary is decomposed into atomic facts that are validated against the narrative using the CKG, yielding a score that measures factual accuracy relative to the original narrative.
  3. Agent-based Refinement: Using fact-level feedback from NarrativeFactScore (e.g., missing or erroneous facts), summaries are iteratively revised for improved accuracy.
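The scoring step above can be sketched as follows. In the actual framework the fact decomposer and verifier are LLM calls; here they are stubbed with simple placeholder functions (our assumption, not the paper's code), which is enough to show how the score and the refinement feedback come from the same pass:

```python
def decompose_into_facts(summary_sentence):
    # Placeholder: in the real framework an LLM splits a sentence
    # into atomic facts; here each comma-separated clause is one fact.
    return [c.strip() for c in summary_sentence.split(",") if c.strip()]

def is_supported(fact, kg_triples):
    # Placeholder verifier: a fact counts as supported if it mentions
    # both endpoints of some relation in the knowledge graph.
    return any(a in fact and b in fact for a, _, b in kg_triples)

def narrative_fact_score(summary_sentences, kg_triples):
    facts = [f for s in summary_sentences for f in decompose_into_facts(s)]
    if not facts:
        return 1.0, []
    errors = [f for f in facts if not is_supported(f, kg_triples)]
    score = 1 - len(errors) / len(facts)
    return score, errors  # errors double as refinement feedback

kg = {("Elizabeth", "marries", "Darcy"), ("Jane", "sister_of", "Elizabeth")}
score, errors = narrative_fact_score(
    ["Elizabeth marries Darcy", "Jane marries Bingley"], kg)
print(score, errors)  # 0.5 ['Jane marries Bingley']
```

The list of unsupported facts is exactly what the refinement agent would feed back into the summarizer on the next iteration.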

Results and Implications

The framework was validated through extensive experimentation on well-established benchmarks, demonstrating superior factuality and consistency over existing methods. NarrativeFactScore's correlation with human factuality assessments is statistically significant (p = 0.00003). The results are further substantiated by increased factual accuracy and improved performance metrics (ROUGE, BERTScore) when applied to movie scripts and other long-form narrative datasets.

Practical and Theoretical Implications

Practically, integrating NarrativeFactScore into narrative summarization workflows can enhance the factual reliability of generated content, reducing the effort and cost of manual fact verification. Theoretically, the method marks a shift toward graph-based comprehension models that support fine-grained evaluation of textual relationships, a prerequisite for more reliable narrative understanding in AI systems.

Future Developments

Moving forward, this agent-guided approach has potential applications beyond summarization, including narrative generation and interactive system evaluation, where tracking character dynamics and plot intricacies is crucial. The study encourages further work on expanding the breadth and depth of CKGs, enabling more robust, multifaceted narrative understanding and generation within AI systems.

The work distinctly positions itself by addressing the nuanced evaluation of factuality within long narratives, a critical yet underdeveloped area, providing a robust foundation for future AI advancements in natural language understanding and generation tasks.
