
MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs

Published 27 Oct 2025 in cs.CL and cs.AI | (2510.22967v2)

Abstract: The widespread adoption of LLMs raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods for short texts often fail on long-form content due to complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach integrating large-scale long-form datasets, multi-agent verification mechanisms, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset; and develop MAD-Fact, a debate-based multi-agent verification system. We introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while domestic models excel on Chinese content. Our work provides a structured framework for evaluating and enhancing factual reliability in long-form LLM outputs, guiding their safe deployment in sensitive domains.

Summary

  • The paper introduces MAD-Fact, a multi-agent debate framework that decomposes long responses into atomic claims and leverages diverse agent roles for rigorous fact-checking.
  • It develops the LongHalluQA benchmark by employing a multi-stage data curation process to ensure high-quality, culturally relevant Chinese long-form factual evaluation.
  • Empirical results demonstrate MAD-Fact’s efficacy with an 80% win rate and weighted F1 metrics that closely align with human factuality judgments across various LLMs.

Multi-Agent Debate for Robust Long-Form Factuality Evaluation in LLMs

Motivation and Background

Factuality assessment remains a pivotal challenge for the safe deployment of LLMs, especially in high-stakes domains such as law, biomedicine, and education. Existing factuality benchmarks and evaluation protocols typically emphasize short-form question answering, neglecting the complexities introduced by long-form generation, such as multi-perspective reasoning and the cumulative risk of intertwined factual errors. Standard approaches often rely on single-verifier architectures and treat all atomic claims equally, disregarding claim importance. This paper introduces "MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs" (2510.22967), proposing an integrated suite of dataset construction, multi-agent verification, and importance-weighted evaluation metrics.

Figure 1: Illustration of core versus auxiliary factual claims; errors on core claims have greater impact on perceived answer quality.

LongHalluQA: Construction of a Chinese Long-Form Factuality Benchmark

Addressing the scarcity of Chinese resources for factuality evaluation, the authors present LongHalluQA, a large-scale, multi-topic benchmark for long-form factual content. The construction follows a rigorous process involving knowledge base curation, multi-round retrieval and verification, semantic clustering, question expansion, and systematic human review to ensure sample consistency and relevance.

Figure 2: The three-stage pipeline for LongHalluQA: knowledge base creation, question generation, and sample quality control.

The dataset targets coverage across domains including Chinese culture, science, and social studies, with an 86% retention rate of high-quality samples after manual vetting.

Figure 3: Visualization of LongHalluQA's topic distribution, highlighting coverage breadth.

LongHalluQA substantially increases answer length and semantic richness compared to its short-form antecedents HalluQA and ChineseSimpleQA.

Figure 4: Comparative examples of responses in HalluQA vs. LongHalluQA, illustrating increased depth and length.

MAD-Fact: Multi-Agent Debate System Architecture

MAD-Fact formalizes long-form factuality evaluation as a three-stage multi-agent interaction: clerk for atomic claim extraction, jury for fact-checking via multi-agent debate, and judge for aggregating verdicts. The clerk decomposes responses into fact-checkable atoms, filtering out non-verifiable or subjective content. The jury, composed of agents with diversified professional roles, debates each claim using direct, retrieval-based, or adaptive strategies. This design supports both autonomous and mandatory evidence retrieval, as well as adaptive continuation or early consensus.

Figure 5: MAD-Fact framework overview, demonstrating the pipeline from atomic claim decomposition through multi-agent debate to final judgment.
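
The clerk–jury–judge pipeline can be sketched as follows. This is a minimal illustration: the names, the sentence-level claim splitting, and the plain majority-vote aggregation are simplifying assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    votes: list = field(default_factory=list)  # per-agent TRUE/FALSE verdicts

def clerk(response: str) -> list[Claim]:
    """Decompose a long response into fact-checkable atomic claims.
    Here each sentence stands in for one atomic claim."""
    return [Claim(text=s.strip()) for s in response.split(".") if s.strip()]

def jury(claim: Claim, agents) -> None:
    """Each jury agent examines the claim and casts a TRUE/FALSE vote;
    in the real system, agents debate and may retrieve evidence first."""
    claim.votes = [agent(claim.text) for agent in agents]

def judge(claim: Claim) -> bool:
    """Aggregate the jury's verdicts into a final decision (majority vote)."""
    return claim.votes.count(True) > len(claim.votes) / 2
```

In this sketch, an "agent" is any callable mapping a claim string to a boolean; in MAD-Fact proper, the jury agents additionally exchange arguments over multiple debate rounds before voting.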

Agents interact under three systematically constructed debate protocols (autonomous, mandatory, and dynamic retrieval), balancing retrieval cost, prior knowledge utilization, and reliability.

Figure 6: Illustration of the three debate rule types implemented in MAD-Fact with N = 3 agents and two rounds.

Role assignments—including Public, Critic, News Author, Scientist, Psychologist, and Data Analyst—increase epistemic diversity, with incentives provided for external evidence retrieval to counter overconfidence and bias propagation.

Fact Importance Hierarchy and Weighted Evaluation Metrics

The MAD-Fact framework incorporates a hierarchical pyramid-based model for fact importance, mapping atomic claims to frequency-weighted tiers based on their mention across expert reference responses. Claims repeatedly cited receive higher weights, providing non-uniform scoring aligned with human judgment.

Figure 7: Fact importance hierarchy model workflow, aggregating multi-model responses into a weighted pyramid for evaluation.

Weighted precision, recall, and F1 metrics are introduced, accommodating both the claim's factual status and its relative significance, with controlled recall amplification via a tunable γ hyperparameter. The metric's correlation with human ratings reaches r = 0.701 (p = 0.036), validating its reliability.

Figure 8: Pearson correlation between weighted F1 (γ = 0.8) and human factuality scores.
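
A rough sketch of how such importance-weighted metrics might be computed follows. This is an illustrative formulation, not the paper's exact definition; in particular, the way γ enters (here, as an exponent amplifying weighted recall) is an assumption.

```python
def weighted_f1(verdicts, weights, gamma=0.8):
    """Importance-weighted precision/recall/F1 sketch.

    verdicts: dict mapping each atomic claim in the response to its
              TRUE/FALSE status from the judge.
    weights:  dict mapping reference claims to pyramid importance weights.
    gamma:    recall-amplification hyperparameter (assumed here to act
              as an exponent on weighted coverage).
    """
    # Weighted precision: importance-weighted share of the response's
    # claims that are factually correct (unseen claims default to 1.0).
    total = sum(weights.get(c, 1.0) for c in verdicts)
    correct = sum(weights.get(c, 1.0) for c, ok in verdicts.items() if ok)
    precision = correct / total if total else 0.0

    # Weighted recall: importance-weighted coverage of the reference
    # claims, amplified by gamma.
    ref_total = sum(weights.values())
    covered = sum(w for c, w in weights.items() if verdicts.get(c, False))
    recall = (covered / ref_total) ** gamma if ref_total else 0.0

    # Harmonic mean, as in standard F1.
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Because weights are non-uniform, an answer that is correct on heavily weighted (core) claims scores higher than one that is correct only on low-weight auxiliary details, which is the behavior the fact importance hierarchy is designed to reward.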

Empirical Evaluation of MAD-Fact and LLMs

MAD-Fact is systematically evaluated on five fact-checking datasets and two long-form benchmarks (LongFact and LongHalluQA), benchmarking nine LLMs from seven model families.

Figure 9: Factuality evaluation performance of nine selected LLMs on LongFact.

Core findings include:

  • Larger, more recent LLMs display superior factual consistency for English long-form generation. GPT-4-Turbo achieves the highest F1 among closed-source models. Domestic models such as Doubao-1.5-Pro demonstrate parity with state-of-the-art international counterparts, diverging from trends observed in short-form QA.
  • Chinese-specific LLMs, including QwQ-32B and Doubao-1.5-Pro, outperform international models on the LongHalluQA benchmark, underscoring cultural and linguistic gaps in generic LLM training.

    Figure 10: Factuality evaluation of nine LLMs on LongHalluQA; domestic models lead performance.

  • Multi-agent debate with diversified roles and autonomous retrieval reliably surpasses single-verifier baselines (SAFE, FIRE) in overall win rate (MAD-Fact F1 win rate: 80% across label categories and datasets).
  • Ablation analyses confirm the necessity of multi-round debate and external search modules; inter-agent misleading increases when models from distinct families are combined, reducing consensus accuracy.

Case Study: Multi-Agent Human-Like Reasoning Behaviors

MAD-Fact agents exhibit nuanced behavior akin to human debate dynamics: adherence to initial beliefs (Figure 11), peer correction (Figure 12), and self-reflection leading to consensus revision (Figure 13).

Figure 11: Case study of agents maintaining individual positions across debate rounds.

Figure 12: Case study where agents collaboratively identify and correct erroneous peer judgments.

Figure 13: Agents engage in recursive self-reflection, shifting consensus after further deliberation.

These behaviors underpin the system's improved reliability compared with rigid single-agent methodologies.

Implications and Future Directions

The MAD-Fact framework robustly addresses the central challenges in long-form LLM factuality assessment: dataset scarcity (especially in Chinese), systematic bias in single-verifier pipelines, and the uniform weighting of atomic claims. The introduction of debate protocols, claim importance weighting, and multi-agent epistemic diversity produces metrics closely aligned with human judgment. This has practical consequences for model selection, deployment, and optimization in sensitive applications, particularly where errors on core claims have outsized impacts.

Further development is warranted for real-world dataset grounding in high-risk domains, optimization of debate efficiency (reducing token/cost overhead), and mitigations against communication hallucinations leading to premature consensus on erroneous claims. Integration of confidence estimations, adversarial training, and adaptive agent weighting in debates could extend robustness.

Conclusion

This work defines a structured framework for long-form factuality evaluation, integrating a new benchmark resource (LongHalluQA), a multi-agent debate system (MAD-Fact), and claim importance-weighted metrics. Empirical analyses reveal robust performance gains, with confirmatory trends in model scaling and cultural specialization. The methodology outlined herein provides a reference foundation for future research and system development targeting reliable long-form LLM assessment, with immediate relevance for high-stakes deployment contexts.


Explain it Like I'm 14

Overview

This paper is about making sure long answers from LLMs are factually correct. When AIs write long pieces—like essays or reports—they can mix correct and incorrect information, or drift into “hallucinations” (confident but false statements). The authors propose new tools and tests to check long answers more fairly and carefully, especially in Chinese. They introduce:

  • a new Chinese long-answer dataset called LongHalluQA,
  • a multi-agent debate system called MAD-Fact (like a team debate to verify facts),
  • and “weighted” scoring that treats important facts as more valuable than minor details.

Objectives

In simple terms, the paper aims to:

  • Build a high-quality Chinese dataset for testing the factual accuracy of long AI answers.
  • Design a better checker that uses multiple AIs debating together, not just one model making judgments alone.
  • Score answers in a smarter way, giving more weight to critical facts than to fun, extra details.
  • Compare different AI models to see which ones are more reliable on long, factual writing.

How They Did It

1) Building the LongHalluQA dataset

Think of this like creating a tough test for AIs that write in Chinese:

  • Step 1: Gather trusted knowledge. They used web search to collect reliable facts related to existing questions, cleaned the results, and stored them in a structured “knowledge base.”
  • Step 2: Expand questions into long-form tasks. They turned each question into several connected sub-questions that together demand a long, fact-rich answer (like turning “Who is Li Bai?” into “Describe Li Bai’s major works, style, and influence.”).
  • Step 3: Human review. Trained reviewers checked the questions and content, removing unclear or possibly misleading items. The final dataset has 2,746 long-form Chinese samples across 7 topics (e.g., culture, science, society). On average, answers became about 9.4 times longer than in the original short datasets, creating a stronger test for long-form writing.

2) MAD-Fact: a multi-agent debate system

Imagine a team of specialists checking an essay:

  • Clerk: Breaks the long AI answer into small, checkable “atomic claims.” Think of atomic claims as Lego bricks of truth—each is a single, clear statement you can verify.
  • Jury: Several AI “roles” (like Critic, Scientist, News Author) debate each atomic claim. They can look things up (using search tools), explain their reasoning, and challenge each other.
  • Judge: Collects the jury’s votes and explanations, then decides the final TRUE/FALSE for each claim. The overall factual score is computed from these decisions.

This setup avoids relying on one model’s opinion. Instead, multiple agents cross-check each other, reducing bias and catching more mistakes.
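
For intuition, turning the judge's decisions into an overall score could look like this (an unweighted sketch; the paper's actual metrics additionally weight each claim by importance):

```python
def factual_score(decisions):
    """Share of atomic claims the judge marked TRUE.
    decisions: list of booleans, one per atomic claim."""
    return sum(decisions) / len(decisions) if decisions else 0.0
```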

They also tried different debate styles:

  • Autonomous: Agents choose whether to search first, then discuss.
  • Mandatory evidence: Everyone must search for evidence before speaking (more careful, less overconfidence).
  • Dynamic: If agents agree early, stop to save time; if not, force more searching to resolve conflicts.
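
The "Dynamic" style above might be sketched like this. Only the control flow is the point here; the agent and search interfaces are illustrative assumptions.

```python
def dynamic_debate(claim, agents, search, max_rounds=3):
    """Dynamic debate sketch: agents first vote from prior knowledge;
    if they disagree, evidence retrieval is forced and another round
    runs, until consensus or the round limit is reached."""
    votes = [agent(claim, None) for agent in agents]   # round 1: no evidence
    for _ in range(max_rounds - 1):
        if len(set(votes)) == 1:                       # early consensus: stop
            break
        evidence = search(claim)                       # disagreement: must search
        votes = [agent(claim, evidence) for agent in agents]
    return votes.count(True) > len(votes) / 2          # majority verdict
```

Stopping early on unanimous agreement saves retrieval and token cost, while forced searching on disagreement guards against agents settling conflicts from prior knowledge alone.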

3) Weighted scoring: treating important facts as more important

Not all facts are equal. A central fact (like “The Zhuang are the largest ethnic minority in China”) matters more than a side note (like “Zhuang brocade is a traditional textile”).

To capture this, they:

  • Built a “pyramid” of importance. They asked several strong models for reference answers, broke those into atomic claims, and counted which facts appeared most often across references. The more often a fact appears, the higher its layer in the pyramid and the greater its weight.
  • Computed weighted precision and recall. This means an answer that gets the key facts right scores higher than one that only gets trivia right. Their weighted metric strongly matched human judgments (r = 0.701, p = 0.036), which suggests it reflects what people care about in long answers.
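
The frequency-based pyramid could be built roughly as follows (a simplified sketch: real atomic claims would need semantic matching rather than exact string equality):

```python
from collections import Counter

def build_pyramid(reference_claim_sets):
    """Weight each atomic claim by the number of expert reference
    answers that mention it: claims cited by more references land in
    higher tiers and carry larger weights."""
    counts = Counter()
    for claims in reference_claim_sets:
        counts.update(set(claims))   # count each claim once per reference
    return dict(counts)
```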

Main Findings

  • MAD-Fact (the debate system) consistently outperforms strong baselines like SAFE and FIRE on several long-form factual benchmarks. Using multiple agents and structured debates improves both precision (how often claimed facts are truly correct) and recall (how many relevant true facts are actually covered).
  • The new weighted metrics (based on fact importance) correlate well with human ratings, meaning they better capture what makes an answer “good” in the real world.
  • Bigger LLMs tend to be more factual overall on long answers.
  • Chinese-focused models (trained heavily on Chinese data) perform better on Chinese tasks.
  • The LongHalluQA dataset fills a major gap by providing a large, high-quality Chinese benchmark for long-form factual checking.

Why this matters:

  • Long answers are common in serious areas like medicine, law, and education. Measuring factuality correctly helps avoid harmful mistakes.

Implications and Impact

  • Safer AI in high-stakes fields: Hospitals, courts, and schools can use systems like MAD-Fact to vet long AI-generated texts before trusting them.
  • Better AI training and evaluation: Developers can use LongHalluQA and weighted metrics to train and test models in a way that focuses on truly important facts.
  • Smarter model selection: Teams can pick models that are better suited for long-form factual tasks, especially in specific languages like Chinese.
  • Future extensions: The multi-agent debate idea could be adapted to other languages and even multimodal content (text + images), making factual checking more robust.

In short, the paper provides practical tools and a strong framework to check long AI answers more fairly and accurately, helping reduce “hallucinations” and improve trust in AI’s output.
