ConflictingQA: Evaluating Conflicting Evidence in QA
- ConflictingQA is a benchmark dataset addressing contradictory evidence in real-world QA and summarization tasks.
- It employs a multi-stage process including query selection, web retrieval, and stance annotation to label conflicts.
- Key evaluation metrics include exact match, F1 scores, and citation accuracy when models handle conflicting responses.
The ConflictingQA dataset is a benchmark for the evaluation of LLMs in question answering (QA) and summarization tasks characterized by real-world knowledge conflicts. Its construction, usage, and evolution have made it central to research on retrieval-augmented generation (RAG) robustness, evidence sensitivity, and conflict detection, with multiple notable instantiations and adaptations across the literature (Wan et al., 2024, Balepur et al., 1 Feb 2025, Liu et al., 2024).
1. Motivation and Core Definition
ConflictingQA was introduced to address a critical limitation of standard QA and summarization benchmarks: the assumption that a question's supporting context is internally consistent, leading to a single, unambiguous answer. In practice, information retrieval from the Web, Wikipedia, and other large corpora frequently surfaces contradictory passages. This reality is especially pronounced for contentious, ambiguous, or debatable questions, as well as so-called "unambiguous" factual questions that nonetheless yield conflicting evidence in real-world retrieval (Wan et al., 2024, Liu et al., 2024). The dataset thus operationalizes the study of:
- How QA models select or aggregate answers when confronted with contradictory evidence.
- The properties of text that influence model selection between conflicting passages.
- Whether models can identify, represent, or summarize both sides of a knowledge conflict.
2. Dataset Construction and Annotation Methodology
The construction of ConflictingQA-type datasets typically follows a multi-stage process, comprising query selection, web or corpus retrieval, evidence labeling for stance/support, and manual or automated annotation of conflicts.
Query and Document Collection
- Query Source: Controversial or ambiguous yes/no questions are drawn from web search logs, pre-existing QA datasets (e.g., AmbigQA), or constructed explicitly to ensure the presence of conflicting perspectives (Wan et al., 2024, Balepur et al., 1 Feb 2025, Liu et al., 2024).
- Retrieval: For each query, multiple web pages or text snippets (mean ≈10 per query) are retrieved using commercial or academic search engines. Relevance is enforced via non-stopword overlap with the query and manual filtering.
Stance and Conflict Annotation
- Stance Assignment: Each document is labeled as supporting ("Yes"), refuting ("No"), or, rarely, "Irrelevant." LLMs (e.g., GPT-4, Claude-v1-Instant) or trained annotators assign stances based on whether a document supports or contradicts the query.
- Conflict Identification: Annotators identify multiple plausible answers to the same question within the retrieved evidence set. For each answer, the supporting contexts are mapped, and annotators label the overall instance as conflicting or not, based on the presence of mutually exclusive answers (Liu et al., 2024).
- Correct Answer and Rationale: For datasets focusing on single-answer accuracy, annotators select the canonical answer (usually "majority vote" or "most trustworthy source") and provide an explanation, often with multi-label categorical choices (e.g., "majority vote", "trustworthiness", "recency") and a free-text rationale.
Curation and Filtering
- Only queries for which both "yes" and "no" (or multiple distinct) stances were found among credible sources are retained, ensuring a genuine conflicting evidence base (Balepur et al., 1 Feb 2025).
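The stance-labeling and curation steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `label_stance` is a toy keyword heuristic standing in for the LLM or human annotator call, and all names are assumptions for the example.

```python
from collections import Counter

def label_stance(document: str, query: str) -> str:
    """Toy stand-in for the LLM/annotator stance call.

    Returns "yes", "no", or "irrelevant". A real pipeline would
    prompt an LLM (e.g., GPT-4) with the query/document pair.
    """
    text = document.lower()
    if "not" in text or "no evidence" in text:
        return "no"
    if any(tok in text for tok in query.lower().split()):
        return "yes"
    return "irrelevant"

def curate(queries: dict[str, list[str]]) -> dict[str, list[tuple[str, str]]]:
    """Keep only queries whose retrieved documents contain BOTH stances,
    mirroring the curation rule described above."""
    kept = {}
    for query, docs in queries.items():
        labeled = [(doc, label_stance(doc, query)) for doc in docs]
        stances = Counter(stance for _, stance in labeled)
        if stances["yes"] > 0 and stances["no"] > 0:
            kept[query] = labeled
    return kept
```

The filter discards single-stance queries, so every retained instance is guaranteed to carry a genuine conflict between at least two documents.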
3. Dataset Structure and Composition
The canonical dataset contains:
- 136–290 queries (depending on the variant).
- For each query, ∼10 evidence documents (e.g., web snippets or document paragraphs), with explicit support/refute (stance) labels.
- Binary or multi-label conflict indicators: instances are marked as "conflicting" if more than one distinct, plausible answer emerges from the context set (Liu et al., 2024).
- Document-to-answer mapping: for each candidate answer, the supporting context(s) are indexed.
- Annotation features: explanation reasons (categorical), free-form explanations, and in some variants, summary-level or paragraph-level stance/citation attribution (Wan et al., 2024, Balepur et al., 1 Feb 2025).
Statistical properties in specific instantiations:

| Property | Value |
|----------|-------|
| Total queries | 136 (original), 290 (debate-QFS) |
| Mean docs per query | ~10.5 |
| Binary stance labeling | Yes / No, balanced |
| Conflict rate | 25–50% per instance pool |
| Context mean length | 1–2 sentences per snippet |
| Inter-annotator agreement | Cohen's κ ≈ 0.72–0.90 on stance/conflict |
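One way to picture the composition described above is as a per-query record carrying stance-labeled documents, a document-to-answer mapping, and a derived conflict flag. This is an illustrative schema only; the field names are assumptions, not the dataset's actual serialization format.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceDoc:
    text: str
    stance: str  # "yes" | "no" | "irrelevant"

@dataclass
class ConflictingQARecord:
    query: str
    documents: list[EvidenceDoc]
    # Document-to-answer mapping: candidate answer -> indices of supporting docs.
    answers: dict[str, list[int]]
    # Categorical explanation reasons, e.g. "majority vote", "recency".
    explanation_reasons: list[str] = field(default_factory=list)
    rationale: str = ""  # free-form annotator explanation

    @property
    def is_conflicting(self) -> bool:
        # Conflicting when more than one distinct answer has support.
        return len([a for a, docs in self.answers.items() if docs]) > 1
```

Deriving the conflict indicator from the answer mapping, rather than storing it separately, keeps the binary label consistent with the evidence set by construction.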
4. Evaluation Objectives and Metrics
ConflictingQA supports a range of evaluation scenarios, from standard answer extraction to debatable query summarization.
QA and Summarization Tasks
- Exact Match (EM) and F1 on conflicting and non-conflicting subsets, measuring model ability to recover all plausible answers in the presence of conflicting evidence (Liu et al., 2024).
- Citation coverage, balance, and faithfulness (for summarization): percentage of cited documents in generated summaries, KL-divergence from ideal balanced (½,½) stances, and KL to original stance prior (Balepur et al., 1 Feb 2025).
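The two metric families above can be sketched concretely. Below is a standard token-level F1 (as used in extractive QA evaluation) and a stance-balance score computed as KL divergence of the cited-stance distribution from the ideal (½, ½); both are generic reconstructions of the metrics named above, not the benchmark's official scoring code.

```python
import math
from collections import Counter

def f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer string."""
    pred, ref = prediction.split(), gold.split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def stance_balance_kl(cited_stances: list[str]) -> float:
    """KL divergence of the cited-stance distribution from (1/2, 1/2).

    0.0 means perfectly balanced citations; larger values mean the
    summary over-cites one side of the conflict.
    """
    counts = Counter(cited_stances)
    total = counts["yes"] + counts["no"]
    kl = 0.0
    for stance in ("yes", "no"):
        p = counts[stance] / total
        if p > 0:
            kl += p * math.log(p / 0.5)
    return kl
```

A perfectly balanced summary (equal "yes" and "no" citations) scores 0.0; a one-sided summary scores log 2 ≈ 0.69, the maximum for the binary case.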
Sensitivity Analysis
- Pairwise Sensitivity: Given a "Yes" and a "No" snippet, the probability that the model votes for the intended stance, measuring which features of the document control model selection (Wan et al., 2024).
- Feature Attribution: Logistic regression coefficients (β) quantify the relative importance of lexical overlap, sentiment, scientific reference presence, authority cues, etc.
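To make the feature-attribution setup concrete, here is a minimal logistic regression fit by plain gradient descent on toy pairwise features, comparing coefficient magnitudes |β_j|. The features and data are invented for illustration; the original analysis used its own feature set and fitting procedure.

```python
import math

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Plain gradient-descent logistic regression (no intercept).

    Returns the coefficient vector beta; |beta[j]| indicates how
    strongly feature j drives the model's stance choice.
    """
    n_feat = len(X[0])
    beta = [0.0] * n_feat
    for _ in range(steps):
        grad = [0.0] * n_feat
        for xi, yi in zip(X, y):
            z = sum(b * x for b, x in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(chosen)
            for j in range(n_feat):
                grad[j] += (p - yi) * xi[j]
        beta = [b - lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

# Toy pairwise data: feature 0 = lexical overlap with query,
# feature 1 = authority cue; labels track overlap only.
X = [[0.9, 1.0], [0.8, 0.0], [0.2, 1.0], [0.1, 0.0]]
y = [1, 1, 0, 0]
beta = fit_logistic(X, y)
# Here |beta[0]| (overlap) dominates |beta[1]| (authority cue),
# mirroring the relevance-dominated selection finding below.
```

In practice one would use a regularized solver (e.g., scikit-learn) and standardized features so that coefficient magnitudes are directly comparable.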
5. Key Empirical Findings
A series of analyses have revealed model behaviors and deficiencies when confronted with conflicting evidence:
- Relevance-Dominated Selection: LLMs overwhelmingly rely on lexical or n-gram overlap with the query in making yes/no determinations; stylistic markers such as authority, citations, or tone are largely ignored (|β_j| for overlap dominates) (Wan et al., 2024).
- Balance and Faithfulness Limitations: In summarization, models often over-represent the stance present in the majority of source documents and fail to balance perspectives, unless specifically prompted or architecturally constrained (Balepur et al., 1 Feb 2025).
- Partial Conflict Handling: On conflicting instances, even advanced LLMs (GPT-4o, Claude-3, Phi-3) struggle: EM and F1 drop by 10–20 points compared to non-conflicting instances; performance is lowest for "how" questions (higher conflict rate) (Liu et al., 2024).
- Explanations Aid Performance: Fine-tuning models with natural-language explanations alongside context improves answer accuracy by up to 5.2 percentage points EM and 4.7 F1 (Liu et al., 2024).
- Citation and Evidence Tracking: Models commonly fail to maintain alignment between answers and supporting evidence ("citation omission"), especially in multi-document or multi-perspective cases.
6. Use Cases and Dataset Variants
ConflictingQA underpins multiple research directions:
- Debatable Query-Focused Summarization (DQFS): Used as evaluation for summarizers that must synthesize both "yes" and "no" evidence, maximizing document coverage and citation balance (Balepur et al., 1 Feb 2025).
- Evidence Sensitivity and Retrieval Diagnostics: Enables controlled pairwise and counterfactual perturbation studies to determine what text features drive model preference (Wan et al., 2024).
- Open-Domain QA with Conflicting Contexts (QACC): Adapted to measure the inherent prevalence of conflict in real-world web retrieval; shows that 25% of unambiguous questions yield conflicting answers among top-10 Google snippets (Liu et al., 2024).
7. Limitations and Prospective Directions
Known limitations and future avenues include:
- Single-Annotator and Binary Labels: Many variants rely on single annotators and binary conflict labels; future expansions may include multi-rater consensus, fine-grained or severity-level conflict annotations.
- Context Length and Retrieval Bias: The use of short web snippets or ranking bias from Google limits available evidence and diversity. Incorporating full passages or alternative retrieval systems (DPR, BEIR) is recommended.
- Generalization to Ambiguity and Multilinguality: Most current datasets target English and treat only unambiguous questions; extension to genuinely ambiguous, multi-perspective, or cross-lingual contexts remains open (Liu et al., 2024).
- Ethical and Application Considerations: Users are cautioned that true factuality cannot be guaranteed, and that balancing "both sides" is inappropriate for factual misinformation or pseudoscientific queries (Balepur et al., 1 Feb 2025).
References
- "What Evidence Do LLMs Find Convincing?" (Wan et al., 2024)
- "MODS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections" (Balepur et al., 1 Feb 2025)
- "Open Domain Question Answering with Conflicting Contexts" (Liu et al., 2024)