ConflictingQA: Evaluating Conflicting Evidence in QA
- ConflictingQA is a benchmark dataset addressing contradictory evidence in real-world QA and summarization tasks.
- It employs a multi-stage process including query selection, web retrieval, and stance annotation to label conflicts.
- Key evaluation metrics include exact match, F1 scores, and citation accuracy when models handle conflicting responses.
The ConflictingQA dataset is a benchmark for the evaluation of LLMs in question answering (QA) and summarization tasks characterized by real-world knowledge conflicts. Its construction, usage, and evolution have made it central to research on retrieval-augmented generation (RAG) robustness, evidence sensitivity, and conflict detection, with multiple notable instantiations and adaptations across the literature (Wan et al., 2024, Balepur et al., 1 Feb 2025, Liu et al., 2024).
1. Motivation and Core Definition
ConflictingQA was introduced to address a critical limitation of standard QA and summarization benchmarks: the assumption that a question's supporting context is internally consistent, leading to a single, unambiguous answer. In practice, information retrieval from the Web, Wikipedia, and other large corpora frequently surfaces contradictory passages. This reality is especially pronounced for contentious, ambiguous, or debatable questions, as well as so-called "unambiguous" factual questions that nonetheless yield conflicting evidence in real-world retrieval (Wan et al., 2024, Liu et al., 2024). The dataset thus operationalizes the study of:
- How QA models select or aggregate answers when confronted with contradictory evidence.
- The properties of text that influence model selection between conflicting passages.
- Whether models can identify, represent, or summarize both sides of a knowledge conflict.
2. Dataset Construction and Annotation Methodology
The construction of ConflictingQA-type datasets typically follows a multi-stage process, comprising query selection, web or corpus retrieval, evidence labeling for stance/support, and manual or automated annotation of conflicts.
Query and Document Collection
- Query Source: Controversial or ambiguous yes/no questions are drawn from web search logs, pre-existing QA datasets (e.g., AmbigQA), or constructed explicitly to ensure the presence of conflicting perspectives (Wan et al., 2024, Balepur et al., 1 Feb 2025, Liu et al., 2024).
- Retrieval: For each query, multiple web pages or text snippets (mean ≈10 per query) are retrieved using commercial or academic search engines. Relevance is enforced via non-stopword overlap with the query and manual filtering.
Stance and Conflict Annotation
- Stance Assignment: Each document is labeled as supporting ("Yes"), refuting ("No"), or, rarely, "Irrelevant." LLMs (e.g., GPT-4, Claude-v1-Instant) or trained annotators assign stances based on whether a document supports or contradicts the query.
- Conflict Identification: Annotators identify multiple plausible answers to the same question within the retrieved evidence set. For each answer, the supporting contexts are mapped, and annotators label the overall instance as conflicting or not, based on the presence of mutually exclusive answers (Liu et al., 2024).
- Correct Answer and Rationale: For datasets focusing on single-answer accuracy, annotators select the canonical answer (usually "majority vote" or "most trustworthy source") and provide an explanation, often with multi-label categorical choices (e.g., "majority vote", "trustworthiness", "recency") and a free-text rationale.
Curation and Filtering
- Only queries for which both "yes" and "no" (or multiple distinct) stances were found among credible sources are retained, ensuring a genuine conflicting evidence base (Balepur et al., 1 Feb 2025).
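The stance-labeling and curation steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `label_stance` is a toy keyword heuristic standing in for the LLM or human annotator call, and all names are assumptions for the example.

```python
from collections import Counter

def label_stance(document: str, query: str) -> str:
    """Toy stand-in for the LLM/annotator stance call.

    Returns "yes", "no", or "irrelevant". A real pipeline would
    prompt an LLM (e.g., GPT-4) with the query/document pair.
    """
    text = document.lower()
    if "not" in text or "no evidence" in text:
        return "no"
    if any(tok in text for tok in query.lower().split()):
        return "yes"
    return "irrelevant"

def curate(queries: dict[str, list[str]]) -> dict[str, list[tuple[str, str]]]:
    """Keep only queries whose retrieved documents contain BOTH stances,
    mirroring the curation rule described above."""
    kept = {}
    for query, docs in queries.items():
        labeled = [(doc, label_stance(doc, query)) for doc in docs]
        stances = Counter(stance for _, stance in labeled)
        if stances["yes"] > 0 and stances["no"] > 0:
            kept[query] = labeled
    return kept
```

The filter discards single-stance queries, so every retained instance is guaranteed to carry a genuine conflict between at least two documents.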
3. Dataset Structure and Composition
The canonical dataset contains:
- 136–290 queries (depending on the variant).
- For each query, ∼10 evidence documents (e.g., web snippets or document paragraphs), with explicit support/refute (stance) labels.
- Binary or multi-label conflict indicators: instances are marked as "conflicting" if more than one distinct, plausible answer emerges from the context set (Liu et al., 2024).
- Document-to-answer mapping: for each candidate answer, the supporting context(s) are indexed.
- Annotation features: explanation reasons (categorical), free-form explanations, and in some variants, summary-level or paragraph-level stance/citation attribution (Wan et al., 2024, Balepur et al., 1 Feb 2025).
Statistical properties in specific instantiations:

| Property | Value |
|----------|-------|
| Total queries | 136 (original), 290 (debate-QFS) |
| Mean docs per query | ~10.5 |
| Binary stance labeling | Yes / No, balanced |
| Conflict rate | 25–50% per instance pool |
| Context mean length | 1–2 sentences per snippet |
| Inter-annotator agreement | Cohen's κ ≈ 0.72–0.90 on stance/conflict |
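One way to picture the composition described above is as a per-query record carrying stance-labeled documents, a document-to-answer mapping, and a derived conflict flag. This is an illustrative schema only; the field names are assumptions, not the dataset's actual serialization format.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceDoc:
    text: str
    stance: str  # "yes" | "no" | "irrelevant"

@dataclass
class ConflictingQARecord:
    query: str
    documents: list[EvidenceDoc]
    # Document-to-answer mapping: candidate answer -> indices of supporting docs.
    answers: dict[str, list[int]]
    # Categorical explanation reasons, e.g. "majority vote", "recency".
    explanation_reasons: list[str] = field(default_factory=list)
    rationale: str = ""  # free-form annotator explanation

    @property
    def is_conflicting(self) -> bool:
        # Conflicting when more than one distinct answer has support.
        return len([a for a, docs in self.answers.items() if docs]) > 1
```

Deriving the conflict indicator from the answer mapping, rather than storing it separately, keeps the binary label consistent with the evidence set by construction.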
4. Evaluation Objectives and Metrics
ConflictingQA supports a range of evaluation scenarios, from standard answer extraction to debatable query summarization.
QA and Summarization Tasks
- Exact Match (EM) and F1 on conflicting and non-conflicting subsets, measuring model ability to recover all plausible answers in the presence of conflicting evidence (Liu et al., 2024).
- Citation coverage, balance, and faithfulness (for summarization): percentage of cited documents in generated summaries, KL-divergence from ideal balanced (½,½) stances, and KL to original stance prior (Balepur et al., 1 Feb 2025).
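The two metric families above can be sketched concretely. Below is a standard token-level F1 (as used in extractive QA evaluation) and a stance-balance score computed as KL divergence of the cited-stance distribution from the ideal (½, ½); both are generic reconstructions of the metrics named above, not the benchmark's official scoring code.

```python
import math
from collections import Counter

def f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer string."""
    pred, ref = prediction.split(), gold.split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def stance_balance_kl(cited_stances: list[str]) -> float:
    """KL divergence of the cited-stance distribution from (1/2, 1/2).

    0.0 means perfectly balanced citations; larger values mean the
    summary over-cites one side of the conflict.
    """
    counts = Counter(cited_stances)
    total = counts["yes"] + counts["no"]
    kl = 0.0
    for stance in ("yes", "no"):
        p = counts[stance] / total
        if p > 0:
            kl += p * math.log(p / 0.5)
    return kl
```

A perfectly balanced summary (equal "yes" and "no" citations) scores 0.0; a one-sided summary scores log 2 ≈ 0.69, the maximum for the binary case.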
Sensitivity Analysis
- Pairwise Sensitivity: Given a "Yes" and a "No" snippet, the probability that the model votes for the intended stance, measuring which features of the document control model selection (Wan et al., 2024).
- Feature Attribution: Logistic regression coefficients (β) quantify the relative importance of lexical overlap, sentiment, scientific reference presence, authority cues, etc.
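To make the feature-attribution setup concrete, here is a minimal logistic regression fit by plain gradient descent on toy pairwise features, comparing coefficient magnitudes |β_j|. The features and data are invented for illustration; the original analysis used its own feature set and fitting procedure.

```python
import math

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Plain gradient-descent logistic regression (no intercept).

    Returns the coefficient vector beta; |beta[j]| indicates how
    strongly feature j drives the model's stance choice.
    """
    n_feat = len(X[0])
    beta = [0.0] * n_feat
    for _ in range(steps):
        grad = [0.0] * n_feat
        for xi, yi in zip(X, y):
            z = sum(b * x for b, x in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(chosen)
            for j in range(n_feat):
                grad[j] += (p - yi) * xi[j]
        beta = [b - lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

# Toy pairwise data: feature 0 = lexical overlap with query,
# feature 1 = authority cue; labels track overlap only.
X = [[0.9, 1.0], [0.8, 0.0], [0.2, 1.0], [0.1, 0.0]]
y = [1, 1, 0, 0]
beta = fit_logistic(X, y)
# Here |beta[0]| (overlap) dominates |beta[1]| (authority cue),
# mirroring the relevance-dominated selection finding below.
```

In practice one would use a regularized solver (e.g., scikit-learn) and standardized features so that coefficient magnitudes are directly comparable.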
5. Key Empirical Findings
A series of analyses have revealed model behaviors and deficiencies when confronted with conflicting evidence:
- Relevance-Dominated Selection: LLMs overwhelmingly rely on lexical or n-gram overlap with the query in making yes/no determinations; stylistic markers such as authority, citations, or tone are largely ignored (|β_j| for overlap dominates) (Wan et al., 2024).
- Balance and Faithfulness Limitations: In summarization, models often over-represent the stance present in the majority of source documents and fail to balance perspectives, unless specifically prompted or architecturally constrained (Balepur et al., 1 Feb 2025).
- Partial Conflict Handling: On conflicting instances, even advanced LLMs (GPT-4o, Claude-3, Phi-3) struggle: EM and F1 drop by 10–20 points compared to non-conflicting instances; performance is lowest for "how" questions (higher conflict rate) (Liu et al., 2024).
- Explanations Aid Performance: Fine-tuning models with natural-language explanations alongside context improves answer accuracy by up to 5.2 percentage points EM and 4.7 F1 (Liu et al., 2024).
- Citation and Evidence Tracking: Models commonly fail to maintain alignment between answers and supporting evidence ("citation omission"), especially in multi-document or multi-perspective cases.
6. Use Cases and Dataset Variants
ConflictingQA underpins multiple research directions:
- Debatable Query-Focused Summarization (DQFS): Used as evaluation for summarizers that must synthesize both "yes" and "no" evidence, maximizing document coverage and citation balance (Balepur et al., 1 Feb 2025).
- Evidence Sensitivity and Retrieval Diagnostics: Enables controlled pairwise and counterfactual perturbation studies to determine what text features drive model preference (Wan et al., 2024).
- Open-Domain QA with Conflicting Contexts (QACC): Adapted to measure the inherent prevalence of conflict in real-world web retrieval; shows that 25% of unambiguous questions yield conflicting answers among top-10 Google snippets (Liu et al., 2024).
7. Limitations and Prospective Directions
Known limitations and future avenues include:
- Single-Annotator and Binary Labels: Many variants rely on single annotators and binary conflict labels; future expansions may include multi-rater consensus, fine-grained or severity-level conflict annotations.
- Context Length and Retrieval Bias: The use of short web snippets or ranking bias from Google limits available evidence and diversity. Incorporating full passages or alternative retrieval systems (DPR, BEIR) is recommended.
- Generalization to Ambiguity and Multilinguality: Most current datasets target English and treat only unambiguous questions; extension to genuinely ambiguous, multi-perspective, or cross-lingual contexts remains open (Liu et al., 2024).
- Ethical and Application Considerations: Users are cautioned that true factuality cannot be guaranteed, and that balancing "both sides" is inappropriate for factual misinformation or pseudoscientific queries (Balepur et al., 1 Feb 2025).
References
- "What Evidence Do LLMs Find Convincing?" (Wan et al., 2024)
- "MODS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections" (Balepur et al., 1 Feb 2025)
- "Open Domain Question Answering with Conflicting Contexts" (Liu et al., 2024)