Swapped-Reference QA Framework
- The paper introduces a systematic method that swaps references to detect and measure reliability gaps in LLM-based QA evaluation.
- The framework employs controlled swapping of gold references with conflicting entities to assess whether judges adhere to the provided context or default to internal knowledge.
- Empirical results show significant variations in Reference-Polarity Accuracy Gap (RPAG) across different model variants and dataset types, highlighting critical reliability issues.
The swapped-reference QA framework is a controlled protocol designed to diagnose and quantify reference-belief conflicts in LLM-based judges used for question answering (QA) and similar evaluation scenarios. It systematically alters the gold reference used during QA evaluation—by replacing it with an entity conflicting with the original—to probe whether an automatic judge adheres faithfully to the provided reference or instead defaults to parametric knowledge encoded in its weights. This methodology is foundational for identifying and measuring a critical failure mode in automatic evaluation: unreliable scoring when the candidate answer and provided reference disagree due to the judge’s internal knowledge (Lee et al., 12 Jan 2026).
1. Formal Definition and Mathematical Notation
The swapped-reference QA framework operates over a dataset $\mathcal{D} = \{(q_i, r_i^{\text{orig}})\}_{i=1}^{N}$ of question–reference answer pairs, where $q_i$ is the question and $r_i^{\text{orig}}$ is the original (gold-standard) answer. For each instance, a “swapped” reference $r_i^{\text{swap}}$ is introduced such that $r_i^{\text{swap}} \neq r_i^{\text{orig}}$ and typically conflicts with the original on factual grounds. For every reference $r \in \{r_i^{\text{orig}}, r_i^{\text{swap}}\}$, a candidate answer $a^{(r)}$ is generated to be semantically aligned with the selected reference.
Automatic judge verdicts are defined as $J(q, r, a)$, returning one of $\{\text{Correct}, \text{Incorrect}\}$. The ground-truth verdict $y$ is “Correct” for reference-aligned candidates $a^{(r)}$ and “Incorrect” for candidates aligned with the conflicting reference. An alignment indicator is introduced:

$$\delta(q, r, a) = \mathbb{1}\big[J(q, r, a) = y\big].$$
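In code, the judge interface and alignment indicator can be expressed as follows. This is a minimal Python sketch; the `Triplet` container and `Judge` callable are illustrative names, not artifacts of the paper:

```python
from dataclasses import dataclass
from typing import Callable, Literal

Verdict = Literal["Correct", "Incorrect"]
# A judge maps (question, reference, candidate) to a verdict.
Judge = Callable[[str, str, str], Verdict]


@dataclass(frozen=True)
class Triplet:
    question: str          # q
    reference: str         # r, either the original or the swapped reference
    candidate: str         # a, a long-form answer aligned with one reference
    gold_verdict: Verdict  # "Correct" iff the candidate aligns with `reference`


def alignment_indicator(judge: Judge, t: Triplet) -> int:
    """delta(q, r, a): 1 if the judge's verdict matches ground truth, else 0."""
    return int(judge(t.question, t.reference, t.candidate) == t.gold_verdict)
```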
2. Dataset Construction and Swapping Strategies
Construction of the swapped-reference dataset involves four main stages:
- Named-Entity Recognition & Type Labeling: Each original reference $r^{\text{orig}}$ is processed with an LLM-based NER pipeline using spaCy labels to assign an entity type (PERSON, GPE, etc.), with manual quality checks on a subset of the dataset.
- Reference Swapping: Methods include:
- Type-Preserving Swap (TP): Replace $r^{\text{orig}}$ with a different entity of the same type.
- Type-Changing Swap (TC): Replace $r^{\text{orig}}$ with an entity of a different type.
- Popularity-High/Low Swap: Swap only among PERSON entities based on Wikipedia pageviews, creating high-popularity or low-popularity scenarios.
- Evaluator-Knowledge Swap (EK): For each judge, set $r^{\text{swap}}$ to the model’s top prediction under vanilla QA prompting, forcing the reference to align with the judge’s internal beliefs.
- Long-Form Candidate Generation: Each candidate $a^{(r)}$ is crafted by prompting an LLM (GPT-4o) to fluently answer $q$ with $r$ as the presumed ground truth. $a^{(r)}$ typically consists of several sentences, while the reference $r$ is designed as a concise, single-sentence answer.
- Meta-Evaluation Triplets: Four triplets per instance: $(q, r^{\text{orig}}, a^{\text{orig}})$ labeled Correct, $(q, r^{\text{orig}}, a^{\text{swap}})$ Incorrect, $(q, r^{\text{swap}}, a^{\text{orig}})$ Incorrect, and $(q, r^{\text{swap}}, a^{\text{swap}})$ labeled Correct. These form the basis for scoring reliability (see the construction sketch below).
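The construction stages above can be sketched as follows. This is a hypothetical implementation: `generate_candidate` and `ask_judge_model` stand in for the LLM calls (GPT-4o candidate generation and vanilla QA prompting of the judge) and are not the paper’s actual code:

```python
import random
from typing import Callable


def type_preserving_swap(r_orig: str, entity_type: str,
                         entity_pool: dict[str, list[str]]) -> str:
    """TP swap: draw a different entity of the same NER type from a pool."""
    options = [e for e in entity_pool[entity_type] if e != r_orig]
    return random.choice(options)


def evaluator_knowledge_swap(question: str,
                             ask_judge_model: Callable[[str], str]) -> str:
    """EK swap: the judge's own top answer under vanilla QA prompting becomes
    the reference, so the reference agrees with its parametric beliefs."""
    prompt = f"Answer with a short entity only.\nQ: {question}\nA:"
    return ask_judge_model(prompt).strip()


def make_meta_eval_triplets(q: str, r_orig: str, r_swap: str,
                            generate_candidate: Callable[[str, str], str]):
    """Build the four (question, reference, candidate, gold verdict) triplets."""
    a_orig = generate_candidate(q, r_orig)  # long-form answer asserting r_orig
    a_swap = generate_candidate(q, r_swap)  # long-form answer asserting r_swap
    return [
        (q, r_orig, a_orig, "Correct"),
        (q, r_orig, a_swap, "Incorrect"),
        (q, r_swap, a_orig, "Incorrect"),
        (q, r_swap, a_swap, "Correct"),
    ]
```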
3. Metrics and Reliability Measurement
Primary evaluation metrics are:
- Accuracy Under Reference Conditions:

$$\text{ACC}^{(t)} = \frac{1}{|\mathcal{T}_t|} \sum_{(q, r, a) \in \mathcal{T}_t} \delta(q, r, a),$$

where $t \in \{\text{orig}, \text{swap}\}$, $\mathcal{T}_t$ is the set of meta-evaluation triplets whose reference has polarity $t$, and $\text{swap}$ is the alternate reference type.
- Reference-Polarity Accuracy Gap (RPAG):

$$\text{RPAG} = \text{ACC}^{\text{orig}} - \text{ACC}^{\text{swap}}.$$

RPAG quantifies the degradation in judge reliability under swapped conditions.
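Given judge verdicts on the meta-evaluation triplets, both metrics reduce to a few lines. A minimal sketch, reusing the (question, reference, candidate, gold verdict) tuples from the construction sketch above:

```python
def accuracy(judge, triplets) -> float:
    """Mean alignment indicator over the triplets of one reference polarity."""
    hits = sum(judge(q, r, a) == gold for (q, r, a, gold) in triplets)
    return hits / len(triplets)


def rpag(judge, orig_triplets, swap_triplets) -> float:
    """Reference-Polarity Accuracy Gap: ACC^orig - ACC^swap.
    Positive values mean the judge is less reliable under swapped references."""
    return accuracy(judge, orig_triplets) - accuracy(judge, swap_triplets)
```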
Supplementary approaches sometimes include correlation with human scores, self-consistency variance, and adversarial reliability drop rates, though these are not the focus of the swapped-reference protocol.
4. Empirical Observations and Failure Modes
Comprehensive evaluation of thirteen LLM judge variants (GPT-4o, GPT-4.1, GPT-5, Qwen-2.5/3, Llama-3.1) reveals substantial accuracy drops upon swapping references. Notably, even GPT-5 suffers RPAGs of up to 12 points ($\text{ACC}^{\text{orig}}$ vs. $\text{ACC}^{\text{swap}}$) under type-changing swaps. Type-changing swaps induce larger reliability gaps than type-preserving swaps.
The magnitude of RPAG varies by dataset: scientific-fact QA (SciQ) yields the steepest reliability decline (up to −60 points), while entity-heavy datasets (PopQA) range from −13 to −49 points. Increasing model scale does not mitigate the failure; some model families become more vulnerable as they grow.
5. Root Causes: Parametric Knowledge Conflict
The core failure arises when the parametric knowledge stored in the judge model overrides explicit adherence to the provided swapped reference. If the swapped reference matches the judge’s own top prediction (Evaluator-Knowledge Swap), reliability is restored and RPAG collapses to near zero. However, when the reference conflicts with established internal beliefs, the judge disregards it and reverts to its parametric memory, irrespective of explicit instructions.
Pre-existing entity popularity aggravates the effect: high-popularity swaps (e.g., Queen Elizabeth II) trigger larger RPAGs. On datasets probing never-changing versus fast-changing facts (FreshQA), models override context more readily for canonical knowledge, maintaining higher RPAG for “static” facts; this suggests that reference override is modulated by memory strength.
A concrete illustration:
Question: “Where did the tea come from in the Boston Tea Party?”
- $r^{\text{orig}}$: England; candidate $a^{\text{orig}}$: “…came from England.”
- $r^{\text{swap}}$: Paris; candidate $a^{\text{swap}}$: “…came from Paris.”

GPT-4o returns “Incorrect” for the triplet $(q, r^{\text{swap}}, a^{\text{swap}})$ even though the candidate perfectly matches the swapped reference.
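In the notation above, the failing case corresponds to the triplet below; the gold verdict shows what a reference-faithful judge must return (an illustrative encoding, not the paper’s code):

```python
q = "Where did the tea come from in the Boston Tea Party?"
r_swap = "Paris"
a_swap = "…came from Paris."  # long-form candidate aligned with the swap

# Ground truth for this triplet is "Correct": the candidate matches the
# provided reference. GPT-4o's "Incorrect" verdict is the failure mode.
triplet = (q, r_swap, a_swap, "Correct")
```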
6. Mitigation Strategies and Limitations
Prompt-engineering approaches fail to close the RPAG: standard instructions, a direct “must trust the given gold target” instruction, chain-of-thought (CoT), self-consistency with majority voting, and hybrids all leave the gap open. Direct prompting partially reduces the gap (e.g., from 37 to 22 points on NQ-Open type-preserving swaps), but substantial unreliability persists. CoT prompts often exacerbate the effect, as the model rationalizes internally and ignores the explicit reference. Majority voting strengthens parametric bias rather than enforcing rule adherence.
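To make the strategies concrete, the variants can be rendered as prompt templates plus a majority-vote wrapper. The wording here is illustrative, not the paper’s exact prompts:

```python
from collections import Counter

STANDARD = (
    "You are grading a QA answer.\n"
    "Question: {q}\nGold answer: {r}\nCandidate: {a}\n"
    "Reply with Correct or Incorrect."
)
# Direct strategy: append an explicit adherence instruction.
DIRECT = STANDARD + (
    "\nYou must trust the given gold answer, even if it conflicts "
    "with your own knowledge."
)
# CoT strategy: elicit reasoning before the verdict.
COT = STANDARD.replace("Reply with", "Think step by step, then reply with")


def self_consistency(judge_once, q: str, r: str, a: str, n: int = 5) -> str:
    """Self-consistency: majority vote over n sampled verdicts. The paper finds
    this reinforces parametric bias rather than enforcing reference adherence."""
    votes = Counter(judge_once(q, r, a) for _ in range(n))
    return votes.most_common(1)[0][0]
```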
7. Proposed Advancements and Future Research
Recommendations for improving judge fidelity under swapped-reference settings include:
- Deploying constrained decoding or checkpoint tokens that prohibit factual assertions outside the context of the given reference.
- Implementing two-stage pipelines: explicit context verification followed by strict string-matching evaluation (see the sketch after this list).
- Flagging high-RPAG cases for hybrid human–LLM review, blending statistical screening with expert judgment.
- Extending diagnostic frameworks beyond QA to areas like summarization, fact verification, and dialogue evaluation.
- Fine-tuning judge models on synthetic swapped-reference corpora to harden “rule-following” even under reference-belief conflict conditions.
- Investigating attention and embedding patterns to elucidate mechanisms of context suppression during reference override.
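As an example of the second recommendation, a two-stage pipeline could look like the sketch below. `restate_reference` is an assumed LLM call, and the `"Flagged"` outcome is an assumption standing in for the hybrid human–LLM review route; neither is taken from the paper:

```python
import re


def _norm(s: str) -> str:
    """Lowercase and collapse non-word characters for robust containment checks."""
    return re.sub(r"\W+", " ", s).strip().lower()


def two_stage_verdict(question: str, reference: str, candidate: str,
                      restate_reference) -> str:
    # Stage 1: context verification. The judge must echo the reference it was
    # given; if the echo drifts, the item is flagged for hybrid review instead
    # of being scored.
    echoed = restate_reference(question, reference)
    if _norm(reference) not in _norm(echoed):
        return "Flagged"
    # Stage 2: strict string matching. A deterministic containment check
    # replaces free-form judging, so parametric knowledge cannot override
    # the provided reference.
    return "Correct" if _norm(reference) in _norm(candidate) else "Incorrect"
```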
A plausible implication is that robust automatic evaluation in reference-conditioned tasks requires architectures or protocols capable of absolute contextual adherence under adversarially conflicting scenarios, a property not naturally emergent from current LLM pretraining pipelines (Lee et al., 12 Jan 2026).