Swapped-Reference QA Framework
- The paper introduces a systematic method that swaps references to detect and measure reliability gaps in LLM-based QA evaluation.
- The framework employs controlled swapping of gold references with conflicting entities to assess whether judges adhere to the provided context or default to internal knowledge.
- Empirical results show significant variations in Reference-Polarity Accuracy Gap (RPAG) across different model variants and dataset types, highlighting critical reliability issues.
The swapped-reference QA framework is a controlled protocol designed to diagnose and quantify reference-belief conflicts in LLM-based judges used for question answering (QA) and similar evaluation scenarios. It systematically alters the gold reference used during QA evaluation—by replacing it with an entity conflicting with the original—to probe whether an automatic judge adheres faithfully to the provided reference or instead defaults to parametric knowledge encoded in its weights. This methodology is foundational for identifying and measuring a critical failure mode in automatic evaluation: unreliable scoring when the candidate answer and provided reference disagree due to the judge’s internal knowledge (Lee et al., 12 Jan 2026).
1. Formal Definition and Mathematical Notation
The swapped-reference QA framework operates over a dataset $\mathcal{D} = \{(q_i, r_i^{\text{orig}})\}_{i=1}^{N}$ of question–reference answer pairs, where $q_i$ is the question and $r_i^{\text{orig}}$ is the original (gold-standard) answer. For each instance, a “swapped” reference $r_i^{\text{swap}}$ is introduced such that $r_i^{\text{swap}} \neq r_i^{\text{orig}}$ and typically conflicts with the original on factual grounds. For every reference $r \in \{r_i^{\text{orig}}, r_i^{\text{swap}}\}$, a candidate answer $a^{(r)}$ is generated to be semantically aligned with the selected reference.
Automatic judge verdicts are defined as $J(q, r, a)$, returning one of $\{\text{Correct}, \text{Incorrect}\}$. The ground-truth verdict $y$ is “Correct” for reference-aligned candidates $a^{(r)}$ and “Incorrect” for candidates aligned with the conflicting reference. An alignment indicator is introduced:

$$\delta(q, r, a) = \mathbb{1}\big[J(q, r, a) = y\big].$$
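In code, the judge interface and alignment indicator can be expressed as follows. This is a minimal Python sketch; the `Triplet` container and `Judge` callable are illustrative names, not artifacts of the paper:

```python
from dataclasses import dataclass
from typing import Callable, Literal

Verdict = Literal["Correct", "Incorrect"]
# A judge maps (question, reference, candidate) to a verdict.
Judge = Callable[[str, str, str], Verdict]


@dataclass(frozen=True)
class Triplet:
    question: str          # q
    reference: str         # r, either the original or the swapped reference
    candidate: str         # a, a long-form answer aligned with one reference
    gold_verdict: Verdict  # "Correct" iff the candidate aligns with `reference`


def alignment_indicator(judge: Judge, t: Triplet) -> int:
    """delta(q, r, a): 1 if the judge's verdict matches ground truth, else 0."""
    return int(judge(t.question, t.reference, t.candidate) == t.gold_verdict)
```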
2. Dataset Construction and Swapping Strategies
Construction of the swapped-reference dataset involves four main stages:
- Named-Entity Recognition & Type Labeling: Each original reference $r^{\text{orig}}$ is processed with an LLM-based NER pipeline using spaCy labels to assign an entity type (PERSON, GPE, etc.), with manual quality checks on a subset of the dataset.
- Reference Swapping: Methods include:
- Type-Preserving Swap (TP): Replace $r^{\text{orig}}$ with a different entity of the same type.
- Type-Changing Swap (TC): Replace $r^{\text{orig}}$ with an entity of a different type.
- Popularity-High/Low Swap: Swap only among PERSON entities based on Wikipedia pageviews, creating high-popularity or low-popularity scenarios.
- Evaluator-Knowledge Swap (EK): For each judge, set $r^{\text{swap}}$ to the model’s top prediction under vanilla QA prompting, forcing the reference to align with the judge’s internal beliefs.
- Long-Form Candidate Generation: Each candidate $a^{(r)}$ is crafted by prompting an LLM (GPT-4o) to fluently answer $q$ with $r$ as the presumed ground truth. $a^{(r)}$ typically consists of several sentences, while the reference $r$ is designed as a concise, single-sentence answer.
- Meta-Evaluation Triplets: Four triplets per instance: $(q, r^{\text{orig}}, a^{\text{orig}})$ labeled Correct, $(q, r^{\text{orig}}, a^{\text{swap}})$ Incorrect, $(q, r^{\text{swap}}, a^{\text{orig}})$ Incorrect, and $(q, r^{\text{swap}}, a^{\text{swap}})$ labeled Correct. These form the basis for scoring reliability (see the construction sketch below).
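The construction stages above can be sketched as follows. This is a hypothetical implementation: `generate_candidate` and `ask_judge_model` stand in for the LLM calls (GPT-4o candidate generation and vanilla QA prompting of the judge) and are not the paper’s actual code:

```python
import random
from typing import Callable


def type_preserving_swap(r_orig: str, entity_type: str,
                         entity_pool: dict[str, list[str]]) -> str:
    """TP swap: draw a different entity of the same NER type from a pool."""
    options = [e for e in entity_pool[entity_type] if e != r_orig]
    return random.choice(options)


def evaluator_knowledge_swap(question: str,
                             ask_judge_model: Callable[[str], str]) -> str:
    """EK swap: the judge's own top answer under vanilla QA prompting becomes
    the reference, so the reference agrees with its parametric beliefs."""
    prompt = f"Answer with a short entity only.\nQ: {question}\nA:"
    return ask_judge_model(prompt).strip()


def make_meta_eval_triplets(q: str, r_orig: str, r_swap: str,
                            generate_candidate: Callable[[str, str], str]):
    """Build the four (question, reference, candidate, gold verdict) triplets."""
    a_orig = generate_candidate(q, r_orig)  # long-form answer asserting r_orig
    a_swap = generate_candidate(q, r_swap)  # long-form answer asserting r_swap
    return [
        (q, r_orig, a_orig, "Correct"),
        (q, r_orig, a_swap, "Incorrect"),
        (q, r_swap, a_orig, "Incorrect"),
        (q, r_swap, a_swap, "Correct"),
    ]
```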
3. Metrics and Reliability Measurement
Primary evaluation metrics are:
- Accuracy Under Reference Conditions:

$$\text{ACC}^{(t)} = \frac{1}{|\mathcal{T}_t|} \sum_{(q, r, a) \in \mathcal{T}_t} \delta(q, r, a),$$

where $t \in \{\text{orig}, \text{swap}\}$, $\mathcal{T}_t$ is the set of meta-evaluation triplets whose reference has polarity $t$, and $\text{swap}$ is the alternate reference type.
- Reference-Polarity Accuracy Gap (RPAG):

$$\text{RPAG} = \text{ACC}^{\text{orig}} - \text{ACC}^{\text{swap}}.$$

RPAG quantifies the degradation in judge reliability under swapped conditions.
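Given judge verdicts on the meta-evaluation triplets, both metrics reduce to a few lines. A minimal sketch, reusing the (question, reference, candidate, gold verdict) tuples from the construction sketch above:

```python
def accuracy(judge, triplets) -> float:
    """Mean alignment indicator over the triplets of one reference polarity."""
    hits = sum(judge(q, r, a) == gold for (q, r, a, gold) in triplets)
    return hits / len(triplets)


def rpag(judge, orig_triplets, swap_triplets) -> float:
    """Reference-Polarity Accuracy Gap: ACC^orig - ACC^swap.
    Positive values mean the judge is less reliable under swapped references."""
    return accuracy(judge, orig_triplets) - accuracy(judge, swap_triplets)
```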
Supplementary approaches sometimes include correlation with human scores, self-consistency variance, and adversarial reliability drop rates, though these are not the focus of the swapped-reference protocol.
4. Empirical Observations and Failure Modes
Comprehensive evaluation of thirteen LLM judge variants (GPT-4o, GPT-4.1, GPT-5, Qwen-2.5/3, Llama-3.1) reveals substantial accuracy drops upon swapping references. Notably, even GPT-5 suffers RPAGs of up to 12 points ($\text{ACC}^{\text{orig}}$ vs. $\text{ACC}^{\text{swap}}$) under type-changing swaps. Type-changing swaps induce larger reliability gaps than type-preserving swaps.
The magnitude of RPAG varies by dataset: scientific-fact QA (SciQ) yields the steepest reliability decline (up to −60 points), while entity-heavy datasets (PopQA) range from −13 to −49 points. Increasing model scale does not mitigate the failure; some model families become more vulnerable as they grow.
5. Root Causes: Parametric Knowledge Conflict
The core failure arises when the parametric knowledge stored in the judge model overrides explicit adherence to the provided swapped reference. If the swapped reference matches the judge’s own top prediction (Evaluator-Knowledge Swap), reliability is restored and RPAG collapses to near zero. However, when the reference conflicts with established internal beliefs, the judge disregards it and reverts to its parametric memory, irrespective of explicit instructions.
Pre-existing entity popularity aggravates the effect: high-popularity swaps (e.g., Queen Elizabeth II) trigger larger RPAGs. On datasets probing never-changing versus fast-changing facts (FreshQA), models override context more readily for canonical knowledge, maintaining higher RPAG for “static” facts; this suggests that reference override is modulated by memory strength.
A concrete illustration:
Question: “Where did the tea come from in the Boston Tea Party?”
- $r^{\text{orig}}$: England; candidate $a^{\text{orig}}$: “…came from England.”
- $r^{\text{swap}}$: Paris; candidate $a^{\text{swap}}$: “…came from Paris.”

GPT-4o returns “Incorrect” for the triplet $(q, r^{\text{swap}}, a^{\text{swap}})$ even though the candidate perfectly matches the swapped reference.
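In the notation above, the failing case corresponds to the triplet below; the gold verdict shows what a reference-faithful judge must return (an illustrative encoding, not the paper’s code):

```python
q = "Where did the tea come from in the Boston Tea Party?"
r_swap = "Paris"
a_swap = "…came from Paris."  # long-form candidate aligned with the swap

# Ground truth for this triplet is "Correct": the candidate matches the
# provided reference. GPT-4o's "Incorrect" verdict is the failure mode.
triplet = (q, r_swap, a_swap, "Correct")
```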
6. Mitigation Strategies and Limitations
Prompt-engineering approaches fail to close the RPAG: standard instructions, a direct “must trust the given gold target” instruction, chain-of-thought (CoT), self-consistency with majority voting, and hybrids all leave the gap open. Direct prompting partially reduces the gap (e.g., from 37 to 22 points on NQ-Open type-preserving swaps), but substantial unreliability persists. CoT prompts often exacerbate the effect, as the model rationalizes internally and ignores the explicit reference. Majority voting strengthens parametric bias rather than enforcing rule adherence.
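To make the strategies concrete, the variants can be rendered as prompt templates plus a majority-vote wrapper. The wording here is illustrative, not the paper’s exact prompts:

```python
from collections import Counter

STANDARD = (
    "You are grading a QA answer.\n"
    "Question: {q}\nGold answer: {r}\nCandidate: {a}\n"
    "Reply with Correct or Incorrect."
)
# Direct strategy: append an explicit adherence instruction.
DIRECT = STANDARD + (
    "\nYou must trust the given gold answer, even if it conflicts "
    "with your own knowledge."
)
# CoT strategy: elicit reasoning before the verdict.
COT = STANDARD.replace("Reply with", "Think step by step, then reply with")


def self_consistency(judge_once, q: str, r: str, a: str, n: int = 5) -> str:
    """Self-consistency: majority vote over n sampled verdicts. The paper finds
    this reinforces parametric bias rather than enforcing reference adherence."""
    votes = Counter(judge_once(q, r, a) for _ in range(n))
    return votes.most_common(1)[0][0]
```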
7. Proposed Advancements and Future Research
Recommendations for improving judge fidelity under swapped-reference settings include:
- Deploying constrained decoding or checkpoint tokens that prohibit factual assertions outside the context of the given reference.
- Implementing two-stage pipelines: explicit context verification followed by strict string-matching evaluation (see the sketch after this list).
- Flagging high-RPAG cases for hybrid human–LLM review, blending statistical screening with expert judgment.
- Extending diagnostic frameworks beyond QA to areas like summarization, fact verification, and dialogue evaluation.
- Fine-tuning judge models on synthetic swapped-reference corpora to harden “rule-following” even under reference-belief conflict conditions.
- Investigating attention and embedding patterns to elucidate mechanisms of context suppression during reference override.
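As an example of the second recommendation, a two-stage pipeline could look like the sketch below. `restate_reference` is an assumed LLM call, and the `"Flagged"` outcome is an assumption standing in for the hybrid human–LLM review route; neither is taken from the paper:

```python
import re


def _norm(s: str) -> str:
    """Lowercase and collapse non-word characters for robust containment checks."""
    return re.sub(r"\W+", " ", s).strip().lower()


def two_stage_verdict(question: str, reference: str, candidate: str,
                      restate_reference) -> str:
    # Stage 1: context verification. The judge must echo the reference it was
    # given; if the echo drifts, the item is flagged for hybrid review instead
    # of being scored.
    echoed = restate_reference(question, reference)
    if _norm(reference) not in _norm(echoed):
        return "Flagged"
    # Stage 2: strict string matching. A deterministic containment check
    # replaces free-form judging, so parametric knowledge cannot override
    # the provided reference.
    return "Correct" if _norm(reference) in _norm(candidate) else "Incorrect"
```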
A plausible implication is that robust automatic evaluation in reference-conditioned tasks requires architectures or protocols capable of absolute contextual adherence under adversarially conflicting scenarios, a property not naturally emergent from current LLM pretraining pipelines (Lee et al., 12 Jan 2026).