DUPE: Deduction under Perturbed Evidence
- Deduction under Perturbed Evidence (DUPE) is a framework that minimally perturbs evidence to reverse truth values, testing whether inference models reason from explicit input rather than from internal priors.
- It integrates Boolean logic, Dempster-Shafer theory, and belief-update mechanisms to quantify uncertainty in both natural language and numerical perturbations.
- Empirical results show models like GPT-4 suffer significant accuracy drops under perturbed conditions, highlighting the need for robust prompt strategies and advanced belief updates.
Deduction under Perturbed Evidence (DUPE) is a formal framework that addresses logical reasoning where evidence supporting a conclusion is deliberately distorted or minimally perturbed. It directly tests whether inference mechanisms—especially those in LLMs—can override their internal parameterized priors and reason only according to explicit, possibly incorrect, input evidence. DUPE is critical for evaluating systems in student simulation models, safety-critical reasoning, and settings where evidence may be unreliable, thus exposing fundamental limitations in current neural architectures’ ability to track and act upon false or perturbed premises.
1. Formal Definition and Core Principles
DUPE involves transforming a Boolean question-fact-answer triple (q, f, a), with a ∈ {YES, NO}, into a new triple (q, f′, a′) such that the answer is logically flipped:

a′ = ¬a, subject to d(f, f′) ≤ τ.

Here, d(·, ·) is the string edit distance, with threshold τ typically set to 2–3 word edits to preserve the original contextual structure. The deduction process under DUPE challenges inference modules to generate conclusions that are correct under f′, even and especially when those contradict global prior knowledge encoded in the model’s parameters (Sonkar et al., 2023).
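The edit-distance constraint can be checked mechanically. A minimal sketch (function names and the default threshold are illustrative, not from the paper), computing Levenshtein distance over words rather than characters:

```python
def word_edit_distance(f: str, f_prime: str) -> int:
    """Levenshtein distance computed over words rather than characters."""
    a, b = f.split(), f_prime.split()
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[-1]

def is_valid_dupe_edit(f: str, f_prime: str, tau: int = 3) -> bool:
    """A DUPE perturbation must stay within tau word edits of the original fact."""
    return word_edit_distance(f, f_prime) <= tau

# A single word substitution flips the numeric evidence while preserving context:
fact = "The Eiffel Tower is 330 metres tall"
perturbed = "The Eiffel Tower is 30 metres tall"
```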
In evidential frameworks such as Dempster-Shafer theory, DUPE can be modeled as a belief update in which perturbations correspond to increased interval-valued uncertainty for events and conditionals. The update is computed by reallocating basic probability mass assignments in a single pass, integrating both modus ponens and modus tollens reasoning and propagating evidence perturbations throughout the deductive structure (Ruspini, 2013).
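To make the belief-update idea concrete, here is a sketch of the standard Dempster rule of combination (not Ruspini's specific single-pass construction; the mass values are illustrative):

```python
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule of combination over frozenset focal elements.

    m1, m2 map frozensets (focal sets) to basic probability masses.
    Mass on conflicting (empty) intersections is normalised away.
    """
    combined, conflict = {}, 0.0
    for (A, p), (B, q) in product(m1.items(), m2.items()):
        inter = A & B
        if inter:
            combined[inter] = combined.get(inter, 0.0) + p * q
        else:
            conflict += p * q
    if conflict >= 1.0:
        raise ValueError("total conflict: evidence bodies are incompatible")
    return {A: p / (1.0 - conflict) for A, p in combined.items()}

# Frame {yes, no}; perturbed evidence shifts mass toward the opposite answer
# and onto the whole frame (interval uncertainty).
YES, NO = frozenset({"yes"}), frozenset({"no"})
FRAME = YES | NO
prior = {YES: 0.8, NO: 0.1, FRAME: 0.1}    # model's internal prior
evidence = {NO: 0.7, FRAME: 0.3}           # prompt-supplied (flipped) evidence
posterior = dempster_combine(prior, evidence)
```

Note that with these illustrative masses the posterior still favours YES (≈0.55 vs ≈0.39): the prior's mass outweighs the flipped evidence, a belief-level analogue of the prior-dominance failure DUPE probes in LLMs.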
2. Construction of DUPEd Datasets
The prototypical DUPEd dataset is constructed from AllenAI’s StrategyQA corpus. Original triples (q, f, a) are transformed by manual minimal edits to the evidence facts f, producing f′ with d(f, f′) ≤ τ so that the answer label flips to a′ = ¬a. For the DUPEd-StrategyQA benchmark:
- 325 examples were curated:
- 173 feature natural-language fact perturbations (semantics-preserving edits)
- 152 use mathematical or numerical fact perturbations
Such datasets maintain the overall context of the original question while inverting the truth value of the answer, isolating the model’s ability to reason accurately under manipulated evidence (Sonkar et al., 2023).
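The transformation can be sketched as follows. The record layout and helper names are illustrative (StrategyQA stores its fields differently); the example question is a well-known StrategyQA item:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Example:
    question: str
    facts: tuple   # evidence facts supporting the answer
    answer: bool   # True = YES, False = NO

def dupe(example: Example, fact_index: int, perturbed_fact: str) -> Example:
    """Swap one evidence fact for a minimally edited version and flip the label."""
    facts = list(example.facts)
    facts[fact_index] = perturbed_fact
    return replace(example, facts=tuple(facts), answer=not example.answer)

orig = Example(
    question="Could a llama birth twice during War in Vietnam (1945-46)?",
    facts=("The War in Vietnam (1945-46) lasted around 6 months.",
           "The gestation period for a llama is 11 months."),
    answer=False,
)
# A single-number edit inverts the deduction, so the label flips to YES:
duped = dupe(orig, 1, "The gestation period for a llama is 2 months.")
```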
3. Experimental Evaluation Methodology
Models evaluated include GPT-3.5 (gpt-3.5-turbo-0301) and GPT-4 (gpt-4-0314). Two key prompting regimes were used:
- P1 (QA model): “You are a question answering model. Reason on provided evidence to answer a YES or NO question.”
- P2 (Student simulation): “You are a student simulation model. Reason on a student’s (possibly incorrect) responses to predict their answer to a YES or NO question.”
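The two system prompts above can be assembled into chat-style requests. A minimal sketch (the message layout and user-turn wording are illustrative, not from the paper):

```python
P1 = ("You are a question answering model. Reason on provided evidence "
      "to answer a YES or NO question.")
P2 = ("You are a student simulation model. Reason on a student's (possibly "
      "incorrect) responses to predict their answer to a YES or NO question.")

def build_prompt(system: str, question: str, facts: list) -> list:
    """Assemble a chat-style message list: system role prompt + evidence + question."""
    evidence = "\n".join(f"- {f}" for f in facts)
    user = f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer YES or NO."
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```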
Evaluation protocol:
- Accuracy on original StrategyQA: acc_orig
- Accuracy on DUPEd-StrategyQA: acc_pert
- Accuracy drop: Δ_acc = acc_orig − acc_pert (absolute percentage points)
Results are also partitioned by type of perturbation (language vs. numeric) (Sonkar et al., 2023).
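These metrics are straightforward to compute; a sketch with hypothetical helper names, including the partition by perturbation type:

```python
def accuracy(preds, labels):
    """Percentage of exact matches between YES/NO predictions and gold labels."""
    return 100.0 * sum(p == y for p, y in zip(preds, labels)) / len(labels)

def evaluate(preds_orig, labels_orig, preds_pert, labels_pert, pert_types=None):
    """Report acc_orig, acc_pert, and the absolute drop in percentage points."""
    acc_orig = accuracy(preds_orig, labels_orig)
    acc_pert = accuracy(preds_pert, labels_pert)
    report = {"acc_orig": acc_orig, "acc_pert": acc_pert,
              "drop": acc_orig - acc_pert}
    if pert_types is not None:  # e.g. 'language' vs 'numeric' perturbations
        for t in set(pert_types):
            idx = [i for i, pt in enumerate(pert_types) if pt == t]
            report[f"acc_pert_{t}"] = accuracy(
                [preds_pert[i] for i in idx], [labels_pert[i] for i in idx])
    return report
```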
| Model | Prompt | acc_orig | acc_pert | Δ_acc |
|---|---|---|---|---|
| GPT-3.5 | P1 | 84.6% | 38.6% | 46.0 |
| GPT-4 | P1 | 91.9% | 46.7% | 45.2 |
| GPT-4 | P2 | 91.9% | 62.7% | 29.2 |
4. Findings and Model Analysis
Under the standard QA prompt (P1), both GPT-3.5 and GPT-4 showed pronounced failure to deduce under perturbed evidence, suffering absolute accuracy drops of roughly 45 percentage points. The student simulation prompt (P2) partially improved robustness (GPT-4 gained 16 points), but the accuracy drop on perturbed evidence remained high at 29.2 points.
A breakdown by perturbation type reveals that even advanced LLMs are less robust to language-based perturbations than to numerical ones: GPT-4 experienced a 39–50 point drop on mathematical perturbations versus 50–59 points on language-fact perturbations.
Root-cause analysis attributes these failures to the dominance of parameterized knowledge (model “priors”) over prompt-supplied evidence. Internal associations remain strong, as illustrated by research into transformer key–value memory layers and model editing interventions such as ROME. Overcoming these parameter priors to obey explicit, even minimally changed, input evidence constitutes a core challenge for current LLM architectures (Sonkar et al., 2023).
5. Mitigation Strategies and Belief-Oriented Approaches
Prompt engineering inspired by student simulation—explicitly signaling to the model that evidence may be inaccurate—yields partial mitigation. GPT-4’s DUPEd accuracy rose from 46.7% to 62.7% under this regime. However, no other mitigation strategies (e.g., chain-of-thought, few-shot) were tested by the original authors.
In contrast, principled belief-oriented update frameworks, such as Ruspini’s single-pass mass reallocation for Dempster-Shafer evidence fusion, offer a formal mechanism to reflect and propagate evidence perturbations:
- Each uncertain or perturbed probability is represented as an interval in the basic probability assignment (bpa).
- Conditional estimates are handled by corresponding conditional bpa’s.
- A supremum-of-infimum over partitions enforces the tightest consistent lower bound, ensuring both forward (modus ponens) and backward (modus tollens) propagation of uncertainty.
- No iterative recycling is required; the process is computationally tractable for small frames.
This approach natively supports DUPE regimes by modeling misinformation as increased interval uncertainty and propagating these effects tightly through the inference process (Ruspini, 2013).
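The interval-uncertainty view can be illustrated with belief and plausibility bounds computed from a basic probability assignment (a sketch of the standard Bel/Pl definitions, not Ruspini's supremum-infimum construction; the mass values are illustrative):

```python
def belief(m: dict, A: frozenset) -> float:
    """Lower bound Bel(A): total mass committed to subsets of A."""
    return sum(p for B, p in m.items() if B <= A)

def plausibility(m: dict, A: frozenset) -> float:
    """Upper bound Pl(A): total mass not contradicting A."""
    return sum(p for B, p in m.items() if B & A)

# Perturbation modeled as mass moved from the focal answer onto the whole frame:
YES, NO = frozenset({"yes"}), frozenset({"no"})
FRAME = YES | NO
clean = {YES: 0.9, FRAME: 0.1}       # confident evidence: [Bel, Pl] = [0.9, 1.0]
perturbed = {YES: 0.4, FRAME: 0.6}   # perturbed evidence widens the interval
```

Widening the [Bel, Pl] interval, rather than silently overwriting a point probability, is what lets the framework carry misinformation through the inference process explicitly.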
6. Implications for Student Simulation, Safety, and Future Work
The DUPE paradigm holds immediate significance for student simulation models in intelligent tutoring systems, where the goal is to accurately predict reasoning over student responses that may contain misconceptions or deliberate errors. Models incapable of overriding parameter priors to “believe incorrect evidence” will fail in these contexts.
From a broader perspective, over-reliance on pre-trained priors can yield unsafe behavior and alignment failures in real-world applications, especially when system inputs may contradict knowledge encoded during model development. Hallucination and brittleness arise when LLMs ignore user-supplied evidence.
A plausible implication is that advancing LLM architectures for DUPE robustness will require new mechanisms—prompt designs, training regimens (e.g., adversarially augmented datasets), or direct model editing—that shift inference authority from parameters to explicit input evidence.
7. Connections to Evidential Reasoning and Interval Uncertainty
Ruspini’s formulation for approximate deduction in evidential bodies gives a mathematical foundation for DUPE by directly accommodating perturbed or interval-valued evidence. Key features:
- Both “hard” evidence and conditionals can be modeled as intervals or belief functions.
- The update mechanism guarantees tight belief and plausibility values for queries under radical evidence perturbation.
- The dual-directional propagation (both modus ponens and modus tollens) is computed in a single supremum–infimum pass over focal sets and their partitions.
This formalism establishes a bridge for extending DUPE evaluation beyond LLMs to probabilistic and evidential AI systems, highlighting the need for inference engines that can reliably track and act on explicitly perturbed inputs without defaulting to stored knowledge (Ruspini, 2013).
In summary, Deduction under Perturbed Evidence reveals crucial limitations in current reasoning paradigms, both neural and probabilistic, and motivates further research into both principled belief-update mechanisms and model interventions as foundational elements for future robust inference systems.