Lost in the Noise: How Reasoning Models Fail with Contextual Distractors

Published 12 Jan 2026 in cs.AI and cs.CL | (2601.07226v1)

Abstract: Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. However, this shift introduces input contexts that are inherently noisy, a reality that current sanitized benchmarks fail to capture. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets in RAG, reasoning, alignment, and tool-use tasks against diverse noise types, including random documents, irrelevant chat histories, and hard negative distractors. Our evaluation reveals a catastrophic performance drop of up to 80% in state-of-the-art models when faced with contextual distractors. Crucially, we find that agentic workflows often amplify these errors by over-trusting noisy tool outputs, and distractors can trigger emergent misalignment even without adversarial intent. We find that prompting, context engineering, SFT, and outcome-reward only RL fail to ensure robustness; in contrast, our proposed Rationale-Aware Reward (RARE) significantly strengthens resilience by incentivizing the identification of helpful information within noise. Finally, we uncover an inverse scaling trend where increased test-time computation leads to worse performance in noisy settings and demonstrate via attention visualization that models disproportionately focus on distractor tokens, providing vital insights for building the next generation of robust, reasoning-capable agents.

Abstract PDF Upgrade to Chat

Summary

The paper reveals that reasoning models suffer up to 80% performance drop in noisy contexts due to distractors.
It introduces NoisyBench and a Rationale-Aware Reward mechanism, achieving up to 306% robustness improvement under severe distractor conditions.
Findings expose that current methods like prompting, SFT, and standard RL are inadequate for filtering distractors in agentic workflows.

Robustness Failures of Reasoning Models under Contextual Distractors

Overview and Motivation

The paper "Lost in the Noise: How Reasoning Models Fail with Contextual Distractors" (2601.07226) presents an extensive empirical and analytical study of the vulnerability of large reasoning models (LRMs) and agentic AI systems to varied forms of context noise and distractors. Motivated by the disconnect between sanitized benchmark evaluations and the realities of deployment environments—where models interact with noisy, uncurated external sources—the authors introduce NoisyBench, a benchmark suite designed to quantitatively measure model robustness across RAG, reasoning, alignment, and tool-use tasks under four distinct distractor types. Strong claims emerge: existing state-of-the-art models suffer dramatic performance degradation, up to 80% accuracy loss, when exposed to realistic distractors, with agentic workflows and current context engineering protocols amplifying these failures rather than mitigating them.

Figure 1: NoisyBench contrasts performance on clean inputs versus noisy contexts, revealing substantial accuracy drops and systematic failures in reasoning, alignment, and robust retrieval.

Experimental Framework: NoisyBench Construction and Distractor Typology

NoisyBench is constructed to directly probe the robustness of models to noise. It spans eleven datasets across four categories: RAG (SealQA, MultihopRAG, Musique), reasoning (BBEH-Mini, AIME25, GPQA-Diamond), alignment (Self-Awareness, Survival-Instinct, BBQ), and tool usage (TauBench). For each sample, four context variants are generated:

ND: No distractor (standard benchmark, control)
RD: Random document distractor (simulating retrieval errors)
RC: Random chat history (multi-turn conversational noise)
HN: Synthetic hard negative distractor (plausibly related but orthogonal)

Distractors are rigorously filtered to prevent leakage or trivialization, ensuring the correct answer remains invariant and no direct evidence is present within the distractor text.

Catastrophic Failures Induced by Distractors

Comprehensive evaluation reveals average performance falls of 9%–80% across diverse models (from Gemini-2.5-Pro down to Qwen3-4B). Hard negative distractors consistently induce the strongest interference. Importantly, emergent misalignment arises not from adversarial attacks, but from random distractors: alignment accuracy on BBQ falls from 94.0% (ND) to 60.5% (HN) for Gemini-2.5-Pro. Lesser models suffer even more pronounced drops, indicating that high clean accuracy does not correlate with robustness to noise.

Agentic workflows—multi-step tool-augmented reasoning with external calls—yield increased error rates under noise. They exacerbate compounding of misleading trajectories, originating from the model’s over-trust in contaminated tool outputs and contextual evidence.

Figure 2: Agentic workflows that leverage multi-turn tool calls enhance performance in sanitized conditions yet degrade under distractor-rich environments.

Limitations of Prompting, Context Engineering, SFT, and RL

Attempts to harden models—prompting (including CiC), context engineering (GEPA, DC, ACE), SFT on NoisyInstruct, and standard RL with outcome-based reward—do not yield meaningful robustness gains. Prompting and context engineering methods fail to suppress distractors; SFT induces catastrophic forgetting, reducing even baseline noise resistance. RL with strictly outcome-reward shows marginal improvement, constrained by its granularity: it cannot distinguish between successful reasoning grounded in context versus answers generated by memorized parameters or spurious reward events.

Figure 3: Context engineering variants deliver negligible improvements under noise, demonstrating fundamental limitations of current context adaptation protocols.

Rationale-Aware Reward: Directly Incentivizing Distractor Filtering

To overcome granularity limitations in RL, the paper introduces the Rationale-Aware Reward (RARE) mechanism. RARE grants fine-grained rewards when the model identifies and utilizes helpful information from within the noisy context, as defined by chain-of-thought spans that overlap with gold references. Empirically, RL+RARE outperforms RL with outcome-only rewards—delivering up to 306% improvement over baseline in severe distractor-dominated scenarios and facilitating explicit suppression of context-induced confusion.

Figure 4: RL with RARE iteratively reduces distracted reasoning chains and maximizes final answer accuracy, surpassing outcome-only RL.

Analytical Insights: Scaling Laws, Attention, and Reasoning Dynamics

A set of targeted analyses elucidates why reasoning models fail under distractors:

Similarity Scaling: As semantic similarity between question and distractor increases, accuracy monotonically decreases, and reasoning token usage increases, signaling confusion not explained by simple context-length effects.
Figure 5: Higher distractor similarity consistently correlates with lower accuracy and increased token usage, implicating active confusion rather than just haystack effects.
Output Dynamics: Length of generated reasoning chains is only weakly tied to distractor input length; increases are driven by distractor similarity, not input length itself.
Figure 6: Output length displays minimal correlation with distractor length; complexity arises from distractor-induced confusion rather than longer context alone.
Confidence and Entropy: Presence and quantity of distractors elevate entropy of model outputs, signifying lowered confidence and increased uncertainty.
Figure 7: Generation entropy increases with additional distractors, indicative of growing uncertainty in model prediction.
Attention Analysis: Incorrect answers are characterized by inordinate focus on distractor tokens; attention maps substantiate the failure of current attention mechanisms to suppress irrelevant input.
Figure 8: Erroneous outputs assign substantially more attention to distractors, demonstrating the detrimental impact of noisy context on token-level reasoning.

Model Scaling, Mixtures, Transferability, and Safety Implications

Follow-on analyses show that scaling model size improves robustness only sublinearly; increasing parameter count offers limited additive protection. Mixing distractor types (RD+RC+HN) further aggravates performance collapse, even when input length is normalized. Models trained with RARE not only gain robustness under noisy settings but also transfer improvements to clean contexts, indicating generalizable resilience. Intriguingly, the addition of distractors sometimes improves safety/jailbreak detection by amplifying adversarial signals, while refusal rates on harmless inputs remain stable or fall, contradicting assumptions that context noise systematically undermines safety.

Figure 9: Safety refusal rates under distractor-rich settings demonstrate that noise does not universally weaken guardrails and may even fortify jailbreak detection.

Theoretical and Practical Implications

The findings contradict prevailing assumptions propagated by clean benchmarks: strong performance in curated conditions is not predictive of robustness to realistic context noise. The inverse scaling law under noisy conditions indicates that test-time computation and longer reasoning chains typically degrade performance instead of enhancing it. Failure modes—reasoning confusion, attention misallocation, and emergent misalignment—demand new architectural, pretraining, and RL paradigms that reward grounded, source-aware reasoning and explicitly suppress distractors.

Practical implications center on the necessity of robust distractor handling for agentic systems in domains such as healthcare, finance, and legal automation. Integration of rationale-aware feedback mechanisms, improved context curation, and adversarial filtering protocols become critical for deployment. This work also highlights transferability issues, suggesting robust training protocols generalize to both noisy and clean environments.

Future Directions

Robustness to multimodal distractors, generalization in open-ended tool-use, meta-reasoning for context quality estimation, and dynamic attention gating mechanisms constitute promising research avenues. The impact of distractors on emergent behaviors and the safety/jailbreak axis requires further scrutiny, especially as agentic systems evolve toward greater autonomy and memory capacity.

Conclusion

This paper exposes fundamental weaknesses in the reasoning and alignment capabilities of current LRMs and agentic systems under the omnipresent condition of contextual noise. The introduction of NoisyBench and RARE establishes critical new standards for benchmarking and reward engineering. The demonstrated limitations of prompting, SFT, and context engineering—contrasted with the effectiveness of rationale-aware RL—mandate explicit distractor filtering and process-level supervision for future robust agentic AI architectures.