Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

Published 8 Nov 2025 in cs.CR, cs.AI, and cs.CL | (2511.05919v2)

Abstract: LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.

Abstract PDF Upgrade to Chat

Summary

The paper reveals that adversarial man-in-the-middle attacks degrade LLM factual recall by injecting misleading instructions and noise into input queries.
The paper employs a Random Forest-based defense mechanism using uncertainty metrics, such as entropy and perplexity, to distinguish manipulated query responses.
The paper's experiments across varying LLM sizes and datasets highlight that larger models like GPT-4 are more prone to these attacks, emphasizing the need for robust safeguards.

Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

Introduction

The paper "Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs" (2511.05919) investigates the vulnerability of LLMs to adversarial attacks in information retrieval applications. Given the regulatory emphasis on AI system robustness, the analysis of LLMs as intermediaries in information delivery becomes critical, particularly in light of possible adversarial manipulations by man-in-the-middle (MitM) attacks. This study introduces a theory-grounded MitM framework, highlighting LLM susceptibility to instruction-based adversarial attacks, and suggests a simple Random Forest-based defense mechanism leveraging uncertainty metrics.

Methodology

Three prominent attacks, $\alpha$ , $\beta$ , and $\gamma$ , form the basis of the proposed $. These attacks manipulate user queries before processing by LLMs, targeting factual mitigation in closed-book settings. MitM attacks are manifested by perturbing factual contexts with either zero-shot instruction additions or noise, absent direct interference with LLM inner workings.</p> <p><strong>Attack$ \alpha $:</strong></p> <p>An instruction-based attack appends misleading directives to user queries, significantly decreasing accuracy by leading LLMs to provide incorrect answers.</p> <p><strong>Attack$ \beta and $\gamma$ :

These fact-aware attacks involve the insertion of semantically or syntactically incorrect or irrelevant factual contexts, relying on manipulating historical factual memory. Additionally, they exploit internal fact storage and generation discrepancies (Figures 5 and 6).

Figure 1: Differences in uncertainty between correct and incorrect answers against $attacks.</p></p> <h3 class='paper-heading' id='experimental-evaluation'>Experimental Evaluation</h3> <p>Comprehensive evaluation across various LLM paradigms and datasets, including TriviaQA, HotpotQA, and Natural Questions, demonstrates accuracy degradation. Larger LLMs like <a href="https://www.emergentmind.com/topics/llm-judge-gpt-4o" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">GPT-4o</a> are substantially impacted by$ \alpha $due to a predisposition for instruction adherence. In contrast, smaller models exhibit variability in attack resilience, underscoring size and methodological influence on attack susceptibility (Figure 2). <img src="https://emergentmind-storage-cdn-c7atfsgud9cecchk.z01.azurefd.net/paper-images/2511-05919/aucroc_all_models.png" alt="Figure 2" title="" class="markdown-image" loading="lazy"> <p class="figure-caption">Figure 2: AUC-ROC performance for all classifiers on the uncertainty levels of responses from different LLMs.</p></p> <h3 class='paper-heading' id='discussion'>Discussion</h3> <p>Nuanced differences in uncertainty metrics, notably entropy and perplexity, underscore discernible discrepancies between correct and incorrect response generation. By training classifiers on these uncertainty profiles, end users can be alerted of possible query interception and alteration. Yet, substantial granularities persist in detection accuracy across models, with particular difficulty in collectively detecting$ \alpha $,$ \beta $, and$ without differentiated vulnerability metrics.

Conclusion

This study outlines the latent vulnerabilities of LLMs to MitM adversarial attacks, emphasizing the necessity for augmented cybersecurity measures within AI information retrieval applications. Future research should explore sophisticated adversarial methods and defense mechanisms beyond basic uncertainty alerts, ensuring broader applicability and scalability in real-world settings. Ensuring the integrity of AI systems against adversarial threats remains vital for maintaining user trust and system accountability in high-risk domains.