PR-Attack: Coordinated Prompt-RAG Attacks on Retrieval-Augmented Generation in Large Language Models via Bilevel Optimization

Published 10 Apr 2025 in cs.CR and cs.AI | (2504.07717v3)

Abstract: LLMs have demonstrated remarkable performance across a wide range of applications, e.g., medical question-answering, mathematical sciences, and code generation. However, they also exhibit inherent limitations, such as outdated knowledge and susceptibility to hallucinations. Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm to address these issues, but it also introduces new vulnerabilities. Recent efforts have focused on the security of RAG-based LLMs, yet existing attack methods face three critical challenges: (1) their effectiveness declines sharply when only a limited number of poisoned texts can be injected into the knowledge database, (2) they lack sufficient stealth, as the attacks are often detectable by anomaly detection systems, which compromises their effectiveness, and (3) they rely on heuristic approaches to generate poisoned texts, lacking formal optimization frameworks and theoretic guarantees, which limits their effectiveness and applicability. To address these issues, we propose coordinated Prompt-RAG attack (PR-attack), a novel optimization-driven attack that introduces a small number of poisoned texts into the knowledge database while embedding a backdoor trigger within the prompt. When activated, the trigger causes the LLM to generate pre-designed responses to targeted queries, while maintaining normal behavior in other contexts. This ensures both high effectiveness and stealth. We formulate the attack generation process as a bilevel optimization problem leveraging a principled optimization framework to develop optimal poisoned texts and triggers. Extensive experiments across diverse LLMs and datasets demonstrate the effectiveness of PR-Attack, achieving a high attack success rate even with a limited number of poisoned texts and significantly improved stealth compared to existing methods.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a bilevel optimization framework that coordinates prompt injections and poisoned text to boost attack success and stealth in RAG systems.
It demonstrates robust performance across LLM architectures, achieving high attack success rates even with minimal poisoned text injections.
The framework maintains normal response accuracy during non-trigger states, highlighting significant implications for reinforcing AI security defenses.

Coordinated Prompt-RAG Attacks on LLMs

Introduction

The integration of Retrieval-Augmented Generation (RAG) with LLMs has become increasingly prevalent due to its potential to address the inherent limitations of LLMs, such as hallucinations and outdated information. The innovative mechanism of RAG combines LLMs with external retrieval systems, enhancing the capability to provide accurate and up-to-date responses. However, this integration introduces new security vulnerabilities that adversaries could exploit.

Concept and Implementation of PR-Attack

The paper introduces the coordinated Prompt-RAG (PR) attack framework, a novel approach that leverages a bilevel optimization process to enhance the attack effectiveness against RAG-based LLMs. The PR-attack paradigm stems from recognizing that existing methods suffer from inefficiencies when limited poisoned texts can be injected into the knowledge database. These methods also struggle with poor stealth due to detectable anomalies and rely on heuristic-based solutions that lack formal optimization techniques.

The PR-attack method formulates the attack as a bilevel optimization problem involving two critical components: the optimization of prompt injections and the generation of poisoned texts to be injected into the knowledge base. The proposed attack process is structured as follows:

Poisoned Text Injection: A limited number of strategically crafted poisoned texts are injected into the knowledge database.
Trigger Activation: A backdoor trigger is embedded within the prompt to initiate the LLM’s response deviation during sensitive periods without affecting normal operations.
Stealth Maintenance: The design ensures that during normal periods, the backdoor trigger remains inactive, preventing users from realizing the compromise.
Figure 1: Overview of the proposed PR-attack, illustrating the stealth mechanism employed during both sensitive and normal periods.

Bilevel Optimization Framework

The core of the PR-attack methodology is in the bilevel optimization framework. The upper-level optimization focuses on ensuring that when the backdoor trigger is activated, the LLM generates the maliciously intended target answer. Simultaneously, when the trigger is inactive, the LLM should produce the correct answer to maintain high stealth.

The lower-level optimization addresses the retrieval problem, where the task is to select the top-k most relevant texts from the poisoned knowledge base for each target query. The intricate design of this framework facilitates highly effective and stealthy attack mechanisms.

Experimental Evaluation

Extensive experiments demonstrate the superiority of the PR-attack across various datasets and LLM architectures, surpassing existing methods like PoisonedRAG and GGPP in terms of both attack success rate (ASR) and stealth. The PR-attack achieves formidable ASRs even when constrained to a minimal number of poisoned texts, showcasing its robust applicability and effectiveness.

Figure 2: Performance comparison of PR-attack against state-of-the-art methods, demonstrating elevated ASR across multiple LLMs and datasets.

The assessment also highlights that PR-attack maintains an ACC comparable to non-poisoned scenarios when the trigger remains inactive, further reinforcing the argument for its advanced stealth capabilities.

Figure 3: Illustration of PR-attack's stealth effectiveness, maintaining high ACC when the trigger is not activated across tested LLMs.

Sensitivity Analysis

In-depth analysis indicates that the performance of the PR-attack remains stable across variations in parameters such as the length of poisoned texts (b) and the number of tunable tokens in soft prompt (n). This robustness empowers the PR-attack with flexibility and reliability across changing configurations and deployment scenarios in real-world applications.

Figure 4: Examining the impact of variations in the length of poisoned text $b$ on PR-attack performance.

Figure 5: Exploring how changes in the number of tunable tokens $n$ influence the efficacy of PR-attack.

Conclusion

The introduction of the PR-attack method, with its novel bilevel optimization framework, enhances the execution of coordinated attacks on RAG-based LLMs, presenting an elevated threat level in adversarial settings. The framework’s capacity for indirect activation and deactivation contributes significantly to stealth, a key element for evading detection in deployed systems. As AI continues to integrate more deeply into diverse domains, the implications of these findings necessitate a concerted focus on developing more robust defenses against such sophisticated attack vectors. The PR-attack highlights critical vulnerabilities in RAG settings that require immediate attention from researchers and practitioners aiming to secure AI implementations against adversarial threats.

Markdown Report Issue