Adversarial Attack on Large Language Models using Exponentiated Gradient Descent
The increasing deployment of Large Language Models (LLMs) in various applications necessitates a deeper understanding of their security vulnerabilities, particularly their susceptibility to adversarial attacks. Despite methods such as reinforcement learning from human feedback (RLHF) aimed at aligning these models, LLMs remain vulnerable to jailbreaking, where adversaries craft inputs to elicit unintended or harmful outputs. This paper explores the design and implementation of an advanced adversarial attack method rooted in exponentiated gradient descent, coupled with Bregman projection, to effectively jailbreak prominent LLMs.
Core Contributions
The primary innovation in this paper centers around the intrinsic optimization technique leveraging exponentiated gradient descent (EGD) with Bregman projection. By optimizing the relaxed one-hot encoding of a model's tokens, the method guarantees that the representation remains within the probability simplex without requiring extrinsic projection efforts. Unlike other adversarial attack methods that either solely rely on discrete token manipulation or project continuous token embeddings to discrete tokens, this approach directly optimizes the token probability distributions, ensuring efficiency and effectiveness.
The authors present a detailed convergence analysis of the EGD technique, establishing theoretical guarantees under conditions of Lipschitz continuous gradient functions. Companion efforts involve the use of the Adam optimizer, augmented with entropic regularization and KL-divergence terms, promoting sparse and effective token representations, ultimately bolstering the attack's precision and computational efficiency.
Key Findings
The empirical evaluation showcases the robustness of the proposed model against five open-source LLMs, namely Llama2, Falcon, MPT, Mistral, and Vicuna, across diverse datasets such as AdvBench, HarmBench, JailbreakBench, and MaliciousInstruct. Noteworthy results include:
- Higher Success Rates: The proposed method demonstrates consistently superior attack success rates compared to existing methods, achieving up to 60% attack success on models like Mistral, with non-negligible success across lesser-performing models.
- Efficient Execution: The runtime complexity depicted below benchmarks the computational efficiency of the attack method, substantially outperforming baseline approaches by providing faster convergence to adversarial encodings.
The findings suggest significant implications for both model developers and security researchers in considering the inherent vulnerabilities of LLMs to adversarial perturbations. The efficient optimization method underscores the importance of rigorous pre-deployment testing to enhance model resilience.
Implications and Future Work
The research underscores the necessity for refining alignment techniques in LLMs to mitigate adversarial risks effectively. The ability to jailbreak models using optimized adversarial suffixes highlights potential risks in deploying LLMs in sensitive environments, driving a need for advanced securing mechanisms.
Future work may explore the transferability and universality of adversarial attacks across different models and behaviors, posing challenges to both attackers and defenders in the realm of cybersecurity and applied AI. This investigation paves the way for exploring automated harm detection frameworks and the development of robust LLM architectures less susceptible to adversarial exploits.