Adversarial Attack on Large Language Models using Exponentiated Gradient Descent

Published 14 May 2025 in cs.LG, cs.CL, and cs.CR | (2505.09820v1)

Abstract: As LLMs are widely used, understanding them systematically is key to improving their safety and realizing their full potential. Although many models are aligned using techniques such as reinforcement learning from human feedback (RLHF), they are still vulnerable to jailbreaking attacks. Some of the existing adversarial attack methods search for discrete tokens that may jailbreak a target model while others try to optimize the continuous space represented by the tokens of the model's vocabulary. While techniques based on the discrete space may prove to be inefficient, optimization of continuous token embeddings requires projections to produce discrete tokens, which might render them ineffective. To fully utilize the constraints and the structures of the space, we develop an intrinsic optimization technique using exponentiated gradient descent with the Bregman projection method to ensure that the optimized one-hot encoding always stays within the probability simplex. We prove the convergence of the technique and implement an efficient algorithm that is effective in jailbreaking several widely used LLMs. We demonstrate the efficacy of the proposed technique using five open-source LLMs on four openly available datasets. The results show that the technique achieves a higher success rate with great efficiency compared to three other state-of-the-art jailbreaking techniques. The source code for our implementation is available at: https://github.com/sbamit/Exponentiated-Gradient-Descent-LLM-Attack

Abstract PDF Upgrade to Chat

Summary

Adversarial Attack on Large Language Models using Exponentiated Gradient Descent

The increasing deployment of Large Language Models (LLMs) in various applications necessitates a deeper understanding of their security vulnerabilities, particularly their susceptibility to adversarial attacks. Despite methods such as reinforcement learning from human feedback (RLHF) aimed at aligning these models, LLMs remain vulnerable to jailbreaking, where adversaries craft inputs to elicit unintended or harmful outputs. This paper explores the design and implementation of an advanced adversarial attack method rooted in exponentiated gradient descent, coupled with Bregman projection, to effectively jailbreak prominent LLMs.

Core Contributions

The primary innovation in this paper centers around the intrinsic optimization technique leveraging exponentiated gradient descent (EGD) with Bregman projection. By optimizing the relaxed one-hot encoding of a model's tokens, the method guarantees that the representation remains within the probability simplex without requiring extrinsic projection efforts. Unlike other adversarial attack methods that either solely rely on discrete token manipulation or project continuous token embeddings to discrete tokens, this approach directly optimizes the token probability distributions, ensuring efficiency and effectiveness.

The authors present a detailed convergence analysis of the EGD technique, establishing theoretical guarantees under conditions of Lipschitz continuous gradient functions. Companion efforts involve the use of the Adam optimizer, augmented with entropic regularization and KL-divergence terms, promoting sparse and effective token representations, ultimately bolstering the attack's precision and computational efficiency.

Key Findings

The empirical evaluation showcases the robustness of the proposed model against five open-source LLMs, namely Llama2, Falcon, MPT, Mistral, and Vicuna, across diverse datasets such as AdvBench, HarmBench, JailbreakBench, and MaliciousInstruct. Noteworthy results include:

Higher Success Rates: The proposed method demonstrates consistently superior attack success rates compared to existing methods, achieving up to 60% attack success on models like Mistral, with non-negligible success across lesser-performing models.
Efficient Execution: The runtime complexity depicted below benchmarks the computational efficiency of the attack method, substantially outperforming baseline approaches by providing faster convergence to adversarial encodings.

The findings suggest significant implications for both model developers and security researchers in considering the inherent vulnerabilities of LLMs to adversarial perturbations. The efficient optimization method underscores the importance of rigorous pre-deployment testing to enhance model resilience.

Implications and Future Work

The research underscores the necessity for refining alignment techniques in LLMs to mitigate adversarial risks effectively. The ability to jailbreak models using optimized adversarial suffixes highlights potential risks in deploying LLMs in sensitive environments, driving a need for advanced securing mechanisms.

Future work may explore the transferability and universality of adversarial attacks across different models and behaviors, posing challenges to both attackers and defenders in the realm of cybersecurity and applied AI. This investigation paves the way for exploring automated harm detection frameworks and the development of robust LLM architectures less susceptible to adversarial exploits.