Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Published 27 Jun 2025 in cs.CL, cs.AI, cs.CR, and cs.LG | (2506.21972v1)

Abstract: The advancement of Pre-Trained LLMs (PTLMs) and LLMs has led to their widespread adoption across diverse applications. Despite their success, these models remain vulnerable to attacks that exploit their inherent weaknesses to bypass safety measures. Two primary inference-phase threats are token-level and prompt-level jailbreaks. Token-level attacks embed adversarial sequences that transfer well to black-box models like GPT but leave detectable patterns and rely on gradient-based token optimization, whereas prompt-level attacks use semantically structured inputs to elicit harmful responses yet depend on iterative feedback that can be unreliable. To address the complementary limitations of these methods, we propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs. GCG + PAIR and the newly explored GCG + WordGame hybrids were evaluated across multiple Vicuna and Llama models. GCG + PAIR consistently raised attack-success rates over its constituent techniques on undefended models; for instance, on Llama-3, its Attack Success Rate (ASR) reached 91.6%, a substantial increase from PAIR's 58.4% baseline. Meanwhile, GCG + WordGame matched the raw performance of WordGame maintaining a high ASR of over 80% even under stricter evaluators like Mistral-Sorry-Bench. Crucially, both hybrids retained transferability and reliably pierced advanced defenses such as Gradient Cuff and JBShield, which fully blocked single-mode attacks. These findings expose previously unreported vulnerabilities in current safety stacks, highlight trade-offs between raw success and defensive robustness, and underscore the need for holistic safeguards against adaptive adversaries.

Abstract PDF Upgrade to Chat

Summary

The paper presents a novel hybrid approach that integrates token-level (GCG) and prompt-level (PAIR/WordGame) techniques to substantially improve jailbreak attack success rates.
The authors demonstrate that the GCG + PAIR method increases ASR from 58.4% to 91.6% on models like Llama-3 by combining gradient optimization with iterative prompt refinement.
The evaluation against advanced defenses like Gradient Cuff and JBShield highlights critical vulnerabilities in LLM safety measures, urging the development of more robust defenses.

Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Introduction

The paper presents advanced strategies for overcoming the vulnerabilities present in LLMs. It identifies two primary attack methods—token-level and prompt-level jailbreaks—employed to bypass existing safety mechanisms. Token-level attacks often rely on gradient-based optimization to embed adversarial sequences, while prompt-level attacks use semantically structured inputs. Both techniques have intrinsic limitations that this paper aims to overcome by proposing hybrid approaches integrating their strengths.

Hybrid Approaches

The authors introduce two hybrid strategies: GCG + PAIR and GCG + WordGame. These methods were designed to enhance attack success rates by combining gradient-based token optimization with semantic prompt engineering.

GCG + PAIR: This approach marries the Greedy Coordinate Gradient (GCG) algorithm with the Prompt Automatic Iterative Refinement (PAIR) method. GCG optimizes adversarial suffixes at a token level, while PAIR uses iterative feedback to refine prompts semantically. On models such as Llama-3, GCG + PAIR achieved an Attack Success Rate (ASR) of 91.6%, compared to 58.4% with PAIR alone.
Figure 1: The GCG+PAIR attack workflow for automated jailbreaking. The system uses a GCG-based suffix generator and a PAIR optimization loop, which leverages an attacker LLM and a judge LLM (Llama-Guard) to iteratively craft an adversarial prompt that bypasses a target LLM's safety filters.
GCG + WordGame: This method incorporates gamification through masking strategies in tandem with GCG, maintaining high ASR even under state-of-the-art defenses.
Figure 2: The Workflow of WordGame + GCG. First, an LLM extracts malicious words, creating a masked prompt and corresponding hints. This game is then combined with the original attack goal to form a WordGame Attack Prompt. A GCG Suffix Generator optimizes an adversarial suffix specifically for this game-like prompt.

Evaluation and Results

Performance on LLMs

The hybrids were benchmarked on open-source models like Vicuna-7B and Llama models using the SorryBench dataset, designed to evaluate safety mechanisms in LLMs. The dataset includes a diverse set of prompts intended to elicit harmful responses. Both hybrids outperformed established attack methods, such as PAIR and GCG, when tested on models like LLaMA and GPT-3.

Defense Mechanisms

The paper evaluates the new hybrid methods against two advanced defenses: Gradient Cuff and JBShield. Remarkably, these hybrids can breach defenses that entirely block either token-level or prompt-level attacks when deployed in isolation. For instance, GCG + PAIR achieved a jump from 58.4% to 91.6% ASR against Llama-3, indicating a significant improvement and uncovering shortfalls in existing defenses.

Figure 3: The framework of the GCG + PAIR attack showcases an integrated method of attack generation, execution, and evaluation.

Practical and Theoretical Implications

The proposed hybrid strategies highlight the necessity for multi-faceted defense mechanisms against LLM attacks. The ability of GCG + PAIR and GCG + WordGame to pierce through modern state-of-the-art defenses exposes vulnerabilities and implies that these models might not yet be robust enough for secure deployment in sensitive settings such as healthcare and finance.

Future Outlook

The research presents a case for developing advanced defense mechanisms against these novel hybrid attacks. Potential future directions include:

Evaluating established defenses against hybrid attack strategies.
Extending studies to include closed-source models.
Building more sophisticated defense systems utilizing ensemble methods to detect multi-layered attack vectors.
Development of an automated framework for creating and testing adversarial jailbreaks.

Conclusion

This work contributes comprehensive strategies by integrating token-level and prompt-level jailbreak approaches to exploit the susceptibilities in LLM-based safety measures. The high success rates of these hybrids indicate inadequacies in modern defensive arrangements against adaptive adversaries, urging researchers and practitioners to consider the implications and build comprehensive and adaptive security architectures.

Markdown