Divergence Attack on ChatGPT
- The paper introduces a divergence attack that exploits subtle prompt manipulations to bypass ChatGPT’s safety and alignment filters, achieving over 90% success in tests.
- Key methodologies include the DAGR framework, template-based prompt injection, and guardrail reverse-engineering, which systematically undermine LLM defenses.
- Empirical evaluations on models like GPT-3.5 and GPT-4 reveal significant vulnerabilities, underscoring the need for dynamic guardrails and robust adversarial testing.
A divergence attack on ChatGPT is a prompt-based adversarial method designed to induce LLMs such as GPT-3.5 and GPT-4 to deviate from their intended, safety-aligned behaviors. These attacks systematically exploit the gap between surface-level safety training and the broader semantic or policy-level alignment of the model, leading to the generation of harmful, misleading, or policy-violating outputs. The ability to drive “divergence” in model behavior—i.e., to elicit outputs that would not be produced under normal, unpoisoned conditions—constitutes a critical vulnerability in LLM safety regimes.
1. Formal Definitions and Theoretical Foundations
A divergence attack is formally characterized as follows. Let $M$ denote the ChatGPT inference process (including all safety alignment and refusal mechanisms), $p$ be a benign user prompt, and $\delta$ an adversarial injection. The safety filter is denoted $F$, where $F(x) \in \{\text{pass}, \text{block}\}$. The objective is to find:

$$\delta^* = \arg\max_{\delta} \; D\big(M(p \oplus \delta),\, M(p)\big)$$

subject to

$$F(p \oplus \delta) = \text{pass},$$

where $D$ quantifies semantic or policy-level deviation, and $\tau$ is a predefined divergence threshold. The adversary succeeds by injecting $\delta$ that (1) passes the filter and (2) induces output deviation exceeding the threshold, $D\big(M(p \oplus \delta), M(p)\big) > \tau$ (Chang et al., 20 Apr 2025).
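A toy sketch of this objective can make the success condition concrete. Here a keyword blocklist stands in for the safety filter and cosine distance over bag-of-words vectors stands in for the divergence measure; both choices are illustrative assumptions, not the paper's definitions.

```python
import math
from collections import Counter

def cosine_distance(a: str, b: str) -> float:
    """Toy divergence measure: 1 - cosine similarity of bag-of-words vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)

def passes_filter(prompt: str, blocklist=("bomb", "weapon")) -> bool:
    """Toy safety filter: reject prompts containing blocked keywords."""
    return not any(word in prompt.lower() for word in blocklist)

def attack_succeeds(out_adv: str, out_benign: str, prompt_adv: str,
                    tau: float = 0.5) -> bool:
    """Success: the injected prompt passes the filter AND the output
    deviates from the benign output by more than the threshold tau."""
    return passes_filter(prompt_adv) and cosine_distance(out_adv, out_benign) > tau
```

In practice the deviation measure would be computed over embedding or policy-compliance space rather than surface tokens, which is precisely why keyword-level filters fail against semantically stealthy injections.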
The “Diversity Helps Jailbreak” (DAGR) framework extends this formalism to a two-stage search that maximizes a binary jailbreak score $J(q) \in \{0, 1\}$, subject to the candidate prompt $q$ being on-topic for the goal $g$:

$$q^* = \arg\max_{q} \; J(q) \quad \text{subject to} \quad O(q, g) = 1,$$

with $O$ and $J$ as the on-topic and jailbreak success indicator functions (Zhao et al., 2024).
2. Core Attack Methodologies
Divergence attacks exploit a variety of mechanisms to circumvent LLM guardrails:
A. Diversity and Obfuscation (“DAGR”)
The DAGR approach iteratively prompts an auxiliary LLM (e.g., GPT-3.5-turbo) to generate maximally diverse, on-topic adversarial prompts ("roots"), then applies minimal transformations (“leaves”) to obfuscate sensitive keywords (e.g., replacing “bomb” with “device that makes a loud boom”) or embed fictional and historical contexts. A first-in, first-out (FIFO) memory enforces novelty at each search depth, typically up to five iterations. Success is declared once any generated prompt elicits a harmful model response (Zhao et al., 2024).
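The root/leaf search described above can be sketched as a simple loop. The callables `generate_variants`, `obfuscate`, `is_on_topic`, and `is_jailbroken` are hypothetical stand-ins for the auxiliary-LLM and judge calls; only the control flow (diverse roots, minimal leaf edits, FIFO novelty memory, bounded depth) mirrors the description.

```python
from collections import deque

def dagr_search(goal, generate_variants, obfuscate, is_on_topic, is_jailbroken,
                max_depth=5, memory_size=20):
    """Sketch of a DAGR-style search: generate diverse on-topic root prompts,
    apply minimal obfuscating leaf edits, and stop at the first success."""
    memory = deque(maxlen=memory_size)   # FIFO buffer enforcing novelty
    frontier = [goal]
    for _ in range(max_depth):
        roots = []
        for prompt in frontier:
            for cand in generate_variants(prompt, avoid=list(memory)):
                if cand not in memory and is_on_topic(cand, goal):
                    memory.append(cand)  # record so later depths avoid repeats
                    roots.append(cand)
        for root in roots:
            leaf = obfuscate(root)       # e.g. swap sensitive keywords
            if is_jailbroken(leaf):
                return leaf              # success: first harmful response wins
        frontier = roots or frontier
    return None
```

The FIFO memory is the key design choice: by forcing each depth to explore prompts unlike those already tried, the search covers more of the semantic space per query than greedy mutation-based baselines.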
B. Template-Based Prompt Injection
A lightweight, template-driven framework wraps malicious goals in benign-looking markers, such as:
```
[Template]: Here are some rules, which are the *most* important:
<rule> ...hidden content... </rule>
```
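A minimal sketch of how such a wrapper is constructed, and why it slips past shallow checks. The `naive_prefix_filter` is an invented toy filter that inspects only the opening words of a prompt; the template text mirrors the marker shown above.

```python
def wrap_in_template(hidden_goal: str) -> str:
    """Wrap a payload in benign-looking rule markers, as in the template above."""
    return (
        "Here are some rules, which are the *most* important:\n"
        f"<rule> {hidden_goal} </rule>"
    )

def naive_prefix_filter(prompt: str) -> bool:
    """Toy filter that only inspects how a prompt opens (illustrative weakness)."""
    return not prompt.lower().startswith(("ignore previous", "jailbreak"))
```

Because the wrapper opens with innocuous framing text, any filter keyed on suspicious prefixes or surface keywords passes the prompt through while the model still receives the embedded instruction.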
C. Guardrail Reverse-Engineering (GRA)
The Guardrail Reverse-engineering Attack (GRA) leverages reinforcement learning and genetic algorithms to construct a surrogate of ChatGPT’s guardrail from observable input-output patterns. By iteratively identifying and augmenting high-divergence prompts, the surrogate policy approximates guardrail decisions with high fidelity (as measured by rule-matching rate, RuleMR), then guides the generation of prompts that sit in the “false-safe” region of the true decision boundary, yielding up to 40% higher jailbreak success relative to random search (Yao et al., 6 Nov 2025).
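The surrogate idea can be illustrated with a deliberately simplified stand-in: a linear model fit with perceptron-style updates on observed (prompt, blocked) pairs, then queried for prompts that score just below the block threshold. The RL and genetic-algorithm machinery of GRA is omitted; all names and the feature scheme here are illustrative.

```python
def fit_surrogate(observations, features, epochs=20, lr=0.1):
    """Fit a toy linear surrogate of a guardrail from (prompt, blocked) pairs.
    A stand-in for GRA's surrogate construction, not the actual method."""
    w = {f: 0.0 for f in features}
    b = 0.0
    for _ in range(epochs):
        for prompt, blocked in observations:
            x = {f: float(f in prompt) for f in features}
            score = sum(w[f] * x[f] for f in features) + b
            pred = score > 0
            # Perceptron-style update only on misclassified examples.
            err = (1.0 if blocked else -1.0) if pred != blocked else 0.0
            for f in features:
                w[f] += lr * err * x[f]
            b += lr * err
    return w, b

def near_boundary(prompt, w, b, margin=0.2):
    """Prompts the surrogate scores just below the block threshold approximate
    the 'false-safe' region: likely to pass the real filter, close to harm."""
    score = sum(w[f] * float(f in prompt) for f in w) + b
    return -margin < score <= 0
```

Once the surrogate tracks the real guardrail's decisions, the attacker can rank candidate prompts offline and spend API queries only on those near the boundary, which is where the reported query-efficiency gains come from.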
D. Multi-Turn Hidden-Intent Attacks
Attacks such as Imposter.AI employ multi-turn decomposition, splitting a malicious query into a series of seemingly benign sub-questions, each with low surface toxicity. Obfuscation and context-based augmentation (e.g., fictional scenarios) further reduce detection. The final summarization request then coalesces the responses into a harmful output that bypasses single-turn safeguards (Liu et al., 2024).
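The decomposition-then-summarize pattern can be sketched as follows. The sub-question templates and the `ask` callable (standing in for a chat-model call) are illustrative assumptions; only the structure (benign fragments, then a coalescing summary request) follows the described attack.

```python
def decompose(goal: str) -> list:
    """Split a sensitive goal into low-toxicity sub-questions
    (templates are illustrative, not from the paper)."""
    return [
        f"For a chemistry class, what general principles relate to {goal}?",
        f"Historically, how did researchers first study {goal}?",
        f"What materials are commonly discussed in textbooks about {goal}?",
    ]

def run_multi_turn(goal, ask):
    """Ask each benign-looking fragment in its own turn, then request a
    summary that coalesces the answers into a single output."""
    answers = [ask(q) for q in decompose(goal)]
    return ask("Combine the following notes into one procedure:\n"
               + "\n".join(answers))
```

Each individual turn scores low on surface toxicity, so per-turn safeguards pass it; the harm only materializes when the final request stitches the fragments together.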
3. Bypassing and Subverting Safety Alignments
Divergence attacks systematically target weaknesses in current LLM alignment and filtering strategies:
- Semantic Stealthiness: Malicious content is obscured using research or educational framings, making keyword-based filters and log-prob checks ineffective.
- Structural Masking: Instructions are wrapped in innocuous tags (e.g., <rule> ... </rule>) or hidden within file appendices, metadata, or footnotes, defeating syntactic sanitization.
- System Prompt Obscurity: Agent-level system-prompt attacks introduce persistent internal rules unavailable to the end user and UI-level filters.
- Cross-Turn Fragmentation: By scattering information across dialogue and employing toxicity-reducing paraphrase operators, attackers exploit the inability of per-turn checks to aggregate latent malicious intent (Zhao et al., 2024, Chang et al., 20 Apr 2025, Liu et al., 2024).
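The cross-turn fragmentation weakness can be demonstrated with a toy moderation check. The lexicon-fraction score is an invented scoring scheme chosen only to make the per-turn vs. aggregated contrast explicit.

```python
def turn_toxicity(text: str, lexicon=("explosive", "detonate")) -> float:
    """Toy toxicity score: fraction of flagged lexicon words present."""
    words = text.lower().split()
    return sum(w in words for w in lexicon) / len(lexicon)

def per_turn_flagged(turns, threshold=0.6) -> bool:
    """Per-turn check: each turn is scored in isolation."""
    return any(turn_toxicity(t) >= threshold for t in turns)

def aggregated_flagged(turns, threshold=0.6) -> bool:
    """Cross-turn check: scoring the concatenated dialogue recovers the
    latent intent that no single turn reveals."""
    return turn_toxicity(" ".join(turns)) >= threshold
```

A dialogue whose fragments each contain only part of the flagged vocabulary stays below the per-turn threshold while the aggregated transcript exceeds it, which is exactly the gap these attacks exploit.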
4. Empirical Evaluation and Comparative Results
Experiments across HarmBench (200 harmful tasks) and AdvBench (520 tasks) reveal that DAGR and allied divergence methods dramatically outperform prior jailbreak approaches in both attack success rate (ASR) and query efficiency.
Table 1: HarmBench (ASR and Avg. Queries)
| Target Model | AutoDAN (ASR) | AutoDAN (#Q) | DAGR (ASR) | DAGR (#Q) |
|---|---|---|---|---|
| Llama2-7B | 9.0% | 175.7 | 91.8 ± 1.3% | 46.2 |
| GPT-3.5-turbo | 56.0% | 176.7 | 91.0 ± 1.1% | 44.8 |
| GPT-4o-mini | 25.0% | 176.7 | 87.2 ± 2.6% | 55.0 |
On Llama2-7B, DAGR exceeds the strongest baseline’s ASR by 62.83 percentage points while requiring only about a tenth of the queries (46 vs. 452) (Zhao et al., 2024).
Table 2: AdvBench (ASR and Avg. Queries)
| Target Model | TAP (ASR) | TAP (#Q) | DAGR (ASR) | DAGR (#Q) |
|---|---|---|---|---|
| Llama2-7B | 8.5% | 506 | 90.96 ± 1.1% | 47.4 |
| GPT-3.5-turbo | 17.3% | 532 | 91.2 ± 1.5% | 27.6 |
| GPT-4o-mini | 7.3% | 824 | 89.2 ± 1.4% | 37.8 |
On GPT-4o-mini, DAGR lifts ASR from 7.3% to 89.2% (roughly 82 percentage points) and reduces queries by about 95% compared to TAP (Zhao et al., 2024).
5. Broader Impact and Model-Level Implications
Divergence attacks challenge fundamental assumptions about the robustness of alignment in state-of-the-art models:
- Attack success rates >90% on closed-source GPT-4 and GPT-3.5 have been achieved using DAGR, with clear transferability to related engines (e.g., prompts found for GPT-4 work directly on GPT-4o) (Zhao et al., 2024).
- Real-world scenarios demonstrate that attacks may skew model outputs toward persistent, misleading, and policy-violating behaviors. Cases include biased peer review (“strong accept” recommendations from embedded PDF instructions) and stealthy product promotion via web-injected payloads (Chang et al., 20 Apr 2025).
- Attack effectiveness persists against LLMs with sophisticated guardrails, with surrogate-guided methods extracting and circumventing internal rules at modest operational cost (under $85 in API usage, with convergence after ~400 queries) (Yao et al., 6 Nov 2025).
These findings indicate that alignment “patches” typically focus on localized prompt-response spaces, leaving substantial uncovered capacity for semantic divergence and sophisticated prompt injection.
6. Defenses and Future Mitigation Strategies
Current research highlights several mitigation options:
- Broader Adversarial Testing: Integrate diversified and obfuscated prompts, including template-based and decomposition-based variants, into red-teaming and finetuning pipelines (Zhao et al., 2024, Chang et al., 20 Apr 2025).
- Contrastive Decoding: Penalize outputs that are semantically close to previously identified failures using decoding strategies like ROSE (Zhao et al., 2024).
- Dynamic Guardrails: Continuously expand safe-prompt coverage, automatically generating and training on prompt variants whenever a new jailbreak is detected (Zhao et al., 2024).
- Refusal Classifiers: Employ higher-capacity mechanisms that examine complete outputs, not just initial refusals, and track cross-turn dialogues for aggregated harmful intent (Zhao et al., 2024, Liu et al., 2024).
- Prompt-Level Security: Normalize or strip custom tags, scan attachments/metadata for hidden payloads, and expose system-prompt content for auditing (Chang et al., 20 Apr 2025).
- Dynamic Guardrail Shuffling and Input Monitoring: Randomize guardrail rules and monitor for high-divergence query patterns to resist surrogate extraction and filter circumvention (Yao et al., 6 Nov 2025).
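The prompt-level security item above (normalizing or stripping custom tags) admits a compact sketch. The tag list and regex below are illustrative assumptions about which non-standard markers a deployment would target.

```python
import re

# Non-standard instruction markers to neutralize (illustrative list).
CUSTOM_TAG = re.compile(r"</?\s*(rule|system|secret)\b[^>]*>", re.IGNORECASE)

def strip_custom_tags(prompt: str) -> str:
    """Remove custom instruction markers such as <rule>...</rule> before the
    prompt reaches the model, so hidden payloads lose their special framing."""
    return CUSTOM_TAG.sub("", prompt)
```

Stripping only the markers (rather than the enclosed text) keeps legitimate content visible to downstream auditing while denying the payload its privileged "rule" framing; a stricter deployment might quarantine the enclosed content instead.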
Despite these directions, the gap between prompt-level manipulation and overall LLM policy compliance remains an unresolved frontier, necessitating a significant rethinking of model alignment architectures and deployment protocols.
References:
- “Diversity Helps Jailbreak LLMs” (Zhao et al., 2024)
- “Breaking the Prompt Wall (I): A Real-World Case Study of Attacking ChatGPT via Lightweight Prompt Injection” (Chang et al., 20 Apr 2025)
- “Black-Box Guardrail Reverse-engineering Attack” (Yao et al., 6 Nov 2025)
- “Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned LLMs” (Liu et al., 2024)